27 June 2005 @ 08:24 pm
Chasing 9s: modeling and measuring availability of J2EE applications  

How do you make it 5 9s available?

Sun published a white paper on this topic.

Availability: MTTF / (MTTF + MTTR)
MTTF: mean time to failure
MTTR: mean time to recover
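
A minimal sketch of the formula above in Python. The function name and the example numbers (one failure a week, 90 seconds to recover) are my own illustration, not figures from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical example: a failure every 168 hours, 90 seconds to recover.
example = availability(168.0, 90 / 3600)
print(f"{example:.6f}")
```

Both arguments must use the same unit; hours are used here only for convenience.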

Failover to another app server can count as recovery if and only if all state can be recovered.

Get to 24x7 stability before testing for 5 9s.

Inject artificial failures to observe recovery modes.

App server failure rate: 52/year
OS failure rate: 1/year
Hardware failure rate: 1/year
failover recovery: 5 seconds
short restart: 90 sec
long restart: 1 hour
restore time: 30 minutes
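
A rough sketch of how numbers like these combine into a yearly downtime estimate. My notes do not record which recovery time goes with which failure type, so the pairing below is an assumption, and the simple sum ignores overlapping failures. It is not the model from the white paper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def annual_downtime_seconds(failure_modes):
    """Sum failures_per_year * recovery_seconds over all failure modes."""
    return sum(rate * recovery for rate, recovery in failure_modes)


# ASSUMED pairing of failure rates with recovery times (not from the talk).
modes = [
    (52, 5),     # app server failure, recovered by failover
    (1, 90),     # OS failure, recovered by a short restart
    (1, 3600),   # hardware failure, recovered by a long restart
]

downtime = annual_downtime_seconds(modes)
estimated_availability = 1 - downtime / SECONDS_PER_YEAR
print(downtime, f"{estimated_availability:.6f}")
```

Under this pairing the single hour-long restart dominates the total, which is why the rare, slow recoveries matter more than the frequent, fast ones.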

Test under load, at 60-70 percent CPU utilization.

Markov model.

Calculating 5 9s requires a reasonable model of failure modes, frequencies, and recovery times.

5 9s: about 5 minutes of downtime/year
4 9s: about 1 hour/year
3 9s: about 9 hours/year
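
The downtime budgets above follow directly from the number of 9s. A small sketch, assuming a 365-day year; the function name is my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` 9s."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(n, round(downtime_budget_minutes(n), 1))
```

The exact figures are a little under the rounded ones in my notes: roughly 5.3 minutes, 52.6 minutes, and 8.8 hours.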

Sun tried to implement an API for rolling upgrades with DB schema changes, but gave up. They recommend a "shadow cluster" solution and admit it is ugly. Will detail better when I am not working from a cellphone.

Details, as promised:
  1. Set up physically separate "old" and "new" data sources. Migrate existing data to the new data source, and set up some sort of marker so that data added or modified in the "old" data source can be migrated later.
  2. Set up a second cluster of application servers, pointed at the new data source. Give these the new application code, and make sure they work.
  3. Activate the new application server cluster in a sufficiently smart load balancer.
  4. Have the load balancer "gracefully" shut down the existing application server cluster, giving a reasonable limit for existing sessions.
  5. Shut down the old application servers when they aren't being used any more.
  6. Migrate any new or modified data in the "old" data source into the "new" data source.
  7. Set aside the formerly-production hardware. You could add it into the production pools with updated code, or use it as a QA/staging environment until the next such rollout. In this way, you would have TWO production-grade environments, which trade places with each schema-incompatible rollout.
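
The "marker" in steps 1 and 6 could be as simple as a modification timestamp. Here is a minimal sketch using Python's built-in sqlite3 with two in-memory databases. The table, the column names, and the `modified_at` marker are all hypothetical, and this is not how Sun described it. It also does not handle deletes.

```python
import sqlite3


def migrate_changes(old, new, since):
    """Copy rows modified after `since` from old to new; return the new marker."""
    rows = old.execute(
        "SELECT id, name, modified_at FROM customers WHERE modified_at > ?",
        (since,),
    ).fetchall()
    for row_id, name, modified_at in rows:
        # The new schema renames `name` to `full_name` (hypothetical change).
        new.execute(
            "INSERT OR REPLACE INTO customers (id, full_name) VALUES (?, ?)",
            (row_id, name),
        )
        since = max(since, modified_at)
    new.commit()
    return since


old = sqlite3.connect(":memory:")
new = sqlite3.connect(":memory:")
old.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, modified_at INTEGER)"
)
new.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT)")

# Step 1: initial migration of existing data.
old.execute("INSERT INTO customers VALUES (1, 'Ada', 100)")
mark = migrate_changes(old, new, since=0)

# Steps 2-5: the old cluster keeps writing while the new one is brought up.
old.execute("UPDATE customers SET name = 'Ada L.', modified_at = 150 WHERE id = 1")
old.execute("INSERT INTO customers VALUES (2, 'Grace', 160)")

# Step 6: catch-up migration of anything added or modified since the marker.
mark = migrate_changes(old, new, since=mark)
result = new.execute("SELECT id, full_name FROM customers ORDER BY id").fetchall()
print(result, mark)
```

The catch-up pass only works if every write to the old data source updates the marker, so in practice that would have to be enforced by the application or a trigger.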
Ryansynx on June 29th, 2005 05:23 am (UTC)
how i've done this
Forward and backwards compatible changes. No single 'flag day' without proven path forward.

How to handle database changes:
- new tables: no problem
- new columns: must be nullable.