Planned System Downtime

I have some thoughts on planned downtime.

Ultimately it boils down to this: is the planned downtime caused by urgent, impending, and guaranteed doom? If yes, then by all means have the downtime whenever you can. If not, affect as few customers as possible, by the numbers.

Planned system downtime should minimize impact to system usage - period. By "System Usage" I mean real customers using the system.

I recently had a discussion with the systems group about a necessary outage on a system I support. Something important but not urgent had the potential to cause downtime on the system: if something went wrong and we had to restart a server, the server might not come up.

This was deemed as needing pretty serious attention (which I agreed with). They found a fix and wanted to put it into production, as an emergency, planned downtime. We had a downtime window coming up, but this should probably happen before the window. They suggested an evening.

Google Analytics showed that the evening still had a few hundred users on the system, even late. I suggested the morning, when we might affect only a couple dozen. I was told this is not a "reasonable" time. I even got some flak for having the only system where reasonable downtime windows cannot be scheduled.

It seems to me they were looking for what was more convenient for them.

Now, that is a bold statement which is full of bias and doesn't represent them well - and that's not fair. They are operating under their real and genuine constraints and their real and genuine pressures. They have lots of other systems which don't operate like this one and need to put this one in the same box as the others so they can bring order to the chaos of their world. I get that. I have to do that myself.

But in this case, it's not right.

This system is our portal. It is the place where everyone goes to start their official business with us. Sure, it gets heavy usage during working hours, but it is also where college students do work - and they don't abide by working hours.

The Google Analytics data showed that a few hundred customers use the portal in the evening and only a few dozen use it during the mornings before work.

Based on past experience, the server guys said we could ask our direct customers if there was anything serious going on. Then we could make an appropriate announcement so everyone would know about the planned outage.

My thought is that any time a customer is using the system is a critical time - for them.

Sure, we have notification mechanisms. We can alert all of our direct customers about the outage, so they can pass the message to their users and stakeholders - although that never happens. We can alert the help desks, so they know it is happening - assuming the affected customers report the issue. I learned something about this implementation: most people don't read the communications from your communication plans. They probably get a general idea of what is happening, not the details.

And sure, we will affect some customers one way or the other; that's just the reality. However, one customer's critical experience times thirty is far less than times three hundred. We know the numbers.
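
To make "by the numbers" concrete, here is a minimal Python sketch of the comparison. The counts are illustrative round figures, not actual Google Analytics exports: 300 and 30 simply echo the "times three hundred" versus "times thirty" comparison above, and the window names and the `least_impact_window` helper are made up for the example.

```python
# Illustrative figures only: stand-ins for "a few hundred" evening users
# and "a few dozen" morning users, not real analytics data.
candidate_windows = {
    "evening": 300,
    "early morning": 30,
}


def least_impact_window(active_users_by_window):
    """Return the name of the window with the fewest active users."""
    return min(active_users_by_window, key=active_users_by_window.get)


best = least_impact_window(candidate_windows)
ratio = candidate_windows["evening"] / candidate_windows[best]

print(f"Pick the {best} window: {candidate_windows[best]} users affected, "
      f"{ratio:.0f}x fewer than the evening.")
```

With real per-hour counts in place of the two made-up entries, the same comparison would work across every candidate window, not just these two.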

I think I am going to win this time, but lose the war. Culture is hard to change.
