Sunday 15 October 2017

Disaster with disaster recovery

With our migration yesterday going well, it was supposed to be a quiet day today. David and I started work (remotely this time) around 08:00, as we wanted to knock off the remaining tasks quickly so that we had the rest of the day to ourselves. Silly us, it was never going to be that easy!

Turned out we had a whole world of pain waiting for us when we logged in. Something had changed overnight that was locking out a critical component which was then breaking everything else. Marvellous, bloody marvellous. Nobody owned up to changing anything but it was clear that something was now VERY broken.

We resolved our initial problem by fixing something that should have been set up correctly in the first place, which made us pretty grumpy. On the plus side, at least the last of the installation tasks was now completed. That just left the lockout problem, which dragged on throughout the day. At 15:00 we had more people helping to investigate but we were still far from a fix! In fact it was looking like we might be facing a US disaster as well as a partial EU one.

By 18:00 we had a rough handle on what had happened, so we started to take remedial action, but it was slow! Come 19:30 it looked like we had our fingers on the culprit. It was a really painful process getting everything up and running, but slowly but surely light appeared at the end of the tunnel! By the time we rolled past 20:30 I had been running support for over twelve hours! Thankfully by 20:50 we were all done and everything seemed to be stable again! That was an experience I don't want to have again!

