Friday, September 2, 2011

Let's talk about Adaptive Disaster Recovery

Coming into Thursday, I had a full load of appointments to run some updates and upgrades. Then came the dreaded emergency call: the 7:48am call and voicemail that abruptly woke my wife and me up. Checking it out, the call was from one of my clients in Albany (a 2.3-3.5 hour drive).

Add Murphy's Law, with my Internet being down, and it seemed like it was not going to be a good day. Luckily, the disaster turned out to be less of a disaster once we executed an actual disaster recovery plan.

Though the disaster recovery took 3 hours (most of my recoveries run from 15 minutes to 5 hours, depending on the type of backup solution, the type of disaster, and the amount of data that needs to be restored), we had only a partial recovery. Well, it was more like 99.999%, because the software corrupted one data file and the last timestamp on the CDP (continuous data protection) backup for it was 11:00am (so we lost all the data changes on that file from 11:00am to 4:00pm). Talk about grief over all the data entry changes that were lost for that file (it is one of the most heavily used files). Secondly, why were there no other updates or changes?
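To put numbers on that loss window, here is a minimal Python sketch. It picks the newest restore point at or before the time of failure and reports how much work falls in the gap. The function names, the 9:00am restore point, and the calendar date are my own assumptions for illustration; only the 11:00am and 4:00pm times come from this incident, and none of this is how the CDP product itself works.

```python
from datetime import datetime


def latest_restore_point(restore_points, cutoff):
    """Return the newest restore point at or before cutoff, or None if there is none."""
    candidates = [p for p in restore_points if p <= cutoff]
    return max(candidates) if candidates else None


def loss_window(restore_point, failure_time):
    """Return how much time (and therefore data entry) sits between restore point and failure."""
    return failure_time - restore_point


# Hypothetical restore points for the damaged file (the date is assumed).
points = [datetime(2011, 9, 1, 9, 0), datetime(2011, 9, 1, 11, 0)]
failure = datetime(2011, 9, 1, 16, 0)

chosen = latest_restore_point(points, failure)
window = loss_window(chosen, failure)
print(chosen.strftime("%H:%M"), window)
```

In this case the gap is five hours of changes on one file, which is the number to weigh against the cost of rolling everything back further.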

So we went back to our DRP (though this was not a drill, they had already practiced and tested it). And here we get to the article, plus some additional notes to consider for your disaster plan:

Do You have a Disaster Recovery Plan (DRP)?
http://netsecurity.about.com/od/disasterrecovery/a/Do-You-Have-A-Disaster-Recovery-Plan-DRP.htm


Like my previous post, the article covers the same ground in a little more depth: defining the critical files, how to back up, your plans, downtime, etc. For some, that might be enough. But even with a decently thought-out plan, there are some things you can't control.


The problem was that the application software actually generated a corrupted file that was still accessible and "functional". But due to the corruption, it could not be written to the backup at 6pm (though I am still trying to figure out how it could be read, accessed, and modified).



Like GIGO (garbage in, garbage out), the corruption could be carried into your backup data set, and then you have a problem (depending on the type of backup). Also, since the application did not warn of any problem, the error in the file could potentially crash the software at any time.


As the article points out:
"Whichever method you choose, make sure you set a schedule to backup all your files at least once weekly, with incremental backups each night if possible. Additionally, you should periodically make a copy of your backup and store it off-site in a fire safe, safe deposit box, or somewhere other than where your computers reside. Off-site backups are important because your backup is useless if it's burned up in the same fire that just torched your computer."
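The schedule in that quote (a full backup once a week, incrementals each night) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the full backup runs on Sunday, and the function names are made up. It also shows which backups you would need on hand to restore to a given day.

```python
from datetime import date, timedelta

FULL_BACKUP_WEEKDAY = 6  # assumption: the weekly full backup runs on Sunday (Monday is 0)


def backup_type(day):
    """Return "full" on the weekly full-backup day, otherwise "incremental"."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"


def restore_chain(target):
    """Return the backup dates needed to restore up to target, oldest first.

    The chain is the most recent full backup plus every nightly incremental after it.
    """
    day = target
    chain = [day]
    while backup_type(day) != "full":
        day -= timedelta(days=1)
        chain.append(day)
    return list(reversed(chain))


chain = restore_chain(date(2011, 9, 1))
for d in chain:
    print(d.isoformat(), backup_type(d))
```

The longer the chain, the more pieces have to be intact for a restore to work, which is one reason a single bad file in the middle matters so much.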

Even with that setup, we would have had to go back to the last "complete" backup (which would have been from the day prior). But then all the entries in the other parts of the software would be lost. So you see the issue at hand. What would you do?

The amazing thing that is forgotten in all disaster plans is this: know your tools at hand. The software has an amazing audit trail of all entries (update/delete), and that file did not get damaged, so you could "duplicate the error". Could we go back 1 day prior? Yes, but with just one file affected, it was not worth going back 1 whole day. It was best to go back half a day and "rebuild and data enter" the data set up to the current state.
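In our case the rebuild was done by hand from the audit trail, but the idea is easy to sketch in Python: start from the last good copy of the file and re-apply every audited change made after that copy was taken. The record layout, field names, and sample data below are hypothetical and are not the actual software's audit format.

```python
from datetime import datetime


def replay_audit_trail(snapshot, audit_entries, snapshot_time):
    """Rebuild current records from a snapshot plus the audit entries made after it.

    snapshot: dict of record key -> value as of snapshot_time
    audit_entries: list of dicts with "time", "action" ("update" or "delete"), "key", "value"
    """
    records = dict(snapshot)  # work on a copy so the restored snapshot stays untouched
    for entry in sorted(audit_entries, key=lambda e: e["time"]):
        if entry["time"] <= snapshot_time:
            continue  # already reflected in the snapshot
        if entry["action"] == "update":
            records[entry["key"]] = entry["value"]
        elif entry["action"] == "delete":
            records.pop(entry["key"], None)
    return records


# Hypothetical data: the 11:00am copy of the file and the afternoon's audited changes.
snapshot_time = datetime(2011, 9, 1, 11, 0)
snapshot = {"A100": "old balance", "A200": "open"}
trail = [
    {"time": datetime(2011, 9, 1, 10, 30), "action": "update", "key": "A100", "value": "stale"},
    {"time": datetime(2011, 9, 1, 13, 15), "action": "update", "key": "A100", "value": "new balance"},
    {"time": datetime(2011, 9, 1, 14, 40), "action": "delete", "key": "A200", "value": None},
]

rebuilt = replay_audit_trail(snapshot, trail, snapshot_time)
print(rebuilt)
```

The point is the same whether a person or a script does the replay: the audit trail only helps if it lives somewhere the corruption did not reach.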

As the article states:
  • "Support phone numbers (for ISP, PC manufacturer, network administrators, tech support)"


Additionally, there was a contingency plan for how the office would function without data. Luckily, they keep some form of hard copy and they have an electronic journal in the software. Either would have worked, and they were able to backtrack the information.

Lastly, they had accounting tools at hand from the day prior, and the audit trail verified the numbers without issue. The software support team supplied the last tools to ensure the data was not corrupted anywhere else, and the team confirmed the entries were correct. It took a lot of mental, time-consuming validation, but they were up in 3 hours! Add to that, this involved multiple stakeholders.

What impressed me was that the CDP provided all the incremental and differential copies of the file (except at the time of corruption). They do not back up every second, but they update every 1-2 hours or when there is no activity.

Though I did not get back for my whole day of appointments until 9:00am (yes, I had a 24-hour work day), I am glad that the office had a plan, responded to the disaster, and executed the plan flawlessly. Lastly, for any matter out of our control, we just managed it, developed a solution with some thought, and executed it. That is adaptive and creative thinking.
