Are you ready for an outage like the NOTAM Incident?

Incident Response and Disaster Recovery planning and preparation would very likely have prevented the system outage or reduced its impact.

An FAA system outage left passengers across the US stranded on January 11, 2023. Several days later, the cause was determined to be a corrupted database file, introduced when personnel failed to follow procedures during routine maintenance: an engineer replaced one file with another without realizing a mistake had been made. The NOTAM system stopped processing updates on January 10 at 3:28 PM, and the FAA issued its first notice about the incident at 7:47 PM EST that day. It was not until 7:30 AM EST on January 11 that the FAA ordered a pause on all outgoing domestic flights. Outgoing flights could not take off for up to an hour and a half. At around 8:30 AM EST, flights began departing again when the FAA terminated the NOTAM outage advisory. The incident, which has been estimated to cost "millions of dollars," was caused by a single person performing regular maintenance without following documented procedures.

How could prepared, planned, and tested Incident Response and Disaster Recovery plans have helped lessen the impact and scope of this "cascading series of IT failures" that culminated in the FAA outage? The answer is simple: part of Incident Response and Disaster Recovery planning is identifying critical systems and applications. Once critical items are identified, plans are created to document how to respond to the failure or destruction of each system or application. The damage may be caused by natural disasters, internal or external malicious actors, or employees who make a mistake with no malicious intent, as in the FAA incident.

Once the plans have been created, the next step is to test them. Testing can be done in many ways, from non-intrusive walk-throughs to fully involved failover testing, and it can reveal gaps in the plan. For example, if a significant database fails catastrophically during testing and no failover or backup is available, that gap is identified and work can begin to remediate it. If the database is housed on a virtual system, remediation can be as simple as creating a snapshot before maintenance. You can then develop procedures for managing the snapshots. How many snapshots are to be kept? How long are snapshots saved? If the database is housed on a physical server, on premises or in a data center, what does your backup process look like? Do you have a mirrored spare? Do you have the server backed up to tape? Where and how long are the tapes kept? How often are backups taken? Once that has all been sorted out and documented, the disaster recovery test can be rerun to exercise the new procedures and verify that the gap has been closed.
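The snapshot questions above (how many to keep, and for how long) can be written down as an explicit retention policy instead of being left to memory. The sketch below is only an illustration in Python: the `Snapshot` class, the `select_snapshots_to_delete` function, and the example limits of three snapshots and 14 days are assumptions made for this example, not part of any particular virtualization product or of the FAA's environment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one pre-maintenance snapshot."""
    name: str
    created: datetime


def select_snapshots_to_delete(snapshots, now, max_count=5,
                               max_age=timedelta(days=14)):
    """Return the snapshots that fall outside the retention policy.

    A snapshot is kept only if it is among the `max_count` newest and is
    no older than `max_age`. The newest snapshot is always kept so that
    at least one restore point exists.
    """
    newest_first = sorted(snapshots, key=lambda s: s.created, reverse=True)
    to_delete = []
    for index, snap in enumerate(newest_first):
        if index == 0:
            continue  # always keep the most recent restore point
        too_many = index >= max_count
        too_old = now - snap.created > max_age
        if too_many or too_old:
            to_delete.append(snap)
    return to_delete


# Example: snapshots taken 1, 3, 10, 20, and 40 days ago (illustrative data).
now = datetime(2023, 1, 11, 8, 0)
snaps = [Snapshot(f"pre-maintenance-{d}", now - timedelta(days=d))
         for d in (1, 3, 10, 20, 40)]
expired = select_snapshots_to_delete(snaps, now, max_count=3,
                                     max_age=timedelta(days=14))
print([s.name for s in expired])
```

Writing the policy as code makes the answers testable: the disaster recovery test can check that the expected restore points actually exist before maintenance begins.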

The other side of this is Incident Response planning and testing. This plan enables an organization to respond to an incident in a standardized and timely manner. It is critical to test the plan, both to ensure that it is practical and to identify critical gaps. These gaps could be anything from misconfigured controls to miscommunication between groups or team members to ineffective integration between tools.

A process for reporting anything abnormal is documented and stored in a location that all employees can access, so they can report what they see. The incident response plan documents the appropriate contact when something needs to be reported. It also documents, for the incident response team, the escalation processes and procedures and the personnel involved, with their contact information. Once the initial investigation has been completed and a threat has been confirmed, the Incident Response Team can quickly and easily identify the correct escalation path.

Incident Response plans include sections documenting how the team identifies, communicates, contains, eradicates, and recovers from the incident. The final piece of the Incident Response Plan is the lessons learned session, which allows the team to review what happened, how it happened, what went right, and what went wrong, and to document any needed improvements.

Moser’s team of certified security professionals is ready to review and test your existing documentation or help create Incident Response or Disaster Recovery plans. Contact us via our web form.
