Incident Response Guidelines

What is an incident response?

For our purposes, an incident is any extended outage of a system or service under our control, or any security breach that may involve MIT data on those systems.

An incident response is a no-blame document covering the root cause of the incident, how we handled the response, and what we can do to prevent something similar from happening again. The intent is to make clear what happened and how we responded.

What requires an incident response?

  • Any outage that lasts more than 45 minutes.

  • Every suspected security breach, regardless of whether an outage was involved, and even if it was determined after the fact that no security breach occurred.
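The two criteria above can be sketched as a small check. This is only an illustration: the function name and its parameters are hypothetical, and the only value taken from these guidelines is the 45-minute threshold.

```python
from datetime import timedelta

# Threshold from the criteria above: outages lasting more than 45 minutes.
OUTAGE_THRESHOLD = timedelta(minutes=45)


def requires_incident_response(outage_duration=None, suspected_breach=False):
    """Return True if either triggering criterion is met.

    outage_duration: a timedelta, or None if there was no outage.
    suspected_breach: True for any suspected security breach, even one
        later determined not to have occurred.
    """
    if suspected_breach:
        return True
    return outage_duration is not None and outage_duration > OUTAGE_THRESHOLD
```

Note that a suspected breach triggers a report on its own, whether or not there was an outage.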

Who should write them?

The report should be created and shared with the other responders by whoever is currently investigating, as soon as the triggering criteria are met. However, writing out the full details of the report should not take precedence over fixing the problem. We can usually find a balance in which documenting helps with the fix rather than blocking it, even though it may not feel that way at times. Often the details of a report are filled in after the problem is resolved, by referring to email, logs, and Slack discussions and by asking the other people involved for more information.

Whenever someone else’s name appears in the timeline, that person is expected to participate in writing and reviewing the report. The report becomes the shared responsibility of everyone named in it.

What goes in a report?

The report template provides an overview of what should be included, but the general idea is:

Issue Summary: Provide an overview of the issue.

Timeline: Provide a detailed timeline of the incident. Include the names of staff who were involved, and as soon as you name someone, invite them to add details directly to the document.

Root Cause: Describe what caused the problem (this may not be known when the report is initially created, but by the end it hopefully will be). Don’t hide or downplay the root cause for any reason.

Resolution and recovery: Describe how the problem was resolved.

Corrective and Preventative Measures: Describe how the problem might be avoided in the future and how we might have responded differently. This may list specific changes that will be implemented immediately, as well as future projects to replace or improve problem components.
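For teams that track reports programmatically, the sections above could be modeled roughly as follows. This is a minimal sketch, not the actual report template: the class names, field names, and helper methods are all hypothetical, and only the section titles come from these guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEntry:
    time: str  # free-form timestamp, e.g. "10:05"
    who: str   # staff member involved in this step
    what: str  # what happened or what was done


@dataclass
class IncidentReport:
    issue_summary: str = ""
    timeline: list = field(default_factory=list)
    root_cause: str = ""  # may be unknown when the report is first created
    resolution_and_recovery: str = ""
    corrective_and_preventative_measures: list = field(default_factory=list)

    def people_named(self):
        """Everyone named in the timeline shares responsibility for the report."""
        return sorted({entry.who for entry in self.timeline})

    def missing_sections(self):
        """Return the titles of sections that have not been filled in yet."""
        sections = {
            "Issue Summary": self.issue_summary,
            "Timeline": self.timeline,
            "Root Cause": self.root_cause,
            "Resolution and recovery": self.resolution_and_recovery,
            "Corrective and Preventative Measures": self.corrective_and_preventative_measures,
        }
        return [title for title, value in sections.items() if not value]
```

A report can start with only a summary and a few timeline entries. `missing_sections()` then shows what remains to be filled in once the problem is resolved, and `people_named()` lists who should be invited to contribute and review.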

See also: How to write an Incident Report / Postmortem

Where do I get a template?

Incident Response Template

How do I submit my completed report?

  1. Put it in the Google Drive folder

  2. Email it to TLT

  3. Share it on Slack with the appropriate teams

Additional reading: