Incident management is a process used by IT Operations and DevOps teams to respond to and address unplanned events that can affect service quality or service operations. Incident management procedures aim to identify and correct problems while maintaining normal service and minimizing the impact on the business. Root cause analysis and postmortem analysis then help classify the problem correctly and prevent it from recurring.
Our systems will never be 100% secure. We can define incident response as a set of procedures that an investigator follows when examining a computer security incident. Incident management consists of monitoring and detecting events on a network and executing proper responses to those events.
Incident management procedures include the following common steps:
| Step | What does it mean? |
| --- | --- |
| Preparation | In this step we make sure that the organization has well-defined incident management procedures and a strong security posture, as well as an appropriate senior manager who is able to limit the damage if an incident occurs. The basic idea is that we need to be prepared before an incident happens. |
| Identification | The process of recognizing whether an event should be classified as an incident. In this step we determine whether the event is a minor issue or a larger one that must be classified as an incident. |
| Containment | Containment focuses on isolating the incident and the underlying system or network segment. |
| Eradication | This is an important step in the security incident response process. Recovering from a security incident requires the full removal of any malicious code or other threats that were introduced to the environment during the incident. That is the purpose of the eradication phase. |
| Recovery | Focuses on data restoration, system repair, and re-enabling any services or networks taken offline during the incident or the incident response. |
| Root cause analysis | The process of discovering the root causes of problems in order to identify appropriate solutions. |
| Postmortem analysis | A process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place. |
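The steps in the table can be modelled as an ordered sequence. The following is a minimal Python sketch, not a definitive implementation: the names `Phase`, `next_phase`, and `is_incident`, as well as the `user_threshold` default of 100 affected users, are assumptions made for illustration and do not come from any particular tool.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Incident management steps, in the order listed in the table."""
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    ROOT_CAUSE_ANALYSIS = 6
    POSTMORTEM_ANALYSIS = 7


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the step that follows `current`, or None after the last step."""
    if current is Phase.POSTMORTEM_ANALYSIS:
        return None
    return Phase(current + 1)


def is_incident(affected_users: int, service_down: bool,
                user_threshold: int = 100) -> bool:
    """Hypothetical identification rule: an event becomes an incident when
    a service is down or the number of affected users reaches the threshold.
    Real criteria would come from the organization's own procedures."""
    return service_down or affected_users >= user_threshold


# Walk through the whole lifecycle, starting from preparation.
phase: Optional[Phase] = Phase.PREPARATION
while phase is not None:
    print(phase.value, phase.name)
    phase = next_phase(phase)
```

In practice the identification rule would be agreed during preparation; the point of the sketch is only that the criteria are written down before an incident happens.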
In this article I would like to draw your attention to the last two steps of incident analysis and response: root cause analysis and postmortem discussion.
Root cause analysis (RCA) is a critical element of incident management. It helps us understand and reason about common problems. RCA has three main goals:
- To discover the root cause of a problem or event;
- To fully understand how to fix, compensate, or learn from any underlying issues within the root cause;
- To apply what we learn from this analysis to systematically prevent future issues or to repeat successes.
To perform RCA correctly, the team should follow these principles:
- Focus on correcting and remedying root causes rather than just symptoms.
- Don’t ignore the importance of treating symptoms for short-term relief.
- Realize there can be, and often are, multiple root causes.
- Focus on HOW and WHY something happened, not WHO was responsible.
- Be methodical and find concrete cause-effect evidence to back up root cause claims.
- Provide enough information to inform a corrective course of action.
- Consider how a root cause can be prevented (or replicated) in the future.
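Some of these principles can be checked mechanically once an RCA is written down in a structured form. The sketch below is one possible way to do that, under stated assumptions: the classes `RootCause` and `RcaReport`, the `gaps` method, and the sample data are all hypothetical and only illustrate the idea of allowing multiple root causes, each backed by evidence and a corrective action.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RootCause:
    description: str                      # HOW and WHY, never WHO
    evidence: List[str] = field(default_factory=list)
    corrective_action: str = ""


@dataclass
class RcaReport:
    symptom: str
    short_term_fix: str                   # relief for the symptom itself
    root_causes: List[RootCause] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """List what is still missing before the RCA can be called complete."""
        problems = []
        if not self.short_term_fix:
            problems.append("no short-term relief for the symptom")
        if not self.root_causes:
            problems.append("no root cause identified")
        for cause in self.root_causes:
            if not cause.evidence:
                problems.append(f"no evidence for: {cause.description}")
            if not cause.corrective_action:
                problems.append(f"no corrective action for: {cause.description}")
        return problems


# Illustrative data only: two root causes, the second one still incomplete.
report = RcaReport(
    symptom="Checkout API returned HTTP 500",
    short_term_fix="Restarted the service",
    root_causes=[
        RootCause(
            description="Connection pool exhausted under load",
            evidence=["Pool usage metric at its limit during the failure"],
            corrective_action="Alert on pool usage and raise the limit",
        ),
        RootCause(description="No load test before release"),
    ],
)
for problem in report.gaps():
    print(problem)
```

Note that the record has no field for a responsible person, which keeps the analysis on causes and effects rather than on blame.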
The postmortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be lost. It typically involves blame-free analysis and discussion soon after an incident or event has taken place.
Let’s have a look at the main principles of the postmortem process:
- Avoid Blame and Keep It Constructive. Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.
- Share Knowledge. In practice, teams share the first postmortem draft internally and ask a group of senior engineers to assess the draft for completeness. Sharing early enables the rapid collection of data and ideas, which is essential during the early creation of a postmortem. Reviewers can check the draft against questions such as:
  - Was key incident data collected for posterity?
  - Are the impact assessments complete?
  - Was the root cause sufficiently deep?
  - Is the action plan appropriate, and are the resulting bug fixes at an appropriate priority?
  - Did we share the outcome with relevant stakeholders?
- No Postmortem Left Unreviewed. An unreviewed postmortem might as well never have existed. To ensure that each completed draft is reviewed, we encourage regular review sessions for postmortems. In these meetings, it is important to close out any ongoing discussions and comments, to capture ideas, and to finalize the state. Once those involved are satisfied with the document and its action items, the postmortem is added to a team or organization repository of past incidents. Transparent sharing makes it easier for others to find and learn from the postmortem.
- Facing the challenge of introducing a postmortem culture:
  - Ease postmortems into the workflow. A trial period with several complete and successful postmortems may help prove their value and help identify which criteria should initiate a postmortem.
  - Make sure that writing effective postmortems is a rewarded and celebrated practice, both publicly and through individual and team performance management.
  - Encourage senior leadership’s acknowledgment and participation.
- Visibly Reward People for Doing the Right Thing
- Ask for Feedback on Postmortem Effectiveness. Regularly survey your teams on how the postmortem process is supporting their goals and how it might be improved. The survey results give the SREs in the trenches the opportunity to ask for improvements that will increase the effectiveness of the postmortem culture. Ask questions such as:
  - Is the culture supporting your work?
  - Does writing a postmortem entail too much toil?
  - What best practices does your team recommend for other teams?
  - What kinds of tools would you like to see developed?
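The review questions listed under Share Knowledge can also serve as a simple gate before a postmortem is added to the repository of past incidents. This is a rough sketch under assumptions: the names `REVIEW_QUESTIONS`, `unresolved`, and `ready_for_repository` are made up for illustration, and a real team would likely track the answers in its own review tool.

```python
from typing import Dict, List

# The review questions from the Share Knowledge item.
REVIEW_QUESTIONS: List[str] = [
    "Was key incident data collected for posterity?",
    "Are the impact assessments complete?",
    "Was the root cause sufficiently deep?",
    "Is the action plan appropriate and are resulting bug fixes at appropriate priority?",
    "Did we share the outcome with relevant stakeholders?",
]


def unresolved(answers: Dict[str, bool]) -> List[str]:
    """Return the questions that are unanswered or answered 'no'."""
    return [q for q in REVIEW_QUESTIONS if not answers.get(q, False)]


def ready_for_repository(answers: Dict[str, bool]) -> bool:
    """A postmortem is considered reviewed only when every question is 'yes'."""
    return not unresolved(answers)


# Example review session: everything is settled except the root cause depth.
answers = {question: True for question in REVIEW_QUESTIONS}
answers["Was the root cause sufficiently deep?"] = False

print(ready_for_repository(answers))
for question in unresolved(answers):
    print("Still open:", question)
```

Treating a missing answer as "no" mirrors the idea that an unreviewed postmortem might as well never have existed.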
Be ethical, save your privacy!