Table of Contents
- Executive Summary
- Incident Response in a Hybrid, Multi-Cloud World
- The Role of Automation in Incident Response
- Test Definition
- Findings, Benefits, and TCO Analysis
- Test Methodology
- Conclusion
- Appendix
- About GigaOm
- Copyright
1. Executive Summary
The way enterprise organizations respond to unexpected IT incidents is at a watershed moment. As organizations move their core operations onto SaaS and cloud-based platforms, they need to deliver and manage services in alignment with digital business delivery. Visibility into issues, and the time it takes to respond to them, have become critical to business success.
Figure 1. Impacts and Savings from an IT Incident Management Automation Solution
This challenge is exacerbated by the way applications are built and deployed. Agile, DevOps-type environments are characterized by a “need for speed,” with the rate of deployment measured in days or even hours. Site Reliability Engineers (SREs) and other roles have emerged as a way of curating these deployments and managing any incidents, but these roles need to be properly informed and enabled to do their work.
The latest generation of incident management tools, such as Everbridge xMatters and PagerDuty Rundeck, go some way toward dealing with these challenges. In this report, based on field testing of both xMatters and Rundeck, we look at how these tools address incident reporting and response, how they bring automation into the mix to share information with stakeholders, and how they may automate any necessary resolutions. We found:
- Key components of a good incident response in a hybrid environment include standard integrations and workflows, automation for incident response, and the ability to define, experiment with, research, and improve the workflow as needed.
- Solutions in this space can have different areas of priority and focus. For example, some products concentrate on automating solutions with code, while others focus on no-code/low-code solutions for engineers who may not be developers. Either way, the idea is that the solutions to on-call alerts can be automated.
- The findings in this report demonstrate that the world of SRE and DevOps needs automated on-call/incident resolution, which is where xMatters excels.
Figure 2. Stages in the Incident Management Lifecycle
Figure 2 shows the incident management lifecycle, albeit without the after-action report (AAR), which may trigger a separate process in the problem management lifecycle. Both xMatters and PagerDuty deliver the AAR as part of the postmortem problem process/report.