DevOps teams and site reliability engineers (SREs) contend with a never-ending flood of notifications and alerts about outages, potential threats, and other incidents. Companies rely on their DevOps teams not only to keep abreast of every notification but also to identify and prioritize the critical alerts and resolve problems in a timely manner. Yet in 2021, International Data Corporation (IDC) reported that companies with 500-1,499 employees ignored or failed to investigate 27% of all alerts. The phenomenon behind these missed alerts is called alert fatigue, and it’s an issue that many companies grapple with.
What is alert fatigue and why is it a problem?
Alert fatigue occurs when operators receive an overwhelming number of alerts and notifications. It’s easy to respond to the occasional alert, but the complex technologies and services that keep systems running and monitor for cyberattacks more often produce numerous alerts in quick succession. DevOps teams can quickly end up with too many alerts to manage, complicated by the fact that many of the notifications may be false alarms or duplicate incidents. Over time, operators become desensitized and start to miss important issues as they ignore alerts or respond more slowly.
The biggest risk of alert fatigue is that operators will overlook important information. Constant bombardment with alerts makes it difficult or impossible to identify truly critical issues among the waves of notifications. Operators may develop a habit of silencing alerts that actually require investigation just to keep up. The result is missed alerts and slow response times. When operators cannot effectively isolate, triage, and respond to real issues, companies cannot contain costly incidents like extended service outages and malicious cybersecurity attacks.
Another problem is that alert fatigue hurts employee retention. Receiving too many alerts to handle, or frequent false alarms, eventually causes burnout. Operators and engineers responsible for incident response already have stressful jobs, and alert fatigue adds the psychological burden of constantly losing ground as they try to handle an ever-growing volume of alerts with insufficient resources. Over time, alert fatigue drives away valuable DevOps talent.
What causes alert fatigue?
Your observability and alerting solution should identify potential issues in your system and send alerts to the appropriate responders. Because you don’t want to overlook any issue or miss an incident, you might default to sending as many alerts as possible. But not all alerts are equal: some incidents are more critical than others, and some notifications may not require a response at all. As you monitor more devices, services, and systems, the number of alerts grows, and it becomes more difficult to sift through them, prioritize the most critical issues, and respond promptly. The eventual result is alert fatigue, with overlooked or ignored alerts.
How to prevent alert fatigue
Using the right tools can go a long way toward preventing alert fatigue, but there are some general best practices for monitoring and alerting that you can follow no matter what tooling you use.
- Avoid sending alerts for events that are not actionable or have no impact. Adjust alerting thresholds as you gather more information and experience, and use logging instead of alerts for non-actionable events.
- Hold non-urgent alerts until regular working hours. Consider whether alerts for issues that do not affect service availability or performance can be postponed until the next working day.
- During major outages, silence alerts so that operators can focus on resolving the incident instead of responding to alerts for issues they already know about.
- Make sure your on-call schedule has enough coverage so that operators have time to recover between on-call shifts. Also, make sure that alerts go to the right teams so that operators only receive alerts that are relevant to their area of responsibility.
- Make alerts actionable. Alerts should include enough context for operators to understand the issue and the actions required to respond and resolve it. Develop specific response playbooks and procedures, especially for common incidents.
- Aggregate the results of your monitoring events so that your alerting system sends fewer alerts. Redundant alerts add to the noise that prevents operators from focusing on the most important issues. Consolidate alerts to remove duplicates whenever possible.
- Automate the alert response whenever possible. Use tooling to prioritize alerts before they are sent. Build scripts to run information-gathering and troubleshooting tasks so that operators only need to step in when the automated response cannot resolve the problem.
- Improve your system’s security and reliability—fewer issues means fewer alerts! Take a proactive approach and focus on minimizing attack surfaces and eliminating common issues that repeatedly cause alerts.
- Review your alerting process at regular intervals and adjust alerting thresholds based on your experience. Look for ways to refine your approach to monitoring and alerting to make things more efficient and help operators respond more effectively.
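Several of these practices can live together in a single alerting rule. The sketch below uses a hypothetical YAML-based alerting config to illustrate the combination; the field names, team name, and runbook URL are invented for the example and do not correspond to any particular tool.

```yaml
# Hypothetical alerting rule combining several best practices above.
# All field names are illustrative, not tied to a specific product.
rules:
  - name: disk-usage-warning
    check: disk_usage
    threshold: 85              # only alert past an actionable threshold
    severity: low
    schedule: business-hours   # hold non-urgent alerts until working hours
    dedupe_window: 1h          # consolidate repeats into one notification
    route: storage-team        # only the responsible team is paged
    runbook: https://wiki.example.com/runbooks/disk-usage  # actionable context
```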
Avoiding alert fatigue with Sensu
The Sensu observability pipeline delivers advanced alert management with consolidation and deduplication capabilities to help incident response teams respond to and resolve critical issues without contributing to alert fatigue.
Sensu’s auto-registration and deregistration feature eliminates noise that traditional discovery-based monitoring systems produce. Sensu agents automatically discover and register ephemeral infrastructure components and the services running on them. When an agent process stops, the Sensu backend can automatically process a deregistration event. This automatic registration and deregistration keeps your Sensu instance current and ensures that you receive timely observability event data without stale events or alerts for entities that no longer exist.
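As a sketch of how this looks in practice, the relevant settings live in the Sensu Go agent configuration file, agent.yml. The backend URL, subscription, and handler name below are placeholders for your own environment:

```yaml
# agent.yml (Sensu Go agent) — minimal sketch of automatic deregistration.
# URLs, subscriptions, and the handler name are example values.
backend-url:
  - "ws://sensu-backend.example.com:8081"
subscriptions:
  - webserver
# Ask the backend to create a deregistration event when this agent stops,
# so stale entities don't linger and generate alerts.
deregister: true
deregistration-handler: deregistration
```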
Sensu’s event filters evaluate expressions against the data in observability events to determine whether an event should be passed to a handler. For example, you can create filters that prevent alerts during specific hours or until a specific number of occurrences, or that consolidate alerts so they are sent once per hour. You can use the occurrences and occurrences_watermark attributes in event filters to fine-tune incident notifications and reduce alert fatigue, and you can even use event filters to manage contact routing so that alerts go to the right team via their preferred alerting method.
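For instance, a filter along the lines of the “hourly” example in Sensu’s documentation allows handling on an event’s first occurrence and then roughly once per hour thereafter (this assumes the check’s interval is expressed in seconds):

```yaml
# Allow an event through on its first occurrence, then about once per hour.
type: EventFilter
api_version: core/v2
metadata:
  name: hourly
spec:
  action: allow
  expressions:
    - event.check.occurrences == 1 || event.check.occurrences % (3600 / event.check.interval) == 0
```

Attach the filter to a handler, and repeated failures of the same check collapse into one reminder per hour instead of a page on every check run.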
Sensu events contain information about the affected services and the corresponding check or metric result. The observability data in Sensu events translates into meaningful, context-rich alerts that improve incident response and reduce alert fatigue. Sensu also offers event-based templating that allows you to add further actionable context and summary template arguments to make sure alerts include the information your operators need to take action.
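One way to supply that context is through annotations on a check definition, which handlers can then template into the outgoing alert. In this sketch, the annotation keys, command, and handler name are example values, not required names:

```yaml
# Check definition carrying extra context for alerts.
# Annotation keys and the command are illustrative examples.
type: CheckConfig
api_version: core/v2
metadata:
  name: check-nginx
  annotations:
    runbook_url: https://wiki.example.com/runbooks/nginx
    team: web-platform
spec:
  command: check_http -H localhost -p 80
  interval: 60
  publish: true
  subscriptions:
    - webserver
  handlers:
    - slack
```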
The Sensu Catalog is a collection of integrations for monitoring and alerting services like BigPanda, PagerDuty, and ServiceNow. The Catalog helps teams configure alerting based on the application performance metrics that they’re responsible for. Integrations include quick-start templates that provide a straightforward way to set alerting thresholds and adjust them as needed to maintain the most effective alert policy.
Sensu also offers alert consolidation, which helps operators monitor and correlate across multiple systems, and deduplication, which groups repeated alerts into a single incident. Sensu’s Business Service Monitoring feature allows you to maintain high-level visibility into your services and customize rule templates that define both a minimum threshold for taking action and the action to take. Use the built-in aggregate rule template to treat the results of multiple checks executed across multiple systems as a single event.
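A Business Service Monitoring service component using the built-in aggregate rule template follows roughly the shape below. This is a hedged sketch: BSM is a commercial feature, the exact fields depend on your Sensu version, and the names, query, and thresholds are example values.

```yaml
# Sketch of a BSM service component; names and thresholds are examples.
type: ServiceComponent
api_version: bsm/v1
metadata:
  name: webservers
spec:
  services:
    - website
  interval: 60
  query:
    - type: fieldSelector
      value: webserver in event.check.subscriptions
  rules:
    # Built-in aggregate template: emit one event summarizing many checks,
    # warning at 50% non-OK results and critical at 70%.
    - template: aggregate
      name: webservers-availability
      arguments:
        critical_threshold: 70
        warning_threshold: 50
```

Instead of one alert per web server, operators get a single service-level event once enough members of the group are unhealthy.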
Sensu offers many solutions for customizing alerting policies to focus on high-priority incidents, making sure alerts get to the ideal first responder, eliminating notification noise from recurring events, and preventing alert fatigue while reducing the time it takes to respond and recover.