PagerDuty is an alarm aggregation and dispatching service for system administrators and support teams. It collects alerts from your monitoring tools, gives you an overall view of all of your monitoring alarms, and alerts the on-duty engineer if there’s a problem. It is used by many major organisations such as Slack, Stripe and GitHub.
What is On-Call? 🤔
On-call is the process of keeping individuals or teams available at all times as the first line of defence in case of alerts, errors, failures or any abnormal behaviour in your product/service. On-call rotations are now widely accepted and fairly common in the software industry.
The amount of resources allocated to On-call generally depends on the criticality of the product/service your team owns. It can range from a dedicated team of SREs (or a DevOps team) handling on-call for your team around the clock to a couple of developers taking turns in a rotation.
Being On-call means that you could be contacted at any time to investigate & fix any issues that might arise in the service you own. However, the primary function is still to respond to and acknowledge a problem (even if you cannot fix it). Once the problem has been investigated, you can ask for more resources/team members to fix it.
Alerts on PagerDuty
PagerDuty sends alerts as “Pages” (Dr. House fans need no further explanation 😄). Pages were traditionally sent via a Pager, a device that is still commonly used in many medical institutions today.
A Page can have different urgencies, depending on your configuration:
- High-urgency notifications, escalate as needed
- Low-urgency notifications, do not escalate
- Dynamic notifications, based on alert severity
- Notifications based on support hours
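The last two options carry some logic of their own, so here is a minimal Python sketch of how such rules could work. The severity names, the 09:00 to 17:00 weekday support window and both function names are assumptions made for illustration; this is not PagerDuty's actual implementation or API.

```python
from datetime import datetime, time

HIGH, LOW = "high", "low"


def urgency_from_severity(severity: str) -> str:
    """Dynamic rule: severe alerts page loudly, everything else stays quiet.

    Treating "critical" and "error" as the severe levels is an assumption.
    """
    return HIGH if severity.lower() in {"critical", "error"} else LOW


def urgency_from_support_hours(
    at: datetime,
    start: time = time(9, 0),   # assumed start of support hours
    end: time = time(17, 0),    # assumed end of support hours
) -> str:
    """Support-hours rule: high urgency only on weekdays inside the window."""
    is_weekday = at.weekday() < 5  # Monday is 0, Sunday is 6
    in_window = start <= at.time() < end
    return HIGH if is_weekday and in_window else LOW
```

With these assumptions, a "warning" alert maps to low urgency, and an alert raised on a Saturday morning is treated as low urgency as well, so it would not wake anyone up.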
A few things to keep in mind when setting up On-Call schedules:
- Secondary Schedules: Backup schedules are always a good idea. It means two people are On-Call at once, so the Primary resource has a specific contact to reach out to when additional help is needed, instead of looking through random team members.
- Tertiary Schedules: The third level of escalation should probably be at a wider level (Team/Business Unit/Org level). This level should be reserved for the most critical alerts, and ideally should never be triggered.
- Managers: Managers should also be involved in the On-Call Schedules. This gives them some very valuable insights into the team.
- On-Call Roster: The On-Call Roster should not be too long. Also, as many team members as possible should be involved in the roster, to reduce the load on each person & increase overall understanding of the system.
- Notifications: Notifications should be enabled for as many modes as possible. Team members should add as many modes (email, SMS, call etc.) as they can & keep their contact information updated at all times.
- Leaves: On-Calls can be very stressful. Make sure to work only on On-Call responsibilities during your schedule. Also make sure to take a day off on the next working day if your On-call schedule runs through weekends/holidays.
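The primary, secondary and tertiary levels described above form an escalation chain. A rough Python sketch of that idea follows, assuming each level gets a fixed window to acknowledge before the next one is paged. The contact names, the timeouts and the `level_to_notify` helper are hypothetical and only illustrate the concept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    name: str
    contacts: tuple[str, ...]
    timeout_minutes: int  # time this level has to acknowledge (assumed)


# Hypothetical policy: names and timeouts are placeholders.
POLICY = (
    EscalationLevel("primary", ("alice",), 15),
    EscalationLevel("secondary", ("bob",), 15),
    EscalationLevel("tertiary", ("team-leads",), 30),
)


def level_to_notify(minutes_unacknowledged: int, policy=POLICY) -> EscalationLevel:
    """Return the level that should be holding the page right now."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level
    # Every window has expired: the page stays with the last level.
    return policy[-1]
```

For example, `level_to_notify(20)` returns the secondary level under this policy, because the primary's 15 minute window has already passed without an acknowledgement.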
Responsibilities of an On-call Resource
The responsibilities of an On-call resource depend on the organisation & team. Some of the broad on-call responsibilities are:
- Availability: You have to be available at all times during your on-call schedule (Yes, even at 2am 😅). In fact, some of the more critical alerts might come at odd times if your product/service is used in different time zones (high load) or if you run batch/nightly jobs during odd hours.
- Acknowledgement: This is one of the most important functions of an On-Call resource. You’ll have to acknowledge every alert you receive, so that no alert goes unnoticed. If you miss an alert, it might escalate to your backup resource or turn into a Business Unit/Org level alert.
- Analysis: This is again very important. You should be able to classify an alert in terms of priority & the amount of time/resources it would take to fix.
- Fix: You should have enough understanding of the problem to fix it. If it requires more teammates to deploy the fix within a reasonable timeframe, you should involve them. If the alert is something you do not have much context on, or is outside your expertise, you should escalate it quickly. Sometimes the alert may have very little impact. Even in that case, you should create a JIRA ticket so that it is tracked & fixed at a later point in time.
- Improve: During an On-Call, you are also expected to monitor the performance of your systems. You should look out for processes that could be improved and work on making them faster. Also, create a JIRA ticket if the optimisation goes beyond the scope of your On-call and might need more thought.
- Support: You should be empathetic to the On-Call resource who takes over from you. Pass on your learnings and context during the handover. Also get them up to speed on the current alerts which they might have to work on.
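The Analysis and Fix steps above boil down to a small decision tree, which can be sketched in Python as below. The `Alert` fields, the two hour threshold and the order of the checks are assumptions made for this illustration; your team's actual rules will differ.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix now"
    PULL_IN_TEAMMATES = "pull in teammates"
    ESCALATE = "escalate"
    CREATE_TICKET = "create a ticket for later"


@dataclass(frozen=True)
class Alert:
    summary: str
    high_impact: bool
    within_expertise: bool
    estimated_hours: float  # rough effort estimate made during analysis


def triage(alert: Alert, solo_limit_hours: float = 2.0) -> Action:
    """Pick the next step for an acknowledged alert (hypothetical rules)."""
    # No context on the problem: hand it over quickly rather than guess.
    if not alert.within_expertise:
        return Action.ESCALATE
    # Low impact: track it so it gets fixed later.
    if not alert.high_impact:
        return Action.CREATE_TICKET
    # Too much work for one person in a reasonable timeframe.
    if alert.estimated_hours > solo_limit_hours:
        return Action.PULL_IN_TEAMMATES
    return Action.FIX_NOW
```

Checking expertise first is a deliberate choice in this sketch: if you lack context on an alert, you may not be able to judge its impact reliably either.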