SadServers Troubleshooting
Incident Response and Troubleshooting Tips
When you’re on call and an alert fires, work through these incident steps in order. Note that debugging or troubleshooting comes late: only after verification, triage, communication, and mitigation, and just before the postmortem.
- Verify the issue
- Triage
- Communicate and escalate if needed
- Mitigate
- Troubleshoot
- Postmortem
These are general incident response and debugging techniques you can use:
- Do not make things worse, e.g., don’t randomly change things you are not familiar with; follow existing runbooks and procedures whenever possible.
- Communicate with the team. Take notes as you go, including what you observe and what you change. A chat medium like Slack works well because it also keeps a timeline, but have a backup way of communicating as well. Acknowledge other people’s messages.
- Coordinate so people don’t step on each other. Communicate intent and be specific (e.g., “going to restart the x database on host y”). Make sure everybody knows who is in control. Ideally, one person should lead the troubleshooting effort and make changes, while others support by checking things, communicating with Customer Support and other teams, etc.
- Try to divide the problem space, ideally in half. However, you don't need to start with a strictly systematic search if past incidents point strongly to a likely cause.
- Change one thing at a time. Be mindful of which changes are easy to roll back and which are not. Follow good sysadmin practices like making a backup copy of a file before modifying it.
- Try first what has worked before. However, if you find yourself fixing the exact same issue repeatedly, that's usually a strong indicator that the underlying problem has not been properly addressed.
- Run the quickest tests that provide useful information first.
- If the initial quick tests don't reveal the cause, it's often a good idea to pause, step back, and restart debugging in a more systematic fashion. Revisit basic assumptions, and ask other people to validate your mental model of how the system is supposed to work.
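As an illustration of dividing the problem space in half, here is a minimal Python sketch of a bisection search. It is not part of the original guidance. The function name `first_bad`, the `is_bad` check, and the release names are all hypothetical. The sketch assumes the candidates are ordered (for example, releases in deploy order), that once the fault appears every later candidate is also faulty, and that the last candidate is known to be faulty.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first candidate for which is_bad() is true.

    Assumptions (hypothetical setup, not from the original text):
    - candidates are ordered, e.g. releases in deploy order;
    - once a candidate is bad, all later ones are bad too;
    - the last candidate is known to be bad.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            # The fault is at mid or earlier: discard the later half.
            hi = mid
        else:
            # mid is healthy: the fault was introduced after it.
            lo = mid + 1
    return candidates[lo]


if __name__ == "__main__":
    # Hypothetical release history; pretend v1.5 introduced the fault.
    releases = [f"v1.{i}" for i in range(8)]
    print(first_bad(releases, lambda r: int(r.split(".")[1]) >= 5))  # v1.5
```

With eight candidates, this needs at most three checks instead of up to seven for a linear scan, which matters when each check is slow or risky. In a real incident, `is_bad` would be whatever quick test tells you which side of the split the fault is on.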
Troubleshooting Guides by Technology
Production troubleshooting tips, common failure modes, and diagnostic commands for Linux, DevOps and SRE technologies.