We often check Grafana alerts for a quick, high-level view of what is happening on a Sourcegraph instance. They are often our first line of defense when we suspect resource allocation issues. However helpful they can be, they can sometimes be deceiving.
Recently, a customer reported that Code Hosts were failing across their instance. When I first checked Grafana, I found that no alerts were firing, which was odd: with failures that widespread, I'd usually expect at least one. I checked Grafana multiple times throughout the debugging process and never saw so much as a warning-level alert.
Due to some excellent instincts by one of our engineers, we were able to determine that this was an issue with CPU overload on our Redis containers. After solving the issue, I began investigating. We have critical alerts configured for CPU overload on our Redis containers, but nothing ever fired. Why?
The Culprit: Jitter
Jitter is a measure of noise variance around a set threshold. The chart below will help illustrate:
The chart above shows CPU usage for the customer's Redis container over 24 hours. Make special note of the "valleys" where the signal temporarily drops below the set threshold.
My theory is that every time I looked at Grafana, I happened to be checking during one of these valleys, when the service had backed off and no alert was active, so I never saw one.
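One way jitter can hide an alert can be sketched in a few lines of Python. This is a minimal illustration, not the actual alert logic: it assumes the alert only fires once the signal has stayed above the threshold for a continuous hold period (as a `for` clause does in Prometheus-style alerting rules). Whether this customer's alerts were configured that way is an assumption, and the readings, threshold, and hold period below are made up for the example.

```python
from typing import List


def alert_fires(samples: List[float], threshold: float, hold_samples: int) -> bool:
    """Return True if the signal stays above `threshold` for at least
    `hold_samples` consecutive readings (an assumed hold-period rule)."""
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run >= hold_samples:
                return True
        else:
            run = 0  # a single dip below the threshold resets the count
    return False


# Hypothetical CPU readings in percent, one per minute.
steady = [95.0] * 10
jittery = [95.0, 96.0, 97.0, 98.0, 85.0] * 2  # dips below the threshold every fifth minute

THRESHOLD = 90.0  # assumed alert threshold
HOLD = 5          # assumed hold period: 5 consecutive readings

print("steady fires: ", alert_fires(steady, THRESHOLD, HOLD))
print("jittery fires:", alert_fires(jittery, THRESHOLD, HOLD))
```

In this sketch the jittery series averages well above the threshold, yet each valley resets the count before the hold period is reached, so the alert never fires.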
How do I avoid being tripped up in the future?
If you suspect something is amiss and that you should be seeing an alert, you can use the alert history in Grafana to get a better grasp of what alerts are firing and when.
You can see this with Redis, for example, by visiting Grafana and selecting Dashboard > Manage > Redis... then clicking the dropdown on the alerts panel and selecting Explore. If your suspicion is correct, you should see a list of fired alerts, like this:
You can also set the time range to whatever you'd like using the dropdown in the top right-hand corner of your Grafana instance:
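The same history check can also be scripted outside the Grafana UI. The sketch below builds a range query against the Prometheus HTTP API for the built-in `ALERTS` series, which records both `pending` and `firing` states of each alerting rule. It assumes the alerts are defined as Prometheus alerting rules and that Prometheus is reachable directly. The base URL, the alert name, and the timestamps are hypothetical placeholders, not values from a real instance.

```python
from urllib.parse import urlencode


def alert_history_url(base_url: str, alert_name: str, start: int, end: int,
                      step: str = "60s") -> str:
    """Build a Prometheus /api/v1/query_range URL for one alert's history.

    `start` and `end` are Unix timestamps in seconds. The ALERTS series
    carries an `alertstate` label ("pending" or "firing") on each sample.
    """
    query = f'ALERTS{{alertname="{alert_name}"}}'
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical values: replace with your Prometheus address and a real alert name.
url = alert_history_url(
    base_url="http://localhost:9090",
    alert_name="hypothetical_redis_cpu_alert",
    start=1700000000,
    end=1700086400,  # 24 hours after `start`
)
print(url)
```

Fetching that URL returns JSON time series. Samples with `alertstate="pending"` would indicate an alert that crossed its threshold without holding long enough to fire, which is the pattern jitter tends to produce.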
Learn more about jitter here.