September 16, 2014
4 min read time

Blog for a Sysadmin - Monitoring Health in Varnish Cache

At Varnish Software, we like to share our tips, tricks and knowledge with our readers. In what I hope will become a series under the banner of 'Blog for a Sysadmin', I'd like to take you through the essentials of maintaining a Varnish Cache setup. First up: monitoring your Varnish Cache setup.

Monitoring Health in Varnish Cache

Varnish Cache should be monitored in any serious environment it is part of: health monitoring to report any immediate problems, and trend monitoring to gather data on which to base decisions. This blog post covers the first topic, health monitoring.

My approach to health monitoring is to monitor every component in the stack separately, making it really easy to see exactly which component has a problem. In addition, I like to have dedicated business logic checks that exercise the entire chain of components to verify that they are playing nice together. The latter is outside the scope of this blog post.

The very first thing I’d like to know is whether my Varnish Cache is running and able to handle HTTP requests properly, without having the backends as part of the equation. The following VCL snippet makes sure that the URL /varnish-status always returns 200:

sub vcl_recv {
    if (req.method == "GET" && req.url == "/varnish-status") {
        return(synth(200, "OK"));
    }
}

Varnish 3 equivalent:
sub vcl_recv {
    if (req.request == "GET" && req.url == "/varnish-status") {
        error 200 "OK";
    }
}

The URL may be protected using ACLs if you don’t want to expose /varnish-status to the world.
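For example, a minimal sketch in Varnish 4 syntax (the ACL entries are placeholders for your actual monitoring hosts):

```vcl
acl monitoring {
    "localhost";
    "192.0.2.0"/24;  # placeholder: your monitoring network
}

sub vcl_recv {
    if (req.url == "/varnish-status" && client.ip !~ monitoring) {
        return(synth(403, "Forbidden"));
    }
}
```

Note that this check must appear before the snippet returning 200, since Varnish runs multiple vcl_recv definitions in the order they appear.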

When this VCL is in place, I use an HTTP check in the monitoring system, for example the check_http plugin [1], to verify that /varnish-status returns 200 on the port varnishd is listening on. This check can run often and have a short timeout, as these requests are very cheap for Varnish to handle. Suggested command:

check_http -H localhost -p 80 -u /varnish-status -e 200 -w 1 -c 2
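Like other monitoring-plugins-compatible checks, check_http communicates its result through its exit code, and the monitoring system maps that code to a service state. A minimal sketch of that mapping:

```shell
# Map monitoring-plugin exit codes to service states
# (0=OK, 1=WARNING, 2=CRITICAL, anything else=UNKNOWN):
interpret_exit_code() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

interpret_exit_code 0   # prints: OK
interpret_exit_code 2   # prints: CRITICAL
```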

The second thing I’ll add is the check_varnish plugin [2]. It makes it possible to define warning and critical thresholds for any of the counters available in varnishstat. The counters that matter most for health monitoring are:

  • MAIN.sess_dropped

This counter (called n_wrk_drop in Varnish 3) shows the number of requests that have been dropped because no more threads were available to handle them. This can be difficult to detect without proper monitoring. I start out with the warning threshold at 0 and the critical threshold at 5, to get a notification early, but not immediately. Suggested command:

check_varnish -p MAIN.sess_dropped -w 0 -c 5
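If check_varnish is not available, the same counter can be read from `varnishstat -1` output and compared against a threshold in plain shell. In this sketch the varnishstat output is replaced with a canned sample line so the parsing logic stands on its own (in practice you would pipe in something like `varnishstat -1 -f MAIN.sess_dropped`):

```shell
# A sample line in `varnishstat -1` format: name, value, rate, description.
sample='MAIN.sess_dropped            3         0.00 Sessions dropped'

# Extract the counter value and compare it against a warning threshold of 0.
value=$(printf '%s\n' "$sample" | awk '$1 == "MAIN.sess_dropped" { print $2 }')
if [ "$value" -gt 0 ]; then
    echo "WARNING: $value dropped sessions"   # prints: WARNING: 3 dropped sessions
else
    echo "OK"
fi
```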

  • MGT.child_panic

This counter (not available in Varnish 3) counts the number of times the child process has panicked. The master process restarts the child immediately when this happens, and the cache is flushed. Depending on the backend load, this may or may not be critical; in any case, I’d like to know about it. Suggested command:

check_varnish -p MGT.child_panic -w 0 -c 2

  • SMA.Transient.c_fail

This counter indicates that the operating system is unable to allocate memory as requested. If this happens, the OOM killer is likely to strike any second. Suggested command:

check_varnish -p SMA.Transient.c_fail -c 0
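Taken together, the three counter checks can also be run from a small shell wrapper that keeps the worst (highest) exit code seen, which is handy for a quick command-line health sweep. This is a sketch: the real calls would be the three check_varnish commands above, stubbed out here with true/false so the aggregation logic is runnable on its own:

```shell
# Track the worst monitoring-plugin exit code seen across several checks.
worst=0
run_check() {
    "$@"
    rc=$?
    if [ "$rc" -gt "$worst" ]; then
        worst=$rc
    fi
}

# In a real deployment these would be, for example:
#   run_check check_varnish -p MAIN.sess_dropped -w 0 -c 5
#   run_check check_varnish -p MGT.child_panic -w 0 -c 2
#   run_check check_varnish -p SMA.Transient.c_fail -c 0
run_check true    # stands in for a check that passes (exit 0)
run_check false   # stands in for a check that fails (exit 1)

echo "worst state: $worst"   # prints: worst state: 1
```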

With these checks in place you can sleep well at night, confident that your monitoring system will wake you up if any immediate problems occur with your Varnish Cache.

My next blog post will cover trend monitoring, which is useful when doing resource planning and tuning of your Varnish Cache instances. Please comment below if there's anything you'd like to see more of in the 'Blog for a Sysadmin' series.

[1] https://www.monitoring-plugins.org/doc/man/check_http.html
[2] https://github.com/varnish/varnish-nagios/