Varnish monitoring

Varnish ships with several tools and metrics for monitoring an installation. This tutorial covers the commands and counters that help you monitor the vital aspects of a Varnish installation.

Each command option is explained the first time it appears, and the man page of each tool has a more complete description.

varnishlog

varnishlog is in charge of presenting transaction logs in the most verbose way possible. This is a prime source of information for debugging.

Save all the current logs into a file:

# -d: dump all the logs present in the buffer, then exit
# -g raw: don't group records (lines) by transactions, just grab everything
# -w FILE: don't print logs, save them to FILE instead
varnishlog -d -g raw -w /tmp/varnishlog.raw

Slow client responses:

# -g request: group logs by request, including dependencies.
#             Useful to link a client request to a backend
# -q QUERY: only show transactions matching QUERY, in this case,
#           responses that took more than a second to be
#           delivered
varnishlog -d -g request -q "Timestamp:Resp[2] > 1.0"

Slow backend responses (more than one second before the backend response headers arrive):

varnishlog -d -g request -q "Timestamp:Beresp[2] > 1.0"

Requests that spent any time on the waiting list:

varnishlog -d -g request -q "Timestamp:Waitinglist[2] > 0.0"

Failures: 5XX responses, slow responses, and logged errors:

# -R RATE: rate-limit the output, here to at most 100 transactions per minute
varnishlog -d -g request -q "RespStatus ~ '^5' or Timestamp:Resp[3] > 10.0 or Error" -R 100/1m
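
The one-second threshold used in the slow-response queries is only a starting point. As a minimal sketch, the shell function below builds the same client-side query for an arbitrary threshold. The name slow_resp_query is a hypothetical helper, not part of Varnish.

```shell
# Build a VSL query that matches client responses slower than a threshold.
# The threshold is given in seconds and defaults to 1.0.
slow_resp_query() {
    threshold="${1:-1.0}"
    case "$threshold" in
        # Reject anything that contains characters other than digits and dots.
        *[!0-9.]*) echo "threshold must be a number of seconds" >&2; return 1 ;;
    esac
    printf 'Timestamp:Resp[2] > %s\n' "$threshold"
}

# Example, on a host with a running varnishd:
#   varnishlog -d -g request -q "$(slow_resp_query 2.5)"
```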

varnishadm

The varnishadm utility establishes a CLI connection to varnishd (Varnish daemon). The following are useful commands to troubleshoot a Varnish instance via varnishadm:

Show an overview of Varnish runtime parameters:

varnishadm param.show

Display the ban list, containing the ban expressions that are used to invalidate the cache:

varnishadm ban.list

Show the last Varnish panic, if any:

varnishadm -- panic.show

Display the health of the various backends in Varnish:

varnishadm -- backend.list -p
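
For scripted checks, the backend list can be reduced to the backends that are not healthy. The sketch below assumes that backend.list prints one header line followed by one line per backend, with the backend name in the first column and the word sick (in any letter case) somewhere on the line of an unhealthy backend. The exact layout differs between Varnish versions, so verify it against your own output first. The name sick_backends is a hypothetical helper.

```shell
# Print the names of backends whose line mentions "sick" (case-insensitive).
# Reads backend.list output on stdin; the first line is assumed to be a header.
sick_backends() {
    awk 'NR > 1 && tolower($0) ~ /sick/ { print $1 }'
}

# Example, on a host with a running varnishd (plain listing, without -p):
#   varnishadm backend.list | sick_backends
```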

varnishstat

The varnishstat utility collects and displays the counters and metrics of a Varnish instance, accumulated since startup.

Run varnishstat once and exit:

# -1: print the counters to stdout once, instead of using the interactive interface
varnishstat -1

varnishstat also accepts filters that can be applied as follows:

# -f GLOB: only show counters whose names match GLOB
varnishstat -1 -f 'MAIN.*'
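
The -1 output is line-oriented, which makes it easy to feed into scripts. As a minimal sketch, assuming each line starts with the counter name followed by its current value, a small function can pull out a single counter. The name counter_value is a hypothetical helper.

```shell
# Print the value of one counter from `varnishstat -1` output read on stdin.
# Returns a non-zero status if the counter is not present in the input.
counter_value() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1 }
        END { exit !found }
    '
}

# Example, on a host with a running varnishd:
#   varnishstat -1 -f 'MAIN.*' | counter_value MAIN.client_req
```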

Important Counters

Here’s a selection of important counters, but you can check the varnish-counters man page for the full listing.

man varnish-counters

MAIN COUNTERS (MAIN.*)

client_req

Number of parsable client requests received.

cache_hit

Number of cache hits.

cache_miss

Number of cache misses.

threads_limited

Number of times more threads were needed, but the limit of a thread pool had already been reached.

n_object

Number of HTTP objects (headers + body, if present) in the cache.

bans

Number of all bans in the system, including bans superseded by newer bans and bans already checked by the ban-lurker.

fetch_failed

Number of backend fetches that failed.

sess_queued

Number of sessions that were queued because no thread was immediately available. Consider increasing the thread_pool_min parameter.

sess_dropped

Number of sessions that were dropped because varnishd hit the maximum thread queue length. Consider increasing the thread_queue_limit parameter to drop fewer sessions.

exp_mailed

Number of objects mailed to expiry thread for handling.

exp_received

Number of objects received by expiry thread for handling.

threads

Total number of threads being used by Varnish.

n_lru_nuked

Number of least recently used (LRU) objects thrown out to make room for new objects. If this is zero, there is no reason to enlarge your cache. Otherwise, your cache is evicting objects due to space constraints. In this case, consider increasing the size of your cache.
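
Raw hit and miss counts are easier to interpret as a ratio. The sketch below computes a simple hit percentage, hits divided by hits plus misses, from varnishstat -1 output. It assumes the counter name and value are the first two columns, it does not account for requests that are passed or piped, and, because the counters are cumulative, it reports the average since startup rather than the current rate. The name hit_ratio is a hypothetical helper.

```shell
# Print the cache hit percentage from `varnishstat -1` output read on stdin.
# Prints "n/a" when neither hits nor misses have been recorded.
hit_ratio() {
    awk '
        $1 == "MAIN.cache_hit"  { hit = $2 }
        $1 == "MAIN.cache_miss" { miss = $2 }
        END {
            total = hit + miss
            if (total == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * hit / total
        }
    '
}

# Example, on a host with a running varnishd:
#   varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss | hit_ratio
```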

MSE COUNTERS (MSE.*)

mse.c_bytes

Bytes allocated.

mse.c_freed

Bytes freed.

mse.g_alloc

Allocations outstanding.

mse.g_bytes

Bytes outstanding.

mse.g_space

Bytes available.

mse.insert_timeout

Number of inserts that timed out.

mse.n_lru_nuked

Number of LRU nuked objects.

mse.n_lru_moved

Number of LRU move operations.

mse.c_memcache_hit

Stored objects cache hits.

mse.c_memcache_miss

Stored objects cache misses.

mse.g_ykey_keys

Number of YKeys registered.

mse.c_ykey_purged

Number of objects purged with YKey.

SMA COUNTERS (SMA.*)

g_bytes

Number of bytes allocated from the storage.

g_space

Number of bytes left in the storage.
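
Taken together, the two SMA counters give the fill level of a malloc storage. As a sketch, assuming the counters appear as SMA.<storage>.g_bytes and SMA.<storage>.g_space in varnishstat -1 output, with s0 as the storage name unless you named it yourself, the function below prints the percentage of the storage in use. The name sma_usage is a hypothetical helper.

```shell
# Print how full a malloc storage is, as a percentage, from
# `varnishstat -1` output read on stdin. The storage name defaults to s0.
sma_usage() {
    awk -v prefix="SMA.${1:-s0}." '
        $1 == (prefix "g_bytes") { used = $2 }
        $1 == (prefix "g_space") { free = $2 }
        END {
            total = used + free
            if (total == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * used / total
        }
    '
}

# Example, on a host with a running varnishd:
#   varnishstat -1 -f 'SMA.*' | sma_usage s0
```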