Troubleshooting Varnish

Varnish is incredibly stable, but that doesn’t mean you’ll never have to troubleshoot it. Depending on the available server resources and traffic patterns, misconfigured runtime parameters may cause unwanted behavior.

Crashes rarely occur, but when they do, you need to figure out what’s going on.

Even when Varnish is stable, the hit rate of the cache might be too low for your liking.

There are many ways to perform troubleshooting: from panic inspection, to Varnish Shared Memory Logs (VSL), to monitoring metrics.

Restarting and reloading

“Have you tried turning it off and on again?” is a very common question asked by IT support staff, and quite often it turns out to be an effective measure.

However, restarting is often considered lazy problem solving because it lacks the ambition to inspect the root cause. But when you’ve run out of options, or out of time, a quick restart can be considered.

Restart

Here’s how you can restart Varnish:

sudo systemctl restart varnish

This command will kill the varnishd runtime process and start it up again. Any runtime parameters that weren’t persisted in the systemd service file will be reset.

Reload

Performing a Varnish reload will only reload the VCL configuration and has no impact on the state of the varnishd main or child processes.

Here’s how you reload the VCL configuration:

sudo systemctl reload varnish

The systemd reload will call the varnishreload script that will gracefully add a new VCL configuration, while discarding older configurations.
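
If you’re not using systemd, you can perform roughly the same steps by hand through varnishadm. This is a minimal sketch; the configuration names new_cfg and old_cfg and the file path are illustrative:

# Compile and load the VCL file under a fresh name
sudo varnishadm vcl.load new_cfg /etc/varnish/default.vcl

# Activate it; traffic switches to the new configuration
sudo varnishadm vcl.use new_cfg

# Discard a configuration you no longer need (see vcl.list for names)
sudo varnishadm vcl.discard old_cfg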

Varnish panics

When the Varnish child process crashes, a panic is triggered and the child process is restarted. The panic.show CLI command will return the last panic.

Here’s how to return the last panic:

sudo varnishadm panic.show

An example panic

Here’s an example of a panic that we triggered ourselves through the varnishadm debug.panic.worker command:

$ sudo varnishadm panic.show
Panic at: Mon, 03 Jan 2022 13:16:43 GMT
Assert error in ccf_panic(), cache/cache_panic.c line 725:
  Condition((strcmp(\"\", \"You asked for it\")) == 0) not true.
version = varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a, vrt api = 7.1
ident = Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit,epoll
now = 11645.069782 (mono), 1641215803.491569 (real)
Backtrace:
  0xaaaaca50fcd0: /usr/sbin/varnishd(+0x4fcd0) [0xaaaaca50fcd0]
  0xaaaaca57c5f4: /usr/sbin/varnishd(VAS_Fail+0x24) [0xaaaaca57c5f4]
  0xaaaaca50e474: /usr/sbin/varnishd(body_status_2str+0) [0xaaaaca50e474]
  0xaaaaca57ea1c: /usr/sbin/varnishd(+0xbea1c) [0xaaaaca57ea1c]
  0xaaaaca57ee54: /usr/sbin/varnishd(VCLS_Poll+0x2d0) [0xaaaaca57ee54]
  0xaaaaca4f1e28: /usr/sbin/varnishd(CLI_Run+0x58) [0xaaaaca4f1e28]
  0xaaaaca50ad10: /usr/sbin/varnishd(child_main+0x1a0) [0xaaaaca50ad10]
  0xaaaaca54d6c4: /usr/sbin/varnishd(+0x8d6c4) [0xaaaaca54d6c4]
  0xaaaaca57ea1c: /usr/sbin/varnishd(+0xbea1c) [0xaaaaca57ea1c]
  0xaaaaca57ee54: /usr/sbin/varnishd(VCLS_Poll+0x2d0) [0xaaaaca57ee54]
thread = (cache-main)
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
},

The error message, the filename, the line number and the backtrace are the tools you have to figure out what caused the panic and how to potentially solve it.

You can get help when you cannot figure these panics out on your own.

Debug mode

The varnishd program has a -d flag that enables debug mode.

When enabling debug mode, the parent process will run in the foreground with a CLI connection on stdin/stdout. The child process must be started explicitly with a CLI command. Terminating the parent process will also terminate the child.

Enabling debug mode

Here’s an example where the -d debug flag is added to varnishd:

$ varnishd -f default.vcl -d
Warnings:
VCL compiled.

Debug: Version: varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a
Debug: Platform: Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit
200 318
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit
varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a

Type 'help' for command list.
Type 'quit' to close CLI session.
Type 'start' to launch worker process.

start
Debug: Child (927) Started
200 0

Info: Child (927) said Child starts

When to use debug mode

An occasional Varnish panic can be caught using the panic.show CLI command, as mentioned earlier. When the panic occurs, the child process will restart and request handling continues.

If multiple panics occur, each new panic overwrites the previous one, and the original panic you’re trying to investigate will no longer be available through panic.show. To prevent this information from being lost, debug mode can be used.

Debug mode is very similar to running the varnishd program in the foreground via the -F flag. The main difference is that a CLI is available and that the child process is not started automatically.

Thanks to the CLI being available, you can run additional commands while debugging. These commands include:

  • param.show, param.set and param.reset to show, set, and reset runtime parameters
  • stop and start to stop and start the varnishd child process
  • vcl.load, vcl.use and vcl.discard to load, activate and remove VCL configurations
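
As a quick illustration, here’s what a debug-mode CLI session might look like; the parameter and value are only examples:

start
param.show thread_pool_min
param.set thread_pool_min 200
vcl.list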

The Guru Meditation error

The Guru Meditation errors that Varnish returns are well-known, but usually have nothing to do with problems within Varnish.

As you can see in the example below, the Guru Meditation error reports that Varnish couldn’t reach the backend:

<!DOCTYPE html>
<html>
  <head>
    <title>503 Backend fetch failed</title>
  </head>
  <body>
    <h1>Error 503 Backend fetch failed</h1>
    <p>Backend fetch failed</p>
    <h3>Guru Meditation:</h3>
    <p>XID: 123456</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

The output format for these types of errors, and the fact that the term Guru Meditation is used, are part of the built-in VCL for the vcl_synth and vcl_backend_error subroutines.

The vcl_synth subroutine is responsible for synthetic responses that are returned in the client-side logic of Varnish. For errors that happen on the backend side, the vcl_backend_error subroutine is used.
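
You can override that output in your own VCL. Here’s a minimal sketch of a custom vcl_synth that replaces the Guru Meditation page with a plain-text message; the exact formatting is up to you:

sub vcl_synth {
    set resp.http.Content-Type = "text/plain; charset=utf-8";
    synthetic("Error " + resp.status + ": " + resp.reason);
    return (deliver);
}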

Client-side synthetic responses

If you want to intercept Varnish Shared Memory Logs (VSL) for client-side synthetic responses, you can run the following command:

sudo varnishlog -g request -q "VCL_call eq 'SYNTH'"

If you want to intercept VSL transactions for synthetic responses that return 400-range status codes, you can run the following command:

sudo varnishlog -g request -q "VCL_call eq 'SYNTH'\
 and RespStatus >= 400 and RespStatus < 500"

This will ensure that non-erroneous synthetic responses are not part of the output.

These commands will present a VSL transaction, in real time, every time a client-side Guru Meditation error takes place.

If you want to dump all the old transactions in the VSL memory buffer that match these criteria, you can add the -d parameter to the command. Here’s an example:

sudo varnishlog -d -g request -q "VCL_call eq 'SYNTH'\
 and RespStatus >= 400 and RespStatus < 500"

VSL output example

Here’s a VSL example for a request that resulted in a synthetic response with a 400-range status code:

$ sudo varnishlog -g request -q "VCL_call eq 'SYNTH'\
 and RespStatus >= 400 and RespStatus < 500"
*   << Request  >> 8
-   Begin          req 7 rxreq
-   Timestamp      Start: 1641228924.312547 0.000000 0.000000
-   Timestamp      Req: 1641228924.312547 0.000000 0.000000
-   ReqStart       172.17.0.1 55068 a0
-   ReqMethod      PRI
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   VCL_call       RECV
-   VCL_return     synth
-   VCL_call       HASH
-   VCL_return     lookup
-   Timestamp      Process: 1641228924.312633 0.000086 0.000086
-   RespProtocol   HTTP/1.1
-   RespStatus     405
-   RespReason     Method Not Allowed
-   RespReason     Method Not Allowed
-   RespHeader     Date: Mon, 03 Jan 2022 16:55:24 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-Varnish: 8
-   VCL_call       SYNTH
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   VCL_return     deliver
-   RespHeader     Content-Length: 272
-   Storage        malloc Transient
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1641228924.312984 0.000437 0.000351
-   ReqAcct        73 0 73 204 272 476
-   End

From these logs you can conclude that an HTTP 405 Method Not Allowed error was returned because the invalid PRI request method was used.

This behavior is part of the built-in VCL for vcl_recv that prevents the PRI request method from being used.

You can slim down the output by only selecting the tags you need:

$ sudo varnishlog -i RespStatus -i RespReason -g request -q "VCL_call eq 'SYNTH'
           and RespStatus >= 400 and RespStatus < 500"
*   << Request  >> 8
-   RespStatus     405
-   RespReason     Method Not Allowed
-   RespReason     Method Not Allowed
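
As mentioned, the block on the PRI method comes from the built-in vcl_recv, which looks roughly like this:

sub vcl_recv {
    if (req.method == "PRI") {
        /* This will never happen in properly formed traffic */
        return (synth(405));
    }
}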

Backend errors

When the Guru Meditation error returns output like 503 Backend fetch failed, the vcl_backend_error subroutine is called because the backend fetch failed.

This is not client-side logic based on faulty input from the user, but backend logic for a request that was not served from the cache. In the execution flow the vcl_backend_fetch and the vcl_backend_response subroutines are responsible for fetching a response from the backend and storing it in the cache.

When things go wrong, the synthetic error is returned, but you can also catch these errors in the Varnish Shared Memory Logs (VSL) using the following command:

sudo varnishlog -g request -q "VCL_call eq 'BACKEND_ERROR'"

VSL output example

Here’s a VSL example where the backend fetch fails and a backend error is returned:

$ sudo varnishlog -g request -q "VCL_call eq 'BACKEND_ERROR'"
*   << Request  >> 32790
-   Begin          req 32789 rxreq
-   Timestamp      Start: 1641287864.851443 0.000000 0.000000
-   Timestamp      Req: 1641287864.851443 0.000000 0.000000
-   ReqStart       172.17.0.1 55092 a0
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   VCL_call       RECV
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       MISS
-   VCL_return     fetch
-   Link           bereq 32791 fetch
-   Timestamp      Fetch: 1641287864.852788 0.001346 0.001346
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Backend fetch failed
-   RespHeader     Date: Tue, 04 Jan 2022 09:17:44 GMT
-   RespHeader     Server: Varnish
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   RespHeader     X-Varnish: 32790
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish (Varnish/6.0)
-   VCL_call       DELIVER
-   VCL_return     deliver
-   Timestamp      Process: 1641287864.852818 0.001375 0.000030
-   RespHeader     Content-Length: 282
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1641287864.852978 0.001535 0.000160
-   ReqAcct        73 0 73 250 282 532
-   End
**  << BeReq    >> 32791
--  Begin          bereq 32790 fetch
--  VCL_use        boot
--  Timestamp      Start: 1641287864.851951 0.000000 0.000000
--  BereqMethod    GET
--  BereqURL       /
--  BereqProtocol  HTTP/1.1
--  BereqHeader    Host: localhost
--  BereqHeader    Accept-Encoding: gzip
--  BereqHeader    X-Varnish: 32791
--  VCL_call       BACKEND_FETCH
--  VCL_return     fetch
--  FetchError     backend boot.default: fail errno 111 (Connection refused)
--  Timestamp      Beresp: 1641287864.852372 0.000422 0.000422
--  Timestamp      Error: 1641287864.852377 0.000426 0.000004
--  BerespProtocol HTTP/1.1
--  BerespStatus   503
--  BerespReason   Service Unavailable
--  BerespReason   Backend fetch failed
--  BerespHeader   Date: Tue, 04 Jan 2022 09:17:44 GMT
--  BerespHeader   Server: Varnish
--  VCL_call       BACKEND_ERROR
--  BerespHeader   Content-Type: text/html; charset=utf-8
--  BerespHeader   Retry-After: 5
--  VCL_return     deliver
--  Storage        malloc Transient
--  Length         282
--  BereqAcct      0 0 0 0 0 0
--  End

The log output contains two transactions:

  • One for the client-side transaction with the incoming request and outgoing response
  • One for the backend fetch and backend response

In the backend transaction, the FetchError tag clearly indicates a Connection refused error. The fact that the backend server is not available is the reason for the failure.

Of course the output can be reduced by only selecting the necessary tags.

Backend health monitoring

Instead of waiting for backend fetch failures to occur, you can actively monitor the health of your backends. This requires a health probe to be defined and assigned to the backends:

vcl 4.1;

probe health {
    .request =
        "HEAD / HTTP/1.1"
        "Host: localhost"
        "Connection: close"
        "User-Agent: Varnish Health Probe";
    .interval  = 10s;
    .timeout   = 5s;
    .window    = 5;
    .threshold = 3;
}

backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = health;
}

The probe named health will send an HTTP HEAD call for the / resource to the backend every 10 seconds. The probe will time out after 5 seconds. As soon as 3 out of 5 health checks fail, the backend is considered sick.
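
While testing probes, it can be useful to override the health state administratively. The backend.set_health CLI command accepts the states sick, healthy, and auto, where auto hands control back to the probe:

sudo varnishadm backend.set_health boot.default sick
sudo varnishadm backend.set_health boot.default auto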

VSL backend health checks

The Backend_health VSL tag reports the backend health in real time. You can run the following command to display the backend health:

sudo varnishlog -g raw -i Backend_health

Here’s the output for the failing backend:

$ sudo varnishlog -g raw -i backend_health
0 Backend_health - boot.default Still sick -------- 2 3 5 0.000000 0.000000 Open error 111 (Connection refused)
0 Backend_health - boot.default Still sick -------- 2 3 5 0.000000 0.000000 Open error 111 (Connection refused)

As you can see, the same Connection refused error is returned. Because the interval of the health probe was set to 10 seconds, a new log line will appear every 10 seconds.

When the backend failure is fixed, the VSL will reflect the fix:

$ sudo varnishlog -g raw -i backend_health
0 Backend_health - boot.default Still sick 4---X-RH 1 3 5 0.003040 0.001013 HTTP/1.1 200 OK
0 Backend_health - boot.default Still sick 4---X-RH 2 3 5 0.005553 0.002148 HTTP/1.1 200 OK
0 Backend_health - boot.default Back healthy 4---X-RH 3 3 5 0.001177 0.001906 HTTP/1.1 200 OK
0 Backend_health - boot.default Still healthy 4---X-RH 4 3 5 0.001419 0.001784 HTTP/1.1 200 OK

Although the first log line already displays an HTTP/1.1 200 OK status, the backend is not yet considered healthy. For the backend to be fully healthy, 3 out of 5 checks should succeed according to the health probe settings.

As of the third log line, the backend is considered Back healthy and as of the fourth log line the backend is considered Still healthy.

CLI backend health

If you don’t want to consult the VSL and want a backend health summary, you can call the backend.list CLI command.

As you can see in the output below, the default backend is considered healthy thanks to the health check probe that is monitoring this backend:

$ sudo varnishadm backend.list
Backend name         Admin      Probe      Last updated
boot.default         probe      Healthy    5/5 Tue, 04 Jan 2022 10:00:31 GMT

If the backend becomes sick, the backend.list command will reflect this:

$ sudo varnishadm backend.list
Backend name         Admin      Probe      Last updated
boot.default         probe      Sick       2/5 Tue, 04 Jan 2022 10:04:41 GMT

Note: by default, backend.list only displays the backends attached to the active VCL, which can be confusing if you use VCL labels, for example. To show all the backends, use the optional globbing pattern '*.*' (quoted to avoid shell expansion):

$ sudo varnishadm backend.list '*.*'
Backend name         Admin      Probe      Last updated
boot.default         probe      Sick       2/5 Tue, 04 Jan 2022 10:04:41 GMT
foo_vcl.backend1     probe      Sick       5/5 Tue, 04 Jan 2022 10:04:42 GMT
foo_vcl.backend2     probe      Sick       5/5 Tue, 04 Jan 2022 10:04:40 GMT
bar_vcl.default      probe      Sick       5/5 Tue, 04 Jan 2022 10:04:42 GMT

Varnish is not caching

Troubleshooting is not only about dealing with errors in Varnish. Sometimes the hit rate of the cache will be too low and the backend will be overwhelmed while Varnish is still in perfect shape.

Figuring out why Varnish isn’t caching pages it was supposed to cache can be done using specific VSL commands.

Cache misses aren’t usually that bad: a miss is just a cache hit that hasn’t taken place yet. Eventually the object will end up in the cache. However, if the TTL of your objects is too low or if the cache is full, cached objects will be removed from the cache.
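
If overly short backend TTLs are the culprit, one option is to enforce a TTL floor in VCL. This is a minimal sketch that assumes a 60-second minimum is acceptable for your content:

sub vcl_backend_response {
    if (beresp.status == 200 && beresp.ttl > 0s && beresp.ttl < 60s) {
        set beresp.ttl = 60s;
    }
}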

Cache bypasses on the other hand have a bigger impact: these objects will never end up in the cache. This is either caused by built-in VCL logic that prevents data from being stored in the cache or served from the cache, or by custom VCL logic that has the same effect.

VSL commands

You could list all VSL transactions that resulted in a pass. The following command will help you with that:

sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_call eq 'PASS'"

You could also be looking at a specific URL and ask yourself the same question:

sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "ReqUrl eq '/'"

In this case we’re looking for requests that hit the homepage (/).

The output of these commands will tell us why Varnish wasn’t serving the response from the cache.

Because it was a POST request

The following VSL output indicates that an HTTP POST call was used for the request:

$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_call eq 'PASS'"
*   << Request  >> 2
-   ReqMethod      POST
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   VCL_call       RECV
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   VCL_return     fetch
-   VCL_call       DELIVER
-   VCL_return     deliver

And that’s the reason why this request was not served from the cache.

Because a Cookie header was passed

The built-in VCL for vcl_recv will not serve objects from the cache if the request contains a Cookie header.

The VSL output below illustrates this behavior:

$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_call eq 'PASS'"
*   << Request  >> 3
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   ReqHeader      Cookie: lang=en
-   VCL_call       RECV
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   VCL_return     fetch
-   VCL_call       DELIVER
-   VCL_return     deliver
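
A common remedy is to strip cookies from requests where they don’t matter. Here’s a minimal sketch that removes the Cookie header for static assets; the file-extension list is illustrative:

sub vcl_recv {
    if (req.url ~ "\.(css|js|png|jpg|gif|svg|ico|woff2?)(\?.*)?$") {
        unset req.http.Cookie;
    }
}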

Because an Authorization header was passed

The built-in VCL for vcl_recv also bypasses the cache when an Authorization header is sent. The idea is that authorized content is private and should not be cached.

Here’s the VSL output for such a request:

$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_call eq 'PASS'"
*   << Request  >> 98327
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   ReqHeader      Authorization: Basic dGVzdDp0ZXN0
-   VCL_call       RECV
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   VCL_return     fetch
-   VCL_call       DELIVER
-   VCL_return     deliver

Because of an invalid request method

In an earlier example we bypassed the cache for uncacheable request methods such as POST. Only GET and HEAD are cacheable.

But Varnish will also not cache if it doesn’t recognize the request method. When an invalid request method is used, Varnish will assume the input is not an HTTP request.

At that point the request will be piped to the backend. This means that a TCP connection is opened to the backend, all the incoming bytes are sent, and the response is returned. No effort is made to treat the request or response as HTTP.

In terms of the Varnish Finite State Machine, a pass transition is no longer made; instead, a pipe transition takes place.

Here’s how you query for these requests in VSL:

sudo varnishlog -g request -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_return eq 'pipe'"

And here’s some example VSL output:

$ sudo varnishlog -g request -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
    -I ReqHeader:Cookie -I ReqHeader:Authorization \
    -i VCL_call -i VCL_return \
    -q "VCL_return eq 'pipe'"
*   << Request  >> 22
-   ReqMethod      FOO
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   VCL_call       RECV
-   VCL_return     pipe
-   VCL_call       HASH
-   VCL_return     lookup
**  << BeReq    >> 23
--  VCL_call       PIPE
--  VCL_return     pipe

In this example the FOO request method was used, which results in a transition to the vcl_pipe subroutine.
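
The built-in vcl_recv only recognizes a fixed set of request methods and pipes everything else. The relevant logic looks roughly like this:

sub vcl_recv {
    if (req.method != "GET" &&
      req.method != "HEAD" &&
      req.method != "PUT" &&
      req.method != "POST" &&
      req.method != "TRACE" &&
      req.method != "OPTIONS" &&
      req.method != "DELETE") {
        /* Non-RFC2616 or CONNECT which is weird. */
        return (pipe);
    }
}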

Because of a Hit-For-Miss

When Varnish receives a response from the backend server, the vcl_backend_response logic will examine the cacheability of the response. If the response is not deemed cacheable, Varnish will create a Hit-For-Miss object that will allow the next requests to bypass the cache as well as the request coalescing waiting list.

As long as the backend response returns uncacheable responses, Hit-For-Misses will occur. We can spot these in VSL by looking for the HitMiss tag.
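
The conditions that trigger a Hit-For-Miss live in the built-in vcl_backend_response, which looks roughly like this:

sub vcl_backend_response {
    if (bereq.uncacheable) {
        return (deliver);
    }
    if (beresp.ttl <= 0s ||
      beresp.http.Set-Cookie ||
      beresp.http.Surrogate-Control ~ "no-store" ||
      (!beresp.http.Surrogate-Control &&
        beresp.http.Cache-Control ~ "no-cache|no-store|private") ||
      beresp.http.Vary == "*") {
        /* Mark as "Hit-For-Miss" for the next 2 minutes */
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}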

The following command will show all VSL transactions that contain a HitMiss tag:

sudo varnishlog -g request -q "HitMiss"

From a Finite State Machine point of view a transition from vcl_recv to vcl_hash is made. Although the requested object is found in the cache, it is a Hit-For-Miss object, which causes a transition to the vcl_miss subroutine.

A more targeted approach to find uncacheable content is to filter the logs for backend transactions that have uncacheable as one of the values in the TTL tag:

sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"

Let’s use this command, which only displays specific tags, to explore some uncacheable scenarios.

Because of a zero TTL

One of the potential root causes of a Hit-For-Miss is a response that results in a zero TTL. A TTL of zero can be set through the Cache-Control header or the Expires header.

Here are a couple examples of such header values:

Cache-Control: max-age=0
Cache-Control: s-maxage=0
Expires: Thu, 1 Jan 1970 12:00:00 GMT

Here’s some VSL output that illustrates this scenario:

$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
**  << BeReq    >> 32810
--  BereqURL       /
--  BerespHeader   Cache-Control: max-age=0
--  TTL            RFC 0 10 0 1641298426 1641298426 1641298426 0 0 cacheable
--  TTL            VCL 120 10 0 1641298426 cacheable
--  TTL            VCL 120 10 0 1641298426 uncacheable
--  BerespHeader   Vary: Accept-Encoding

You can clearly spot the Cache-Control: max-age=0 response header that was sent by the backend. And although the first occurrence of the TTL tag sees this as a cacheable response, the built-in VCL for this scenario turns it into an uncacheable response and creates a Hit-For-Miss object as a result.

Because of private, no-cache, no-store

Another scenario where a response is deemed uncacheable is when the Cache-Control response header contains keywords like private, no-cache or no-store.

Here’s some VSL output for this scenario:

$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
**  << BeReq    >> 43
--  BereqURL       /
--  BerespHeader   Cache-Control: private, no-cache, no-store
--  TTL            RFC 120 10 0 1641298797 1641298797 1641298796 0 0 cacheable
--  TTL            VCL 120 10 0 1641298797 cacheable
--  TTL            VCL 120 10 0 1641298797 uncacheable
--  BerespHeader   Vary: Accept-Encoding

The Cache-Control: private, no-cache, no-store response header is responsible for this response not being cacheable, which is in accordance with the built-in VCL for this scenario.

Because of a Surrogate-Control: no-store header

When the Surrogate-Control response header contains the no-store keyword, Varnish will not cache the response.

Here’s some VSL output for this scenario:

$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
**  << BeReq    >> 46
--  BereqURL       /
--  BerespHeader   Surrogate-Control: no-store
--  TTL            RFC 120 10 0 1641298986 1641298986 1641298985 0 0 cacheable
--  TTL            VCL 120 10 0 1641298986 cacheable
--  TTL            VCL 120 10 0 1641298986 uncacheable
--  BerespHeader   Vary: Accept-Encoding

Have a look at the built-in VCL for this scenario.

Because of a Set-Cookie header

The built-in VCL describes how Varnish deals with cookies out-of-the-box.

While the occurrence of a Cookie request header causes a cache bypass, the occurrence of a Set-Cookie header in the response causes a Hit-For-Miss object to be created.

Setting a cookie implies a state change, and such responses should not be cached. The built-in VCL has measures in place to prevent responses with Set-Cookie headers from being cached.

Here’s some VSL output for this scenario:

$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
**  << BeReq    >> 49
--  BereqURL       /
--  BerespHeader   Set-Cookie: language=en; expires=Thu, 03-Feb-2022 12:31:26 GMT; Max-Age=2592000; path=/
--  TTL            RFC 120 10 0 1641299487 1641299487 1641299486 0 0 cacheable
--  TTL            VCL 120 10 0 1641299487 cacheable
--  TTL            VCL 120 10 0 1641299487 uncacheable
--  BerespHeader   Vary: Accept-Encoding

Because of wildcard variations

The Vary header can be used to issue cache variations based on certain request header values.

A Vary: Accept-Language header will make sure a variation of a cached object is created based on the value of the Accept-Language request header. This ensures that you can cache multilingual content without having to write extra VCL.

The challenge is to strike the right balance between having enough cache variations and maintaining an acceptable hit rate.

But if a response contains a Vary: * header, Varnish is supposed to vary on all request headers. This defeats the purpose of caching, and the built-in VCL considers it an uncacheable scenario.

Here’s some VSL output for this scenario:

$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
	-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
	-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
**  << BeReq    >> 32813
--  BereqURL       /
--  BerespHeader   Vary: *
--  TTL            RFC 120 10 0 1641299772 1641299772 1641299771 0 0 cacheable
--  TTL            VCL 120 10 0 1641299772 cacheable
--  TTL            VCL 120 10 0 1641299772 uncacheable
--  BerespHeader   Vary: *, Accept-Encoding

Debug logs

Throughout this tutorial we’ve used the Varnish Shared Memory Logs (VSL) quite extensively. However, from a troubleshooting and debugging point of view, we haven’t yet looked at the Debug VSL tag.

Varnish will occasionally add some debug information to the logs through the Debug tag. VMOD developers may also use this tag to log information that wouldn’t fit in any other tag.

One example is an HTTP/2 upgrade request on a Varnish system that doesn’t have HTTP/2 enabled. The protocol upgrade happens through the Upgrade: h2c request header.

When Varnish notices it cannot upgrade the connection to HTTP/2 because the -p feature=+http2 runtime parameter wasn’t passed, it continues to serve HTTP/1.1 content, but logs the upgrade attempt through a Debug VSL tag.

Here’s an example:

$ sudo varnishlog -g request -c -i "Req*" -i Debug
*   << Request  >> 54
-   Debug          "H2 upgrade attempt%00"
-   ReqStart       127.0.0.1 47000 a0
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: localhost
-   ReqHeader      User-Agent: curl/7.74.0
-   ReqHeader      Accept: */*
-   ReqHeader      Upgrade: h2c
-   ReqHeader      X-Forwarded-For: 127.0.0.1
-   ReqAcct        87 0 87 273 33 306
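
If you actually want Varnish to accept HTTP/2, the feature flag can also be flipped at runtime; note that changes made through varnishadm only affect new sessions and are not persisted across restarts:

sudo varnishadm param.set feature +http2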

Timeouts

Varnish has a number of configurable timeouts, and tuning them can improve the overall experience.

Although you can focus on reducing the amount of time it takes to receive a request or the amount of time it takes to generate a response, sometimes it just makes sense to increase the timeout value.

Varnish allows you to configure both client-side timeouts as well as a collection of backend timeouts.

Client-side timeouts

The timeout_idle parameter is one of these. It’s a sort of keep-alive timeout that defines how long a connection remains open after a request. If no new request is received on the connection within the timeout (five seconds by default), the connection is closed.

The idle_send_timeout parameter defines how long Varnish is willing to wait for the next bytes, after having already received data. This is a typical between-bytes timeout.

And then there’s also the send_timeout parameter, which acts as a last-byte timeout.

From a Denial of Service (DoS) perspective these settings can help you prevent slowloris attacks.

From a performance point of view, these settings can also be used to improve the end-user experience. If your Varnish servers sit behind a set of load balancers, it makes sense to increase timeout_idle because you know they are the only devices that are directly connecting to Varnish, and they are most probably going to reuse their connections with Varnish.

If you’re handling large volumes of data that are processed by potentially slow clients, you can also increase the send_timeout value.
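
These client-side timeouts are regular runtime parameters. Here’s a sketch of how you might adjust them; the values are illustrative and should match your own traffic patterns:

sudo varnishadm param.set timeout_idle 65
sudo varnishadm param.set send_timeout 1200

To make such changes permanent, add the corresponding -p options to the varnishd command line in your systemd unit file.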

Backend timeouts

For requests that cannot be served from the cache, a backend connection is made to the origin server, which acts as the source of truth.

If your backend is slow, or the connection is unreliable, backend connections might be left open for too long. It is also possible that the connection is closed while data is still being sent.

In order to strike the right balance, Varnish offers a set of backend timeouts:

  • connect_timeout: the amount of time Varnish is willing to wait for the backend to accept the connection
  • first_byte_timeout: the timeout for receiving the first byte from the origin
  • between_bytes_timeout: the amount of time we are willing to wait in between receiving bytes from the backend

Here’s a quick reminder on how to configure this in VCL:

vcl 4.1;

backend default {
    .host = "origin.example.com";
    .port = "80";
    .connect_timeout = "10s";
    .first_byte_timeout = "90s";
    .between_bytes_timeout = "5s";
}

These settings are of course also available as varnishd runtime parameters, but it is important to know that the values in VCL are on a per-backend basis, and take precedence over the runtime parameters.

These parameters can also be specified on a per-request basis, using bereq.connect_timeout or bereq.first_byte_timeout from vcl_backend_fetch in VCL.
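
For example, to give one slow endpoint more time without relaxing the global defaults, you could do the following; the URL pattern and value are illustrative:

sub vcl_backend_fetch {
    if (bereq.url ~ "^/reports/") {
        set bereq.first_byte_timeout = 300s;
    }
}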

The backend_idle_timeout parameter is not configurable in VCL, defaults to 60 seconds, and defines how long a backend connection can be idle before Varnish closes it.

Performance issues

As mentioned at the beginning of this tutorial, the performance of Varnish can depend on the available server resources and traffic patterns. The configuration of runtime parameters can also play a part in this.

It is important to point out that the default values of the various runtime parameters were carefully considered. But that doesn’t mean they are right for your environment.

Degraded performance in Varnish usually takes place when Varnish doesn’t have enough cache storage, when Varnish doesn’t have enough workspace memory or when Varnish doesn’t have enough threads.

Not enough caching storage

By default, Varnish allocates 256 MB of memory for caching objects. That’s not a lot, but it’s conservative and should work on systems that have less than 1GB of RAM.

If Varnish is the only service running on the server, the rule of thumb is to allocate 80% of the server’s available memory to Varnish. This percentage only accounts for the object storage; memory used to process requests is not part of that calculation.

Monitoring the object storage memory consumption in Varnish can be done through the varnishstat program that is part of your Varnish installation.

varnishstat displays continuously updated counters, which can be filtered by using the -f option.

Here’s a varnishstat example that only monitors Shared Memory Allocation counters:

sudo varnishstat -f "SMA.*.g_*" -1

Here’s the output:

SMA.s0.g_alloc                  40      .   Allocations outstanding
SMA.s0.g_bytes            41445760      .   Bytes outstanding
SMA.s0.g_space            63411840      .   Bytes available
SMA.Transient.g_alloc           23      .   Allocations outstanding
SMA.Transient.g_bytes         7176      .   Bytes outstanding
SMA.Transient.g_space            0      .   Bytes available

We can conclude that this server still has 63411840 bytes of storage available and 41445760 bytes of storage are in use.

The Transient storage engine is a secondary storage engine that is not limited in size. It is used for short-lived objects and for streaming uncacheable data.

If you’re running out of memory and Varnish needs to store new objects, it will evict the least recently used (LRU) objects to free up space.

The MAIN.n_lru_nuked counter in varnishstat will increase when this happens. This is a clear warning that you’re running out of memory. This is not disastrous because we’re dealing with a cache: the content can always be fetched from the origin server.
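
To keep an eye on evictions, you can poll this counter directly; a steadily growing value means the cache is under memory pressure:

sudo varnishstat -1 -f MAIN.n_lru_nuked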

Increasing the cache storage

If it turns out that you haven’t allocated enough memory to Varnish, you can increase the cache storage by modifying the -s runtime parameter in your systemd unit file.

You can edit the runtime configuration through the following command:

sudo systemctl edit --full varnish

An editor will open and this is what you’ll see:

[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target nss-lookup.target

[Service]
Type=forking
KillMode=process

# Maximum number of open files (for ulimit -n)
LimitNOFILE=131072

# Locked shared memory - should suffice to lock the shared memory log
# (varnishd -l argument)
# Default log size is 80MB vsl + 1M vsm + header -> 82MB
# unit is bytes
LimitMEMLOCK=85983232

# Enable this to avoid "fork failed" on reload.
TasksMax=infinity

# Maximum size of the corefile.
LimitCORE=infinity

ExecStart=/usr/sbin/varnishd \
	  -a :80 \
	  -a localhost:8443,PROXY \
	  -p feature=+http2 \
	  -f /etc/varnish/default.vcl \
	  -s malloc,256m
ExecReload=/usr/sbin/varnishreload

[Install]
WantedBy=multi-user.target

Change -s malloc,256m to a value that suits your needs. Once you’re done, save the file and close the editor.

Then restart Varnish to apply the new settings by running the following command:

sudo systemctl restart varnish

Not enough workspace memory

Workspace memory is a Varnish-specific concept. This type of memory is used for request and response parsing, for VCL storage, and also for any VMOD requiring memory space to store data.

There are different kinds of workspaces, and each of them can be tuned. When a workspace overflow occurs, this means the transactions couldn’t allocate enough memory to perform their tasks.

  • The workspace_client parameter, with a default value of 64 KB, is used to limit memory allocation for HTTP request handling.
  • The workspace_backend parameter, which also has a default value of 64 KB, sets the amount of memory that can be used during backend processing.
  • The workspace_session parameter limits the size of workspace memory used to establish the TCP connection to Varnish. The default value is 0.5 KB.

When a task consumes more memory than allowed in one of the specific workspace contexts, the transaction is aborted, and an HTTP 503 response is returned. When a workspace_session overflow occurs, the connection will be closed.

It is always possible to increase the size of the various workspaces. Memory consumption depends on what happens in VCL, but also depends on the input Varnish receives from clients and backends.

A better solution is to optimize your VCL, or reduce the size and the amount of headers that are sent by the backend. But sometimes, you have no control over this, or no way to significantly reduce memory consumption. In that case, increasing workspace_client or workspace_backend is your best move.

Luckily there are ways to monitor workspace overflow. These workspaces have a varnishstat overflow counter:

  • MAIN.ws_client_overflow
  • MAIN.ws_backend_overflow
  • MAIN.ws_session_overflow
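
You can watch all three counters at once using a glob pattern, just like we did for the storage counters:

sudo varnishstat -1 -f "MAIN.ws_*_overflow"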

When these counters start increasing, don’t blindly increase the workspace size. Instead, have a look at your logs, see which transactions cause the overflow, and try to figure out if you can optimize that part of your VCL to avoid the overflows in the future.

As always, varnishstat and varnishlog will be the tools you need to figure out what is going on before deciding to increase the size of the workspaces.

Not enough threads

The threading settings are the best way to tune the concurrency of Varnish. The thread_pool_min and thread_pool_max runtime parameters control how many threads are in the thread pool.

By default there are two thread pools, which can be changed by modifying the thread_pools parameter. But benchmarks have shown no significant improvement by changing the value.

When varnishd starts, 200 threads are created in advance: 100 threads per thread pool. The thread_pool_min parameter can be used to tune this number.

Growing the thread pools

When the thread pools are running out of available threads, Varnish will grow the pools until thread_pool_max is reached. Growing the pools is also somewhat resource intensive.

By increasing thread_pool_min, it’s easier to cope with an onslaught of incoming traffic upon restart. This is common when Varnish sits behind a load balancer and is suddenly added to the rotation.

On the other hand, starting varnishd with too many threads will have an impact on the resource consumption of your server.

A good indicator is the MAIN.threads counter in varnishstat. It lets you know how many threads are currently active. You can correlate this number to your resource usage, and it helps you establish a baseline.

The thread_pool_min parameter should at least be the number of threads in use for an average traffic pattern.

When the MAIN.threads_limited counter increases, it means the thread pools ran out of available threads. If there is room to grow the pools, Varnish will add threads. If not, tasks will be queued until a thread is available.

The MAIN.threads_limited counter might increase early on, when the MAIN.threads counter reaches the thread_pool_min threshold. As the pools grow, this will occur less frequently. But when MAIN.threads reaches the thread_pool_max value, the change rate of the MAIN.threads_limited counter can indicate problematic behavior.

The thread queue doesn’t have an infinite size: each thread pool has a queue limit of 20 tasks. This is configurable via the thread_queue_limit parameter. When the queue is full, any new task or request will be dropped.

So when the MAIN.threads_limited counter increases, and the MAIN.sess_dropped or MAIN.req_dropped counters are increasing as well, know that the queue is full and sessions/streams are being dropped.

You can choose to increase the thread_queue_limit, which will allow more tasks to be queued. Unless resources are too tight, you really want to increase thread_pool_min because it will make your system more responsive.
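
Here’s a sketch of the corresponding monitoring and tuning commands; the thread_pool_min value is illustrative:

sudo varnishstat -1 -f MAIN.threads -f MAIN.threads_limited -f MAIN.sess_dropped
sudo varnishadm param.set thread_pool_min 500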

Shrinking the thread pools

When a thread has been idle for 300 seconds, Varnish will clean it up. This is controlled by the thread_pool_timeout parameter. The MAIN.threads counter will reflect this.

This means Varnish will automatically shrink the thread pools based on demand, but with a delay. If the increased server load caused by varnishd worker threads is too much for your system, you should decrease thread_pool_max to an acceptable level.

If you believe Varnish needs to clean up idle threads quicker, you can reduce the thread_pool_timeout. But remember: destroying threads also consumes resources.

Tuning the worker stack size

Another factor that will impact server load is the worker stack size. This is stack space that is consumed by every worker thread. By limiting the stack size, we manage to reduce the memory footprint of Varnish on the system. The size is configurable via the thread_pool_stack parameter. The default stack size in Varnish is 48 KB. The default process stack size on Linux is typically multiple orders of magnitude larger than the stack sizes we use in Varnish.

Stack space is typically consumed by third-party libraries that are used by Varnish. libpcre, the library to run Perl Compatible Regular Expressions, can consume quite a bit of stack space. If you write very complicated regular expressions in VCL, you might even cause a stack overflow.

When a stack overflow happens, you should increase the value of thread_pool_stack. But this, in its turn, will have a direct impact on resource consumption because the worker stack size is per thread.
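
Increasing the stack size is again a runtime parameter change; the value below is only an example:

sudo varnishadm param.set thread_pool_stack 96k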

If you set your worker stack size to 100 KB and you have 5000 threads in each of the two thread pools, this will consume almost 1 GB of memory. So be careful, and consider reducing thread_pool_max when this would be too taxing on your system.

HTTP limits

HTTP requests and responses are parsed by Varnish. As mentioned earlier, parsing them requires a bit of workspace memory.

Incoming requests and cached responses are parsed in the client context, and use client workspace memory. When a cache miss takes place, and the response needs to be parsed from the origin server, we operate in the backend context. This will consume backend workspace memory.

There are certain limits in place that prevent Varnish from having to waste too much memory on request and response parsing and to avoid DoS attacks.

Here’s an overview of runtime parameters that limit the length and size of requests and responses:

  • http_max_hdr: the maximum number of headers an HTTP request or response may contain. The default value is 64.
  • http_req_hdr_len: the maximum size of an individual request header. By default this is 8 KB.
  • http_req_size: the maximum total size of the HTTP request headers. This defaults to 32 KB.
  • http_resp_hdr_len: the maximum size of an individual response header. By default this is 8 KB.
  • http_resp_size: the maximum total size of the HTTP response headers. This defaults to 32 KB.

When requests or responses exceed these limits, the transaction will fail.
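
If a trusted backend legitimately exceeds these limits, they can be raised through the corresponding runtime parameters; the values below are illustrative:

sudo varnishadm param.set http_resp_hdr_len 16k
sudo varnishadm param.set http_resp_size 64k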

HTTP request limit examples

Here’s some example logging output when the http_max_hdr threshold is exceeded:

*   << Request  >> 5
-   Begin          req 4 rxreq
-   Timestamp      Start: 1611051232.286266 0.000000 0.000000
-   Timestamp      Req: 1611051232.286266 0.000000 0.000000
-   BogoHeader     Too many headers: foo:bar
-   HttpGarbage    "GET%00"
-   RespProtocol   HTTP/1.1
-   RespStatus     400
-   RespReason     Bad Request
-   ReqAcct        519 0 519 28 0 28
-   End

As you can see, an HTTP 400 status code is returned when this happens.

Here’s an example where an individual request header exceeds the http_req_hdr_len limit:

*   << Request  >> 98314
-   Begin          req 98313 rxreq
-   Timestamp      Start: 1611051653.320914 0.000000 0.000000
-   Timestamp      Req: 1611051653.320914 0.000000 0.000000
-   BogoHeader     Header too long: test:YnEJyVqxTMgn7aX
-   HttpGarbage    "HEAD%00"
-   RespProtocol   HTTP/1.1
-   RespStatus     400
-   RespReason     Bad Request
-   ReqAcct        10081 0 10081 28 0 28
-   End

When the total request size exceeds http_req_size, the following output can be found in your VSL:

*   << Session  >> 32793
-   Begin          sess 0 HTTP/1
-   SessOpen       172.21.0.1 60576 http 172.21.0.3 80 1611052643.429084 30
-   SessClose      RX_OVERFLOW 0.001
-   End

HTTP response limit examples

When the origin server returns too many headers and exceeds the http_max_hdr limit, this doesn’t result in an HTTP 400 status, but in an actual HTTP 503.

You might see the following output appear in your VSL:

-- BogoHeader     Too many headers:foo: bar
-- HttpGarbage    "HTTP/1.1%00"
-- BerespStatus   503
-- BerespReason   Service Unavailable
-- FetchError     http format error

And when this happens, the MAIN.losthdr counter in varnishstat will also increase.

When the http_resp_hdr_len limit is exceeded, you will see the following output end up in VSL:

--  BogoHeader     Header too long: Last-Modified: Tue,
--  HttpGarbage    "HTTP/1.1%00"
--  BerespStatus   503
--  BerespReason   Service Unavailable
--  FetchError     http format error

And finally, when the http_resp_size limit is exceeded, the following VSL line may serve as an indicator:

--  FetchError     overflow

Get help

Troubleshooting can be hard sometimes. Although this detailed troubleshooting tutorial can get you a long way, you might need some help figuring out what’s going on.

Luckily there are ways to get help: help from the community and help from Varnish Software’s support department.
