Speeding up Varnish Cache

TLDR; Don’t send http requests though more tests than you need to, especially 100+ item regexp comparisons.

At work we run Varnish Cache in front of our websites. It caches webpages pages so that when people request the same page it gets served quickly out of cache rather than having to be generated by the application each time. Since pages only change every few minutes (at most) and things like pictures barely every change we can serve 98-99% of request out of cache and handle the website with a lot fewer servers.

Last week I noticed that one of our varnish servers (we have several) was using about 60% of CPU (60% on each core) to serve just 2600 hits/second. In the past we’ve seen the servers get a little overloaded at around 4-5000 hits/second while people on the varnish mailing list report getting over 10,000 hits/second easily. I decided to spend a few hours playing with our varnish config to see if I could speed things up.

I suspected the problem was with some regexps we had in vcl-recv which is run for every request received by the cache. But first I setup a test enviroment to help me trace the problem.

  1. Install Varnish on a test VM ( I’ll call it “server1″ ) with our production config
  2. Run varnishncsa on a production box for a while and copy over the logs  ( I copied 1.6 million lines ) to another VM “client1″
  3. Install some http benchmarking software on client1
  4. Both server1 and client1 were since CPU ( single core ) with 1GB of RAM on server1 and 750MB on client1.
I actually found the benchmarking software to be a pain. I tried httperf, ab, and siege and found they all had their limitations. The hardest bit was we run multiple domains and it was hard to tell a program to “run their this list of URLs and send them all to this IP” so I ended up just creating about 30 host file entries.

After a bit of playing around I ended up using siege for the testing and using the command line “siege -c 500 -d 1” which generated 1000 requests/second and the options “internet = true” (to pick random urls from the file). I used a list of 100,000 urls from production. I found that sending this many requests used about 68% of the CPU on my server1 while sending more requests or using a larger list of urls tended to overload client1. For the back-end I just used our production servers.

To test I ran siege for at least minute ( to get the cache full ) and then ran vmstat and varnishstat to get the CPU and hitrate once this settled down. The CPU usage jumps around by a couple of percent but the trends are usually obvious.

Now for some actual testing. First I tested our existing config:

Original Config                              CPU: 70%    Hit rate: 98%
Now I switched back to the default varnish config. The hitrate drops a huge amount since our config doesn’t things like remove “?=1354844750363″ at the end of URLs
Default Config                               CPU: 41%    Hit rate: 86%
Now I started adding back bits from our config to see how load increased.
Default + vcl-fetch + vcl-miss + vcl-error   CPU: 39%    Hit rate: 83%
Above   + vcl-deliver                        CPU: 44%    Hit rate: 82%
Above   + production backend configuration   CPU: 41%    Hit rate: 82%
Above   + vcl-hit and expire config          CPU: 43%    Hit rate: 82%
Above   + vcl-recv                           CPU: 68%    Hit rate: 98%
So it appears that the only bit of the config that makes a serious difference is the vcl
recv.

Our vcl-recv is 500 lines long and looks a bit like VCLExampleAlex from the Varnish website. It included:

  • 6 separate blocks of “mobile redirection” code. 2 of these or’d about 120 different brands in User-Agent header. 3 of these applied to production and there were staging copies of each of them too.
  • About 20 groups of URL cleanup routines, many with ” if ( req.http.host == ” so they applied to only one domain.
  • Several other per-domain routines ( which did things like set all request for some domain to “pass” though the cache )
  • Most of the “if” statements were fuzzy regexp matches like ” if ( req.url ~ “utm_” )
Overall there were 32 “if” statements that every request went though.

I decided to try and reduce the number of lookups the average request would have to go though. I did this by rearranging the config so that the tests that applied to all requests were first and then I split the rest of the config by domain:

if ( req.url ~ "utm" ) {
 set req.url = regsub(req.url, "\&utm[^&]+","");
 set req.url = regsub(req.url, "\&utm_[^&]+","");
}
if ( req.http.host == "www.example.com" ) {
 set req.url = regsub(req.url, "&ref=[^\&]+","");
} else if ( req.http.host == "media.example.com" ) {
  # Nothing do do for media domain
} else if ( req.http.host == "www.example.net" ) {
 return (pass);
} else {
 # Obscure domains
 if  ( req.http.host == "staging.example.com" {
 return (pass);
 }
}
Specific bits I did included:
  • Make the most popular domain that the top of the config so they would be matched first
  • Put domains that got very few his into the default “else” rather than wasting them on their own “else if”
  • The media domain got the 2nd highest number of hits but had no special configs so I gave it it’s own “empty” routine rather than letting it fall though to the default “else”
So recalling what I previously had here is the improvement:
Original Config                              CPU: 70%    Hit rate: 98%
Split vcl-recv by domain                     CPU: 44%    Hit rate: 98%
I then removed the per-rule domain tests since the rules were now within a single test for that domain and got:
Don't check domain in each rule              CPU: 42%    Hit rate: 98%
After some more testing I deployed in production

The next step I did was update the mobile redirect rules so that instaed of them going:

if  ( req.url ~ "/news.cfm" || req.url ~ "/article.cfm" || req.url == "/" )
  && req.http.user-agent ~ ".*(Sagem|SAGEM|Sendo|SonyEricsson|plus another 100 terms
for each request to the domain instead wrap the following around them
if ( req.url == "/" || req.url ~ ".cfm" ) {
}
so that only a small percentage of requests would need to be processed by the ” Giant regexp of Doom™ “

I tested this and got:

Wrapper around mobile redirects             CPU: 36%    Hit rate: 98%
On an actual production server I got the following with around 600 hits/second
Original Config                      CPU: 25%
Split by Domains                     CPU: 18%
Split by domains + mobile wrapper    CPU: 11%
So overall a better than 50% reduction in CPU usage.