Speeding up Varnish Cache

TLDR; Don’t send HTTP requests through more tests than you need to, especially 100+ item regexp comparisons.

At work we run Varnish Cache in front of our websites. It caches web pages so that when people request the same page it gets served quickly out of cache rather than having to be generated by the application each time. Since pages only change every few minutes (at most) and things like pictures barely ever change, we can serve 98-99% of requests out of cache and handle the website with a lot fewer servers.

Last week I noticed that one of our varnish servers (we have several) was using about 60% of CPU (60% on each core) to serve just 2600 hits/second. In the past we’ve seen the servers get a little overloaded at around 4-5000 hits/second while people on the varnish mailing list report getting over 10,000 hits/second easily. I decided to spend a few hours playing with our varnish config to see if I could speed things up.

I suspected the problem was with some regexps we had in vcl_recv, which is run for every request received by the cache. But first I set up a test environment to help me trace the problem.

  1. Install Varnish on a test VM ( I’ll call it “server1” ) with our production config
  2. Run varnishncsa on a production box for a while and copy over the logs  ( I copied 1.6 million lines ) to another VM “client1”
  3. Install some http benchmarking software on client1
  4. Both server1 and client1 were single CPU ( single core ) with 1GB of RAM on server1 and 750MB on client1.

I actually found the benchmarking software to be a pain. I tried httperf, ab, and siege and found they all had their limitations. The hardest bit was that we run multiple domains and it was hard to tell a program to “run through this list of URLs and send them all to this IP”, so I ended up just creating about 30 hosts file entries.

After a bit of playing around I ended up using siege for the testing and using the command line “siege -c 500 -d 1” which generated 1000 requests/second and the options “internet = true” (to pick random urls from the file). I used a list of 100,000 urls from production. I found that sending this many requests used about 68% of the CPU on my server1 while sending more requests or using a larger list of urls tended to overload client1. For the back-end I just used our production servers.

To test I ran siege for at least a minute ( to get the cache full ) and then ran vmstat and varnishstat to get the CPU and hit rate once this settled down. The CPU usage jumps around by a couple of percent but the trends are usually obvious.

Now for some actual testing. First I tested our existing config:

Original Config                              CPU: 70%    Hit rate: 98%

Now I switched back to the default varnish config. The hit rate drops a huge amount since, unlike our config, the default doesn’t do things like remove “?_=1354844750363” from the end of URLs.

Default Config                               CPU: 41%    Hit rate: 86%
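
To illustrate the kind of cleanup that the default config lacks, a rule of roughly this shape strips a trailing “_=digits” cache-busting parameter so that otherwise identical URLs share one cache entry. This is a minimal sketch and not the production rule: the regexp is an assumption and it only handles the parameter when it is the last one in the URL.

```vcl
sub vcl_recv {
  # Assumed pattern: "_=<digits>" as the final query parameter,
  # e.g. /page?_=1354844750363 or /page?a=b&_=1354844750363
  if (req.url ~ "[?&]_=[0-9]+$") {
    # regsub() replaces the first (here, only possible) match
    set req.url = regsub(req.url, "[?&]_=[0-9]+$", "");
  }
}
```

With this in place /page?_=1354844750363 is looked up as /page, which is why this sort of rule has such a large effect on the hit rate.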

Now I started adding back bits from our config to see how load increased.

Default + vcl-fetch + vcl-miss + vcl-error   CPU: 39%    Hit rate: 83%
Above   + vcl-deliver                        CPU: 44%    Hit rate: 82%
Above   + production backend configuration   CPU: 41%    Hit rate: 82%
Above   + vcl-hit and expire config          CPU: 43%    Hit rate: 82%
Above   + vcl-recv                           CPU: 68%    Hit rate: 98%

So it appears that the only bit of the config that makes a serious difference is the vcl_recv.

Our vcl-recv is 500 lines long and looks a bit like VCLExampleAlex from the Varnish website. It included:

  • 6 separate blocks of “mobile redirection” code. 2 of these or’d about 120 different brands in the User-Agent header. 3 of these applied to production and there were staging copies of each of them too.
  • About 20 groups of URL cleanup routines, many starting with “if ( req.http.host ==” so they applied to only one domain.
  • Several other per-domain routines ( which did things like set all requests for some domain to “pass” through the cache )
  • Most of the “if” statements were fuzzy regexp matches like “if ( req.url ~ "utm_" )”

Overall there were 32 “if” statements that every request went through.

I decided to try and reduce the number of tests the average request would have to go through. I did this by rearranging the config so that the tests that applied to all requests came first, and then I split the rest of the config by domain:

if ( req.url ~ "utm_" ) {
  # regsub() replaces only the first match, so running it twice
  # strips up to two utm_ parameters
  set req.url = regsub(req.url, "\&utm_[^&]+","");
  set req.url = regsub(req.url, "\&utm_[^&]+","");
}
if ( req.http.host == "www.example.com" ) {
  set req.url = regsub(req.url, "&ref=[^\&]+","");
} else if ( req.http.host == "media.example.com" ) {
  # Nothing to do for media domain
} else if ( req.http.host == "www.example.net" ) {
  return (pass);
} else {
  # Obscure domains
  if ( req.http.host == "staging.example.com" ) {
    return (pass);
  }
}

Specific bits I did included:

  • Put the most popular domains at the top of the config so they would be matched first
  • Put domains that got very few hits into the default “else” rather than giving each of them its own “else if”
  • The media domain got the 2nd highest number of hits but had no special config, so I gave it its own “empty” routine rather than letting it fall through to the default “else”

So recalling what I previously had here is the improvement:

Original Config                              CPU: 70%    Hit rate: 98%
Split vcl-recv by domain                     CPU: 44%    Hit rate: 98%

I then removed the per-rule domain tests since the rules were now within a single test for that domain and got:

Don't check domain in each rule              CPU: 42%    Hit rate: 98%
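
As a rough before-and-after of what removing the per-rule domain tests looks like, here is a sketch built around the “ref=” cleanup rule from the example config. The exact shape of the original production rules is an assumption; only the idea of dropping the repeated host comparison comes from the testing above.

```vcl
sub vcl_recv {
  # Before (assumed shape): every rule repeated the host test
  #
  #   if (req.http.host == "www.example.com" && req.url ~ "&ref=") {
  #     set req.url = regsub(req.url, "&ref=[^&]+", "");
  #   }

  # After: the host is compared once, and the rules inside the
  # block only need to test the URL
  if (req.http.host == "www.example.com") {
    if (req.url ~ "&ref=") {
      set req.url = regsub(req.url, "&ref=[^&]+", "");
    }
  }
}
```

The saving per rule is small, which fits with the 2% change measured here compared with the much larger gain from splitting by domain in the first place.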

After some more testing I deployed this in production.

The next step was to update the mobile redirect rules so that instead of them going:

if ( ( req.url ~ "/news.cfm" || req.url ~ "/article.cfm" || req.url == "/" )
  && req.http.user-agent ~ ".*(Sagem|SAGEM|Sendo|SonyEricsson|plus another 100 terms

for each request to the domain, they are instead wrapped in the following:

if ( req.url == "/" || req.url ~ "\.cfm" ) {
  # existing mobile redirect rules go here
}

so that only a small percentage of requests would need to be processed by the “Giant regexp of Doom™”.
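
Put together, the wrapped rule might look something like the sketch below. It makes several assumptions: the brand list contains only the names quoted above (the real list has 100+ terms), the domain is the example one, and the X-Mobile-Candidate header is a hypothetical stand-in for whatever redirect action the production rules actually take.

```vcl
sub vcl_recv {
  if (req.http.host == "www.example.com") {
    # Cheap tests first: only the front page and .cfm pages
    # are ever candidates for a mobile redirect
    if (req.url == "/" || req.url ~ "\.cfm") {
      # Expensive test second: the long User-Agent alternation is
      # now reached by only a small share of requests
      if (req.http.User-Agent ~ "(Sagem|SAGEM|Sendo|SonyEricsson)") {
        # Hypothetical marker; the real redirect logic would go here
        set req.http.X-Mobile-Candidate = "1";
      }
    }
  }
}
```

The point of the nesting is ordering: requests for images, CSS and other static files fail the cheap URL test and never touch the User-Agent regexp at all.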

I tested this and got:

Wrapper around mobile redirects             CPU: 36%    Hit rate: 98%

On an actual production server I got the following at around 600 hits/second:

Original Config                      CPU: 25%
Split by Domains                     CPU: 18%
Split by domains + mobile wrapper    CPU: 11%

So overall CPU usage dropped from 25% to 11%, a better than 50% reduction.


Links: A/B testing, Road safety, Where IT goes to die, The Cheapest Generation

  • 23 Tips on How to A/B Test Like a Badass – Really great article, mostly applicable to ecommerce sites but plenty of general ideas. I’d like to say I do this at work but unfortunately the culture isn’t there.
  • What an RAF pilot can teach us about being safe on the road – Interesting view on how easy it is for motorists just “not to notice” cyclists. I commonly see cyclists these days with flashing lights (front and back) turned on during the day to help improve their visibility.
  • The Cheapest Generation – How the consumer wants of young adults differ from those of the previous generation(s). Phones & walkable neighbourhoods are in; cars and big houses in the suburbs are no longer as important.
  • Where IT goes to die – How large company/Enterprise IT works, from an author with experience at smaller companies. He is accurate and it is not very pretty.

Links: PIN numbers, Military leadership, Eggs, The Web stack

  • Analysis of PIN Numbers – What are the most and least common 4-digit PIN numbers?
  • General Failure – Does the US military no longer punish or demote bad generals? “A culture of mediocrity has taken hold within the Army’s leadership rank—if it is not uprooted, the country’s next war is unlikely to unfold any better than the last two.”
  • Why American Eggs Would Be Illegal In A British Supermarket, And Vice Versa – Different approaches to food safety in the US and Britain.
  • An Overview of the Web – What happens when your browser requests a web page. A step by step walk through the various layers of computers, protocols and programs. Understandable by someone a little technical.

Sysadmin Miniconf proposals close in 2 days for linux.conf.au 2013

Once again I’m helping to organise the Sysadmin Miniconf at Linux.conf.au. This time we’ll be in Canberra in the last week of January 2013.

This is a big reminder that proposals for presentations at the Miniconf close at the end of October. If you have a proposal you need to submit it now.

Even if you’ve not 100% finalised your idea let us know now and we can work with you. If we don’t know about it then it is very hard for us to accept it.

We have several proposals that have already been accepted but are very keen to get more.

 


Links: Mars!, The cdbaby model, Robots, the rest of the world on the web


Links: Weather Prediction, Advertising, 3rd world mobile web and Database failover

 

 


Squatters hit .kiwi.nz

A couple of days ago the .kiwi.nz second level domain was opened up. Within a day over 1000 domains were registered.

But I was wondering who is registering the domains, so I thought I’d have a quick look through some top brands and domains:

  • telecom.kiwi.nz – Squatter
  • vodafone.kiwi.nz – Squatter
  • 2degrees.kiwi.nz – Squatter
  • google.kiwi.nz – Squatter
  • yahoo.kiwi.nz – Squatter
  • bing.kiwi.nz – Squatter
  • youtube.kiwi.nz – Squatter
  • facebook.kiwi.nz – Squatter
  • trademe.kiwi.nz – Squatter
  • stuff.kiwi.nz – Squatter
  • nzherald.kiwi.nz – Squatter
  • msn.kiwi.nz – Available
  • wikipedia.kiwi.nz – Available
  • asb.kiwi.nz – Squatter
  • bnz.kiwi.nz – Squatter
  • westpac.kiwi.nz – Squatter
  • kiwibank.kiwi.nz – Available
  • nationalbank.kiwi.nz – Squatter
  • tv3.kiwi.nz – Squatter
  • tvnz.kiwi.nz – Squatter
  • sky.kiwi.nz – Legit Owner
  • airnz.kiwi.nz – Legit Owner
  • skykiwi.kiwi.nz – Squatter
  • coke.kiwi.nz – Squatter
  • pepsi.kiwi.nz – Squatter

Several in the list above (and I assume other domains) have been registered by the same few people. Overall not a good look, but I assume things will calm down after a few lawyers’ letters and dollars are exchanged.


Where the NZ eyeballs are

So I was wondering what the market share of New Zealand ISPs is these days. Do Telecom and TelstraClear still completely dominate the market, or have the smaller ISPs caught up?

Earlier this week I grabbed a sample of the weblogs from a large New Zealand website and checked to see which networks the readers came from.

Using the tools at Team Cymru I looked up the origin ASN ( roughly the ISP ) for each network. This enabled me to work out what percentage of the traffic came from which ISPs.

  • I’ve included the data from 2 times below, a daytime one for the “business traffic” and an evening sample for “home traffic”
  • Data during both periods is over 20Mb/s and from a general interest New Zealand website (anonymous as condition of releasing the data)
  • Only requests that originate from New Zealand are included.
  • There may be a bias of traffic towards Auckland
  • Note that only the largest sites will own their own networks and AS and run BGP. Thus sometimes very large companies and other organisations will be included as part of their ISP’s total.
  • Stats below are sorted by bandwidth used

I was interested to see how much Telecom (including the Netgate brand which is probably Chorus) still dominates. I was actually surprised to see that Telecom dominates the home market much more than the business market whereas I would have expected the other way around.

Between Noon and 1pm on Thursday the 6th of Sept.

Percent  ASN     ASN Description
36.82    4771    Telecom New Zealand Ltd.
15.10    4768    TelstraClear Ltd
7.41     4648    Netgate
5.60     17746   Orcon Internet
3.03     7657    Vodafone NZ Ltd.
2.69     9889    Maxnet / Vocus
2.33     9503    FX Networks Limited
1.91     38793   NZCOMMS. Mobile phone Company. New Zealand
1.90     9790    CallPlus Services Limited
1.73     10022   Internet access for Datacom Systems Auckland
1.41     23655   Snap Internet Limited
1.36     9431    The University of Auckland
1.28     9325    Telecom XTRA, Auckland
1.05     4770    ICONZ Ltd
0.92     17492   Vector Communications LTD.,
0.88     24183   DTS LTD
0.79     17435   WorldxChange Communications LTD
0.73     18353   Revera NZ Limited
0.71     9245    COMPASS NZ
0.66     18021   Unisys NZ, IT Outsourcer,
0.56     55454   Orcon Internet Ltd
0.48     9872    ITNet Ltd
0.41     17412   Woosh Wireless
0.39     45946   Air New Zealand Limited
0.36     9432    University of Canterbury
0.36     38305   The University of Otago
0.34     2570    Telecom New Zealand Ltd
0.32     9303    KC Computer Service Ltd.,
0.30     24324   Kordia Limited
0.29     17649   DMZGlobal Ltd
0.28     55872   BayCity Communications Limited
0.26     9345    Paradise Net
0.25     23838   Solarix Networks Limited
0.25     17705   InSPire Net Ltd
0.25     10200   Web hosting provider and ISP connectivity.
0.24     9876    Airnet
0.23     24318   Ministry of Education
0.23     17663   Housing New Zealand Corporation Internet
0.21     2687    AT&T Global Network Services - AP

Between 8pm and 10pm on Wednesday the 5th of Sept.

Percent  ASN     ASN Description
62.71    4771    Telecom New Zealand Ltd.
12.97    4768    TelstraClear Ltd
6.34     17746   Orcon Internet
2.89     7657    Vodafone NZ Ltd.
2.74     9790    CallPlus Services Limited
1.47     17435   WorldxChange Communications LTD
1.26     17412   Woosh Wireless
0.96     38793   NZCOMMS. Mobile phone Company. New Zealand
0.92     23655   Snap Internet Limited
0.81     4648    Netgate
0.63     9889    Maxnet / Vocus
0.49     55872   BayCity Communications Limited
0.46     4770    ICONZ Ltd
0.42     17705   InSPire Net Ltd
0.38     17994   Appserv Limited
0.37     9872    ITNet Ltd
0.35     45267   Lightwire LTD
0.32     9325    Telecom XTRA, Auckland
0.27     9245    COMPASS NZ
0.25     45946   Air New Zealand Limited
0.20     10200   Web hosting provider and ISP connectivity.
0.17     9303    KC Computer Service Ltd.,
0.16     9345    Paradise Net
0.16     45177   Layer2.co.nz
0.16     38305   The University of Otago
0.15     23735   Enternet Online Ltd
0.13     9431    The University of Auckland
0.13     24324   Kordia Limited
0.11     17492   Vector Communications LTD.,
0.10     9876    Airnet
0.10     55853   Megatel

Links: Density done well, schizophrenia, Olympics and Socialist networks


Links: 8bit T2, IQ, Baby names, Low Power computers
