Linux.conf.au 2013 – Day 0 Sunday

So I am off to my 10th Linux.conf.au (every year since 2004), this time in Canberra, Australia.

To get there I flew over to Sydney (leaving 7am Sunday, arriving 8am) and then took the bus down to Canberra (leaving 10:15, arriving 13:30), and then the LCA people organised a bus for us from the bus station to the halls.

I’m staying at John XXIII Hall on campus. As you may guess from the name it is a Catholic college, which means there are photos of the Pope on the condom machines in the toilets. No aircon in the rooms, but that is pretty common; the wired Internet was working though.

Signup was pretty efficient this time around (apart from a big queue at the start): we just handed over a piece of paper with our name on it, they typed our name into the software and printed out our badge.

This year the bags were pre-placed in our rooms, which is a good idea since it speeds up registration. In my bag were a t-shirt and a hygiene pack (with shampoo, conditioner, soap and sunblock).

I went for a walk with Devdas Bhaget to look for some lunch; it was pretty hot so it was hard going. We got a little lost too and were unable to find the pub, so we just ended up getting some snacks for lunch and heading back. Later I went out with a group of people to the pub for dinner.


Links: Parody trailers, Obama fundraising, Japan!, American academic jobs


Links: Exercise, Radio NZ Hosting, Scale in the Cloud, Philip Roth


Speeding up Varnish Cache

TL;DR: Don’t send HTTP requests through more tests than you need to, especially 100+ item regexp comparisons.

At work we run Varnish Cache in front of our websites. It caches web pages so that when people request the same page it gets served quickly out of cache rather than having to be generated by the application each time. Since pages only change every few minutes (at most) and things like pictures barely ever change, we can serve 98-99% of requests out of cache and handle the website with a lot fewer servers.

Last week I noticed that one of our Varnish servers (we have several) was using about 60% of CPU (60% on each core) to serve just 2,600 hits/second. In the past we’ve seen the servers get a little overloaded at around 4,000-5,000 hits/second, while people on the Varnish mailing list report getting over 10,000 hits/second easily. I decided to spend a few hours playing with our Varnish config to see if I could speed things up.

I suspected the problem was with some regexps we had in vcl_recv, which is run for every request received by the cache. But first I set up a test environment to help me trace the problem.

  1. Install Varnish on a test VM ( I’ll call it “server1” ) with our production config
  2. Run varnishncsa on a production box for a while and copy over the logs  ( I copied 1.6 million lines ) to another VM “client1”
  3. Install some http benchmarking software on client1
  4. Both server1 and client1 were single-CPU (single core) VMs, with 1GB of RAM on server1 and 750MB on client1.

I actually found the benchmarking software to be a pain. I tried httperf, ab, and siege and found they all had their limitations. The hardest bit was that we run multiple domains and it was hard to tell a program to “run through this list of URLs and send them all to this IP”, so I ended up just creating about 30 hosts file entries.

After a bit of playing around I ended up using siege for the testing, with the command line “siege -c 500 -d 1” (which generated 1,000 requests/second) and the option “internet = true” (to pick random URLs from the file). I used a list of 100,000 URLs from production. I found that sending this many requests used about 68% of the CPU on server1, while sending more requests or using a larger list of URLs tended to overload client1. For the back-end I just used our production servers.

To test, I ran siege for at least a minute (to get the cache full) and then ran vmstat and varnishstat to get the CPU usage and hit rate once these settled down. The CPU usage jumps around by a couple of percent but the trends are usually obvious.

Now for some actual testing. First I tested our existing config:

Original Config                              CPU: 70%    Hit rate: 98%

Now I switched back to the default Varnish config. The hit rate drops a huge amount since the default config doesn’t do things like remove “?_=1354844750363” from the end of URLs.

Default Config                               CPU: 41%    Hit rate: 86%
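For reference, stripping a cache-buster like that is usually a one-line regsub in vcl_recv. Here is a minimal sketch, not our exact production rule; the “_” parameter name matches the jQuery-style timestamp in the example above:

```
sub vcl_recv {
  # jQuery appends "?_=<timestamp>" to defeat browser caching; stripping
  # it here means all such requests hash to the same cached object.
  set req.url = regsub(req.url, "(\?|&)_=\d+$", "");
}
```

Each rewrite like this still costs a regexp match on every request, which is exactly the cost the rest of this post is about reducing.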

Now I started adding back bits from our config to see how load increased.

Default + vcl-fetch + vcl-miss + vcl-error   CPU: 39%    Hit rate: 83%
Above   + vcl-deliver                        CPU: 44%    Hit rate: 82%
Above   + production backend configuration   CPU: 41%    Hit rate: 82%
Above   + vcl-hit and expire config          CPU: 43%    Hit rate: 82%
Above   + vcl-recv                           CPU: 68%    Hit rate: 98%

So it appears that the only bit of the config that makes a serious difference is the vcl_recv.

Our vcl_recv is 500 lines long and looks a bit like VCLExampleAlex from the Varnish website. It included:

  • 6 separate blocks of “mobile redirection” code. Two of these OR’d together about 120 different brands in the User-Agent header. Three of them applied to production, and there were staging copies of each of them too.
  • About 20 groups of URL cleanup routines, many with “if ( req.http.host == … )” so they applied to only one domain.
  • Several other per-domain routines (which did things like set all requests for some domain to “pass” through the cache).
  • Most of the “if” statements were fuzzy regexp matches like “if ( req.url ~ "utm_" )”.

Overall there were 32 “if” statements that every request went through.

I decided to try and reduce the number of lookups the average request would have to go though. I did this by rearranging the config so that the tests that applied to all requests were first and then I split the rest of the config by domain:

if ( req.url ~ "utm_" ) {
  # regsub only replaces the first match, so run it twice to strip
  # up to two utm_ parameters
  set req.url = regsub(req.url, "\&utm_[^&]+","");
  set req.url = regsub(req.url, "\&utm_[^&]+","");
}
if ( req.http.host == "www.example.com" ) {
  set req.url = regsub(req.url, "&ref=[^\&]+","");
} else if ( req.http.host == "media.example.com" ) {
  # Nothing to do for the media domain
} else if ( req.http.host == "www.example.net" ) {
  return (pass);
} else {
  # Obscure domains
  if ( req.http.host == "staging.example.com" ) {
    return (pass);
  }
}

Specific bits I did included:

  • Put the most popular domains at the top of the config so they would be matched first
  • Put domains that got very few hits into the default “else” rather than wasting a test on their own “else if”
  • The media domain got the 2nd highest number of hits but had no special config, so I gave it its own “empty” routine rather than letting it fall through to the default “else”

So, recalling what I previously had, here is the improvement:

Original Config                              CPU: 70%    Hit rate: 98%
Split vcl-recv by domain                     CPU: 44%    Hit rate: 98%

I then removed the per-rule domain tests since the rules were now within a single test for that domain and got:

Don't check domain in each rule              CPU: 42%    Hit rate: 98%

After some more testing I deployed this in production.

The next step was to update the mobile redirect rules. Instead of evaluating:

if ( ( req.url ~ "/news.cfm" || req.url ~ "/article.cfm" || req.url == "/" )
  && req.http.user-agent ~ ".*(Sagem|SAGEM|Sendo|SonyEricsson|plus another 100 terms

for each request to the domain, I wrapped the following around them:

if ( req.url == "/" || req.url ~ ".cfm" ) {
}

so that only a small percentage of requests would need to be processed by the “Giant Regexp of Doom™”.
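Put together, the reworked mobile block looks something like this. This is a sketch: the URL patterns come from the snippet above, the four brands stand in for the 100+ term regexp, and the redirect itself is elided:

```
if ( req.url == "/" || req.url ~ "\.cfm" ) {
  # Only the homepage and .cfm pages ever redirect to mobile, so the
  # expensive User-Agent regexp now runs for a fraction of requests.
  if ( req.http.user-agent ~ "(Sagem|SAGEM|Sendo|SonyEricsson)" ) {
    # ... issue the mobile-site redirect here ...
  }
}
```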

I tested this and got:

Wrapper around mobile redirects             CPU: 36%    Hit rate: 98%

On an actual production server I got the following with around 600 hits/second:

Original Config                      CPU: 25%
Split by Domains                     CPU: 18%
Split by domains + mobile wrapper    CPU: 11%

So overall a better than 50% reduction in CPU usage.


Links: A/B testing, Road safety, Where IT goes to die, The Cheapest Generation

  • 23 Tips on How to A/B Test Like a Badass – Really great article, mostly applicable to ecommerce sites but plenty of general ideas. I’d like to say I do this at work but unfortunately the culture isn’t there.
  • What an RAF pilot can teach us about being safe on the road – Interesting view on how easy it is for motorists just “not to notice” cyclists. I commonly see cyclists these days with flashing lights (front and back) turned on during the day to help improve their visibility.
  • The Cheapest Generation – How the consumer wants of young adults differ from those of the previous generation(s). Phones and walkable neighbourhoods are in; cars and big houses in the suburbs are no longer as important.
  • Where IT goes to die – How large-company/Enterprise IT works, from an author with experience at smaller companies. He is accurate and it is not very pretty.

Links: PIN numbers, Military leadership, Eggs, The Web stack

  • Analysis of PIN Numbers – What are the most and least common 4-digit PIN numbers?
  • General Failure – Does the US military no longer punish or demote bad generals? “A culture of mediocrity has taken hold within the Army’s leadership rank—if it is not uprooted, the country’s next war is unlikely to unfold any better than the last two.”
  • Why American Eggs Would Be Illegal In A British Supermarket, And Vice Versa – Different approaches to food safety in the US and Britain.
  • An Overview of the Web – What happens when your browser requests a web page. A step-by-step tour through the various layers of computers, protocols and programs. Understandable by someone a little technical.

Sysadmin Miniconf proposals close in 2 days for linux.conf.au 2013

Once again I’m helping to organise the Sysadmin Miniconf at Linux.conf.au. This time we’ll be in Canberra in the last week of January 2013.

This is a big reminder that proposals for presentations at the Miniconf close at the end of October. If you have a proposal you need to submit it now.

Even if you’ve not 100% finalised your idea let us know now and we can work with you. If we don’t know about it then it is very hard for us to accept it.

We have several proposals that have already been accepted but are very keen to get more.

Links: Mars!, The cdbaby model, Robots, the rest of the world on the web


Links: Weather Prediction, Advertising, 3rd world mobile web and Database failover

Squatters hit .kiwi.nz

A couple of days ago the .kiwi.nz second level domain was opened up. Within a day over 1000 domains were registered.

But I was wondering who is registering the domains, so I thought I’d have a quick look through some top brands and domains:

  • telecom.kiwi.nz – Squatter
  • vodafone.kiwi.nz – Squatter
  • 2degrees.kiwi.nz – Squatter
  • google.kiwi.nz – Squatter
  • yahoo.kiwi.nz – Squatter
  • bing.kiwi.nz – Squatter
  • youtube.kiwi.nz – Squatter
  • facebook.kiwi.nz – Squatter
  • trademe.kiwi.nz – Squatter
  • stuff.kiwi.nz – Squatter
  • nzherald.kiwi.nz – Squatter
  • msn.kiwi.nz – Available
  • wikipedia.kiwi.nz – Available
  • asb.kiwi.nz – Squatter
  • bnz.kiwi.nz – Squatter
  • westpac.kiwi.nz – Squatter
  • kiwibank.kiwi.nz – Available
  • nationalbank.kiwi.nz – Squatter
  • tv3.kiwi.nz – Squatter
  • tvnz.kiwi.nz – Squatter
  • sky.kiwi.nz – Legit Owner
  • airnz.kiwi.nz – Legit Owner
  • skykiwi.kiwi.nz – Squatter
  • coke.kiwi.nz – Squatter
  • pepsi.kiwi.nz – Squatter

Several of the domains in the list above (and I assume others) have been registered by the same few people. Overall not a good look, but I assume things will calm down after a few lawyers’ letters and dollars are exchanged.
