ThingyMaJig

Thingy Ma Jig is the blog of Nicholas Thompson and contains any useful tips, sites and general blog-stuff which are considered interesting or handy!

Bash script to find most common 404

Posted on 04 March 2010 in programming, linux, How to, geek, Drupal, apache

Following Kevin Hankens' post on why you shouldn't ignore Drupal 404 errors, I decided to go through yesterday's error_log on our live Apache server (the one which hosts www.pponline.co.uk, www.sportbusiness.com and www.mychild.co.uk alongside around 40 other Drupal sites).

It turns out there were almost 5,000 404 (page not found) errors. How to find the most "popular" ones though? This called for a Bash script…

gawk '{ print $13 }' error_log.1  | grep ^/var | sort | uniq -c | sort -n

This uses gawk to parse yesterday's error log (hence the .1) and return column 13; gawk splits fields on whitespace by default. Note: It turns out single quotes and double quotes mean different things here! Inside double quotes the shell would expand $13 itself before gawk ever sees it. Next I want to keep only the lines beginning with "/var"; the gawk also returned values from memory & PHP errors, and the grep drops those. Next, sort them and count the unique lines. Finally, sort this result numerically so the most common entries are at the end.
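To see each stage at work, here is a minimal sketch run against a few invented log lines. It assumes the Apache 2.2-style error_log layout, where the missing path is the 13th whitespace-separated field; the sample entries and the `sample_log` and `top_404s` variable names are made up for illustration, and plain `awk` stands in for gawk purely for portability.

```shell
# Hypothetical sample of Apache 2.2-style error_log lines (invented for illustration).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
[Wed Mar 03 10:12:01 2010] [error] [client 10.0.0.1] File does not exist: /var/www/html/favicon.ico
[Wed Mar 03 10:12:05 2010] [error] [client 10.0.0.2] File does not exist: /var/www/html/favicon.ico
[Wed Mar 03 10:13:40 2010] [error] [client 10.0.0.3] File does not exist: /var/www/html/robots.txt
[Wed Mar 03 10:14:02 2010] [error] [client 10.0.0.1] PHP Notice:  Undefined index: q in /var/www/html/index.php on line 3
EOF

# Same pipeline as in the post, with awk swapped in for gawk.
# Field 13 of the PHP notice line is "q", so grep discards it.
top_404s=$(awk '{ print $13 }' "$sample_log" | grep '^/var' | sort | uniq -c | sort -n)
echo "$top_404s"

rm -f "$sample_log"
```

With this sample, the favicon.ico path (requested twice) should end up on the last line, after the single robots.txt entry.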

The result? SportBusiness REALLY needs a favicon in the default place - that alone accounted for 20% of the 404s!

UPDATE

gunzip -c /var/log/httpd/error_log.3.gz | gawk '{ print $13 }' | grep ^/var | sort | uniq -c | sort -n

The command above is a slight alternative for compressed log files: gunzip -c streams the decompressed log straight into gawk, which saves decompressing the file first.
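If you switch between rotated logs that may or may not be compressed, the two variants can be folded into one helper. This is only a sketch: the function name `top_404s` is hypothetical, it picks the reader purely from the `.gz` file extension, and it again uses plain `awk` in place of gawk for portability.

```shell
# top_404s FILE: hypothetical helper that counts missing paths in a plain
# or gzipped Apache error log. Assumes the path is whitespace field 13.
top_404s() {
  case "$1" in
    *.gz) gunzip -c "$1" ;;   # stream the log without writing a decompressed copy
    *)    cat "$1" ;;         # plain log: read it as-is
  esac | awk '{ print $13 }' | grep '^/var' | sort | uniq -c | sort -n
}

# Usage (paths taken from the commands in the post):
#   top_404s error_log.1
#   top_404s /var/log/httpd/error_log.3.gz
```

Choosing by extension keeps the helper short; a log that is compressed but not named `.gz` would need a different check.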