Apache Log Regex: a lightweight Ruby Apache log parser

February 14th, 2009 at 9:03 am • 13 comments

This is going to be a really fruitful month for me. I completed a couple of long-standing activities and finally had some time to get back to working on my Ruby gems. After the third version of my Ruby client for the delicious API, it is now the turn of Apache Log Regex.

ApacheLogRegex is designed to be a simple Ruby class to parse Apache log files. It takes an Apache logging format and generates a regular expression, which is then used to parse a line from a log file. The result is a Hash with keys corresponding to the fields defined in the log format.
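
To make the idea concrete, here is a simplified sketch of how a log format can be turned into a regular expression. It is not the gem's actual code: the method names build_log_regexp and parse_line are made up for this illustration, and the patterns only cover the simplest cases (for example, quoted fields that contain escaped quotes are not handled).

```ruby
# Simplified sketch of the format-to-regexp idea; NOT the gem's actual code.
# build_log_regexp and parse_line are hypothetical names used for illustration.

# Turns a LogFormat string into [regexp, field_names].
def build_log_regexp(format)
  names = []
  patterns = format.split(' ').map do |element|
    # Elements wrapped in escaped quotes (\"...\") are logged between double quotes.
    quoted = element.start_with?('\"') && element.end_with?('\"')
    element = element[2..-3] if quoted
    names << element

    if quoted
      '"([^"]*)"'        # anything up to the closing quote
    elsif element == '%t'
      '(\[[^\]]+\])'     # timestamp wrapped in square brackets
    else
      '(\S*)'            # a single token without whitespace
    end
  end
  [Regexp.new("\\A#{patterns.join(' ')}\\z"), names]
end

# Returns a Hash of field name => value, or nil when the line does not match.
def parse_line(line, regexp, names)
  match = regexp.match(line.chomp)
  match && names.zip(match.captures).to_h
end

format = '%h %l %u %t \"%r\" %>s %b'
line   = '87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527'

regexp, names = build_log_regexp(format)
entry = parse_line(line, regexp, names)
puts entry['%h']   # prints 87.18.183.252
puts entry['%>s']  # prints 302
```

Each format element becomes one capture group, and the element itself is reused as the Hash key, which is why the keys look like "%h" or "%>s".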

Take for example the following Apache log entry.

87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527 "-" "Feedreader 3.13 (Powered by Newsbrain)"

You can easily parse it with Apache Log Regex and extract only the information you need.

# This is the log line you want to parse
line = '87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527 "-" "Feedreader 3.13 (Powered by Newsbrain)"'

# Define the log file format.
# This information is defined in your Apache configuration
# with the LogFormat directive
format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'

# Initialize the parser
parser = ApacheLogRegex.new(format)

# Get the log line as a Hash
parser.parse(line)
# => {"%r"=>"GET /blog/index.xml HTTP/1.1", "%h"=>"87.18.183.252", "%>s"=>"302", "%t"=>"[13/Aug/2008:00:50:49 -0700]", "%{User-Agent}i"=>"Feedreader 3.13 (Powered by Newsbrain)", "%u"=>"-", "%{Referer}i"=>"-", "%b"=>"527", "%l"=>"-"}

If you want more control over the parser you can use the parse! method. Where parse returns nil, parse! raises a ParseError exception if the given line doesn’t match the log format.

common_log_format = '%h %l %u %t "%r" %>s %b'
parser = ApacheLogRegex.new(common_log_format)

# No exception
parser.parse(line) # => nil

# Raises an exception
parser.parse!(line) # => ParseError

Instead of parsing one line at a time, you can read an entire log file, feed each line to the parser and collect the results.

result = File.readlines('/var/apache/access.log').collect do |line|
  parser.parse(line)
end
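
For large log files you may prefer not to load everything into memory, and to drop the lines that don't match instead of keeping nil entries. The following is only a sketch: parse_log is a hypothetical helper, and it assumes nothing about the parser except that it responds to parse and returns a Hash or nil, as shown above.

```ruby
# Hypothetical helper, not part of the gem.
# Reads +io+ line by line and returns [entries, skipped], where +entries+ holds
# the Hashes returned by parser.parse and +skipped+ counts the lines that
# did not match (parse returned nil).
def parse_log(io, parser)
  entries = []
  skipped = 0

  io.each_line do |line|
    values = parser.parse(line.chomp)
    if values
      entries << values
    else
      skipped += 1
    end
  end

  [entries, skipped]
end

# Usage, assuming `parser` was created as in the examples above:
#
#   entries, skipped = File.open('/var/apache/access.log') do |file|
#     parse_log(file, parser)
#   end
```

Passing an IO object instead of a file name keeps the helper easy to try out with STDIN or a StringIO.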

Apache Log Regex is a Ruby port of Peter Hickman’s Apache::LogRegex 1.4 Perl module, available at http://cpan.uwinnipeg.ca/~peterhi/Apache-LogRegex.

You can install the library via RubyGems.

$ gem install apachelogregex

Feel free to email me with any questions or feedback. For the documentation and more details you can visit the ApacheLogRegex project page.


Filed in Programming

Comments

Abhijat says:

Nice post.
I was looking for something exactly like this
Thanks for the head start.

Ralf Ebert says:

Very nice gem!

Is there a way to extract the referer from a combined_log string?
combined_log_format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'
(defined in http://httpd.apache.org/docs/2.0/mod/mod_log_config.html#customlog)

For the moment I’m using
line =~ /(.*) "(.*?)" "(.*?)"$/

Also the 02/12/2009 “Added apachelogregex file to simplify GEM usage” change is not in the standard gem you get when calling “gem install apachelogregex” (version 0.1.0)

Simone says:

Is there a way to extract the referer from a combined_log string?
combined_log_format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'

Sure!

require "rubygems"
require "apachelogregex"

format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
parser = ApacheLogRegex.new(format)

referers = File.readlines('access.log').collect do |line|
  values = parser.parse!(line)
  values["%{Referer}i"]
end

Also the 02/12/2009 “Added apachelogregex file to simplify GEM usage” change is not in the standard gem you get when calling “gem install apachelogregex” (version 0.1.0)

I’m not sure what you mean. I’ve just run

$ sudo gem install apachelogregex

and I can see the file apachelogregex.rb in the package.
Also, the following code works as expected

require "rubygems"
require "apachelogregex"

Ralf Ebert says:

Thanks! Sorry, I was wrong about the gem, works as expected.

btw, this ruby script extracts the google query strings from a combined apache log file:

require "rubygems"
require "apachelogregex"
require "uri"
require 'cgi'

format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
parser = ApacheLogRegex.new(format)

# Print the search terms of every request that was referred by Google
STDIN.each_line do |line|
  values = parser.parse!(line)
  referer = values["%{Referer}i"]
  if referer and referer.include?("google")
    query = URI.parse(referer).query
    puts CGI.parse(query)['q'] if query
  end
end

Ralf Ebert says:

It might be nice to add constants for the commonly used log format strings… Not sure what I did wrong when I tried to use %{Referer}i, probably some quick oversight.

Simone says:

That’s a nice idea.
http://code.simonecarletti.com/issues/show/184

Feel free to comment on the ticket with more details.

Bruno says:

Can I use it to parse Icecast logs? They are pretty much the same as Apache’s, I guess.
It gives me a (ApacheLogRegex::ParseError) …

Any ideas?

Bruno says:

Here is a sample line:
186.16.79.248 - - [02/Apr/2009:14:22:09 -0500] "GET /musicas HTTP/1.1" 200 2497349 "http://www.rol.com.py/wimpy2/rave.swf?cachebust=1238699531218" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.2)" 592

Thanks.-

Simone says:

This log line seems to be really similar to an Apache combined log format. The only difference I can see is the 592 token at the end of the string. In the combined log format the User Agent closes the line.

What does this element refer to? You can try to pass a custom log format such as

format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %w'

%w can be anything else, but be sure you are not using an existing log directive unless the 592 matches one of the elements described at http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
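
To see why one extra placeholder is enough, here is a plain-regexp sketch that doesn't use the gem at all: the combined log format followed by a single trailing token. The method names and the field labels (including "extra") are invented for this example, and the sketch makes no claim about what the trailing value means.

```ruby
# Plain-regexp sketch, independent of the gem.
# Matches the combined log format followed by one extra trailing token.
def combined_plus_extra_regexp
  token  = '(\S+)'          # a single token without whitespace
  quoted = '"([^"]*)"'      # a field wrapped in double quotes
  time   = '\[([^\]]+)\]'   # timestamp wrapped in square brackets
  parts  = [token, token, token, time, quoted, token, token, quoted, quoted, token]
  Regexp.new('\A' + parts.join(' ') + '\z')
end

# Returns a Hash of label => value, or nil when the line does not match.
# The labels are arbitrary names chosen for this example.
def parse_with_extra(line)
  fields = %w[host logname user time request status bytes referer user_agent extra]
  match = combined_plus_extra_regexp.match(line.chomp)
  match && fields.zip(match.captures).to_h
end

line = '186.16.79.248 - - [02/Apr/2009:14:22:09 -0500] "GET /musicas HTTP/1.1" 200 2497349 "http://www.rol.com.py/wimpy2/rave.swf?cachebust=1238699531218" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.2)" 592'
entry = parse_with_extra(line)
puts entry['extra']   # prints 592
puts entry['status']  # prints 200
```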

toro says:

Hi,
I think that this library is wonderful and very convenient.
But I might have found one problem.

when log format is NCSA extended/combined,
%u ( the userid of the person requesting the document)
always causes the “parse error”.

I think that the cause of the problem is here.

http://code.simonecarletti.com/repositories/entry/apachelogregex/lib/apache_log_regex.rb#L167

Not
when element == '%U'
but
when element == '%u'

I might be wrong… but after I changed it, the error went away.
thanks.

Patrick May says:

Great library! Was a life saver dealing with a recent performance issue, where I needed to crunch the logs to find the culprit. Thanks!

Hpatoio says:

Hello. I’m trying to parse an Nginx log file with your gem.

The log line looks like this:

194.244.230.4 - http://www.iliveinperego.com [25/Mar/2010:00:00:22 +0100] "GET /ultimo.html HTTP/1.1" 200 873 "http://www.facebook.com/home.php?" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; it; rv:1.9.2)" "151.32.14.154" [172.16.23.11:80]

My log_format is

format = '%h - %v %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%R\" \[%BA\]'

I’ve just used BA as random code, I don’t need that value.

Everything works, but there are cases where the parser fails:

- No remote address: [-] instead of 151.32.14.154 in the example above

- Multiple values for the last field: [172.16.23.11:80, 172.16.23.12:80, 172.16.23.13:80] instead of [172.16.23.11:80]

Is there a way to tell the parser to ignore everything after a certain point?

thanks

