This is going to be a really fruitful month for me. I completed a couple of long-standing tasks and finally had some time to get back to working on my Ruby gems. After the third version of my Ruby client for the delicious API, it is now the turn of Apache Log Regex.
ApacheLogRegex is designed to be a simple Ruby class to parse Apache log files. It takes an Apache log format and generates a regular expression, which it then uses to parse a line from a log file. The result is a Hash whose keys correspond to the fields defined in the log format.
Take for example the following Apache log entry.
87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527 "-" "Feedreader 3.13 (Powered by Newsbrain)"
You can easily parse it with Apache Log Regex and extract only the information you need.
require "apachelogregex"

# This is the log line you want to parse
line = '87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527 "-" "Feedreader 3.13 (Powered by Newsbrain)"'

# Define the log file format.
# This information is defined in your Apache configuration
# with the LogFormat directive
format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'

# Initialize the parser
parser = ApacheLogRegex.new(format)

# Get the log line as a Hash
parser.parse(line)
# => {"%r"=>"GET /blog/index.xml HTTP/1.1", "%h"=>"87.18.183.252", "%>s"=>"302", "%t"=>"[13/Aug/2008:00:50:49 -0700]", "%{User-Agent}i"=>"Feedreader 3.13 (Powered by Newsbrain)", "%u"=>"-", "%{Referer}i"=>"-", "%b"=>"527", "%l"=>"-"}
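To make the idea of turning a log format into a regular expression more concrete, here is a minimal sketch. It is not the gem's actual implementation: build_log_regexp and parse_log_line are hypothetical names, and the sketch only handles quoted fields, the bracketed %t timestamp and plain space-free fields.

```ruby
# Simplified illustration only; NOT the gem's real code.
# Turns a space-separated Apache log format into a Regexp plus the
# list of directive names, in the order they appear.
def build_log_regexp(format)
  names = []
  pattern = format.split(' ').map do |element|
    quoted = element.start_with?('\"')   # element written as \"...\" in the format
    name = element.gsub('\"', '')        # strip the escaped quotes from the name
    names << name
    if quoted
      '"([^"]*)"'        # quoted field: anything up to the closing quote
    elsif name.end_with?('t')
      '(\[[^\]]+\])'     # timestamp wrapped in square brackets
    else
      '(\S*)'            # plain field: a run of non-space characters
    end
  end.join(' ')
  [Regexp.new("^#{pattern}$"), names]
end

# Returns a Hash keyed by directive, or nil when the line does not match.
def parse_log_line(line, format)
  regexp, names = build_log_regexp(format)
  match = regexp.match(line.chomp)
  match && names.zip(match.captures).to_h
end

format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
line = '87.18.183.252 - - [13/Aug/2008:00:50:49 -0700] "GET /blog/index.xml HTTP/1.1" 302 527 "-" "Feedreader 3.13 (Powered by Newsbrain)"'

entry = parse_log_line(line, format)
puts entry["%r"]   # prints: GET /blog/index.xml HTTP/1.1
```

The sketch does not handle escaped quotes inside quoted fields, which is one reason to rely on the library rather than on a hand-rolled pattern.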
If you want more control over the parser you can use the parse! method. It raises a ParseError exception if the given line doesn’t match the log format.
common_log_format = '%h %l %u %t "%r" %>s %b'
parser = ApacheLogRegex.new(common_log_format)

# No exception
parser.parse(line) # => nil

# Raises an exception
parser.parse!(line) # => ParseError
Instead of parsing one line at a time, you can read an entire log file, feed each line to the parser, and collect the results.
result = File.readlines('/var/apache/access.log').collect do |line|
  parser.parse(line)
end
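Since parse returns nil for lines that do not match the format, the collected array may contain nil entries. As a small sketch of what you could do with it, the snippet below counts responses per status code. The result array here is hypothetical sample data standing in for the one built by the loop above, and status_counts is a made-up helper name.

```ruby
# Hypothetical sample standing in for the array returned by the collect loop;
# nil marks a line that did not match the log format.
result = [
  { "%h" => "87.18.183.252", "%>s" => "302" },
  nil,
  { "%h" => "87.18.183.252", "%>s" => "200" },
  { "%h" => "10.0.0.1", "%>s" => "200" }
]

# Counts parsed entries per HTTP status code, skipping unparsed (nil) lines.
def status_counts(entries)
  counts = Hash.new(0)
  entries.compact.each { |entry| counts[entry["%>s"]] += 1 }
  counts
end

counts = status_counts(result)
counts.each { |status, total| puts "#{status}: #{total}" }
```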
Apache Log Regex is a Ruby port of Peter Hickman’s Apache::LogRegex 1.4 Perl module, available at http://cpan.uwinnipeg.ca/~peterhi/Apache-LogRegex.
You can install the library via RubyGems.
$ gem install apachelogregex
Feel free to email me with any questions or feedback. For the documentation and more details you can visit the ApacheLogRegex project page.
Nice post.
I was looking for something exactly like this
Thanks for the head start.
Very nice gem!
Is there a way to extract the referer from a combined_log string?
combined_log_format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'
(defined in http://httpd.apache.org/docs/2.0/mod/mod_log_config.html#customlog)
For the moment I’m using
line =~ /(.*) "(.*?)" "(.*?)"$/
Also the 02/12/2009 “Added apachelogregex file to simplify GEM usage” change is not in the standard gem you get when calling “gem install apachelogregex” (version 0.1.0)
Sure!
require "apachelogregex"
format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
parser = ApacheLogRegex.new(format)
referers = File.readlines('access.log').collect do |line|
  values = parser.parse!(line)
  values["%{Referer}i"]
end
I’m not sure what you mean. I’ve just run
and I can see the file apachelogregex.rb in the package.
Also, the following code works as expected
require "apachelogregex"
Thanks! Sorry, I was wrong about the gem, works as expected.
btw, this ruby script extracts the google query strings from a combined apache log file:
require "apachelogregex"
require "uri"
require 'cgi'
format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
parser = ApacheLogRegex.new(format)
result = STDIN.readlines.collect do |line|
  values = parser.parse!(line)
  referer = values["%{Referer}i"]
  if referer and referer.include?("google")
    query = URI.parse(referer).query
    puts CGI.parse(query)['q'] if query
  end
end
It might be nice to add constants for the commonly used log format strings… Not sure what I did wrong when I tried to use %{Referer}i, probably some quick oversight.
That’s a nice idea.
http://code.simonecarletti.com/issues/show/184
Feel free to comment on the ticket with more details.
Can I use it to parse Icecast logs? They are pretty much the same as Apache’s, I guess.
It gives me a (ApacheLogRegex::ParseError) …
Any ideas?
Here is a sample line:
186.16.79.248 - - [02/Apr/2009:14:22:09 -0500] "GET /musicas HTTP/1.1" 200 2497349 "http://www.rol.com.py/wimpy2/rave.swf?cachebust=1238699531218" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; InfoPath.2)" 592
Thanks.-
This log line seems to be really similar to an Apache combined log format. The only difference I can see is the 592 token at the end of the string. In the combined log format the User Agent closes the line.
What does this element refer to? You can try to pass a custom log format such as
format = '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %w'
Here %w can be anything else, but be sure you are not using an existing log directive unless the 592 matches one of the elements described at http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
Hi,
I think that this library is wonderful and very convenient.
But I might have found one problem.
When the log format is NCSA extended/combined,
%u (the userid of the person requesting the document)
always causes a “parse error”.
I think that the cause of the problem is here.
http://code.simonecarletti.com/repositories/entry/apachelogregex/lib/apache_log_regex.rb#L167
Not
when element == '%U'
but
when element == '%u'
I might be wrong… but after I changed it, there was no error.
thanks.
Thank you for your report.
I filed a new issue.
http://github.com/weppos/apachelogregex/issues/#issue/2
Great library! Was a life saver dealing with a recent performance issue, where I needed to crunch the logs to find the culprit. Thanks!
Hello. I’m trying to parse an Nginx log file with your gem.
The log line looks like this:
194.244.230.4 - http://www.iliveinperego.com [25/Mar/2010:00:00:22 +0100] "GET /ultimo.html HTTP/1.1" 200 873 "http://www.facebook.com/home.php?" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; it; rv:1.9.2)" "151.32.14.154" [172.16.23.11:80]
My log_format is
format = '%h - %v %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%R\" \[%BA\]'
I’ve just used BA as random code, I don’t need that value.
Everything works, but there are cases where the parser fails:
- No remote address: [-] instead of 151.32.14.154 in the example above
- Multiple values for the last field: [172.16.23.11:80, 172.16.23.12:80, 172.16.23.13:80] instead of [172.16.23.11:80]
Is there a way to tell the logger to forget everything after a certain point?
thanks
–
Simone
Just wanted to say thanks for this! We’re using it to parse nginx logs, whose default format is identical to Apache. Cheers! :)
Thanks! Using this to parse nginx log – works great.