Apache RDF PipeLogger Agent Beta 1 Released

I have released the beta 1 version of my RDF PipeLogger on GitHub (https://github.com/ebremer/PipeLogger) under the BSD 3-Clause License.  The program is still being tested but appears to be stable enough for a beta release.  The project was developed in NetBeans 7.3 using JDK 1.7.0_17.

My test server currently uses the following LogFormat configuration, with potential simplifications noted in the individual line comments:

LogFormat "[\
a http:Request;\
:referrer \"%{Referer}i\";\
:useragent \"%{User-agent}i\";\
:remotehost \"%h\";\              #will be listed as an ip if resolve hosts is not configured on Apache
:remotehostip \"%a\";\
:remoteuser \"%u\";\               # typically blank if http authentication is not require.  If authentication is never required, leave this line out.
time:inXSDDateTime \"%{%Y-%m-%dT%T%z}t\";\
:canonicalservername \"%v\";\   # this would be the "true name" of the server/virtual host.
:querystring \"%q\";\
:numKArequests %k;\
http:httpVersion \"%H\";\
http:methodName \"%m\";\
:port %p;\          # here for completeness, but can be removed since the server is usually listening only on port 80 or 443
:requestsize %I;\
http:absoluteURI <http://%{Host}i%U>;\   # this triple would be the minimum; the next two can be derived from this one
http:absolutePath \"%U\";\
http:authority <http://%{Host}i>\
] http:resp [\
http:statusCodeValue %>s;\
:requesttime %D;\
:responsesizenh %B;\
:responsesize %O;\
:connectionstatus \"%X\"\
] ." log2rdf

The CustomLog directive on the test server is:

CustomLog "|/usr/java/default/bin/java -jar /mnt/exodus/disk1/log/PipeLogger.jar /mnt/exodus/disk1/log/config.ttl > /mnt/exodus/disk1/log/errors" log2rdf

The configuration file for PipeLogger is a Turtle-formatted file given as the first parameter to the program.  Any output from PipeLogger is redirected to a file.  The upload URI for Virtuoso takes the form http://serverdnsname:8890/sparql-graph-crud, with an authentication-based version at http://serverdnsname:8890/sparql-graph-crud-auth (see http://docs.openlinksw.com/virtuoso/rdfsparql.html#rdfsparqlprotocolendpoint).
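These endpoints implement the SPARQL 1.1 Graph Store HTTP Protocol, so a flush from PipeLogger amounts to an HTTP POST of the accumulated Turtle into a named graph.  Roughly, with an illustrative target graph name:

POST /sparql-graph-crud?graph-uri=http://example.org/httplogs HTTP/1.1
Host: serverdnsname:8890
Content-Type: text/turtle

@prefix : <http://example.org/log#> .
...accumulated triples...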

General Operation of PipeLogger
Turtle-formatted data is piped from Apache to PipeLogger without prefixes.  Prefixes could be specified in the httpd.conf file, but prepending them in PipeLogger saves a bit of bandwidth.  PipeLogger adds the RDF prefixes and accumulates triples until the buffer size is reached, then attempts to flush the data to the quad store.  If the quad store is unavailable, triples continue to accumulate until an absolute maximum is reached.  If, at that point, the quad store does not respond, the data is written to a file.  Later, when the quad store responds, the data is loaded back into memory and flushed to the quad store.  A timer attempts to flush the memory buffer at configured intervals in case site traffic is low.
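As a sketch of the configuration this behavior implies, a config.ttl would carry roughly the following information.  The property names here are placeholders, not necessarily the ones the program actually reads; see the repository for the real vocabulary:

@prefix cfg: <http://example.org/pipelogger/config#> .   # placeholder config vocabulary

[] a cfg:Configuration ;
   cfg:endpoint <http://serverdnsname:8890/sparql-graph-crud> ;   # Virtuoso upload URI
   cfg:graph <http://example.org/httplogs> ;                      # target named graph
   cfg:bufferSize 1000 ;           # triples to accumulate before a flush is attempted
   cfg:maxBufferSize 100000 ;      # absolute maximum before spilling to disk
   cfg:spillFile "/mnt/exodus/disk1/log/spill.ttl" ;              # overflow file
   cfg:flushInterval 60 .          # timer interval (seconds) for low-traffic flushes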

Discussion
Converting HTTP logs to RDF and aggregating them in a central quad store lets you use the powerful SPARQL query language to analyze your logs.  But why not just use the popular Google Analytics?  Google Analytics is excellent for well-behaved traffic, but because of how it works, it is not so good with misbehaved or hostile traffic.  How so?  Google Analytics works by adding a small snippet of JavaScript, tagged with a unique identifier for the site, to every page on that site.  When a client downloads a page containing the JavaScript, the script executes and sends data back to the Google Analytics servers.  A hostile client, one that is probing the web site for weaknesses, is under no obligation to execute that JavaScript, and it will not receive any JavaScript at all if the requested page does not exist.  For example, suppose a hacker is checking whether a site exposes a typical WordPress-specific special function page.  If the page does not exist, the site is not running WordPress, and Google Analytics will never see the attempt.  This information can only be obtained by looking at the actual server logs.  In a nutshell, for security information we would like to see every log entry whose HTTP response code is not 200.
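For instance, assuming the prefix bindings sketched earlier, a SPARQL query along these lines would pull back every logged request whose response code is not 200:

PREFIX :     <http://example.org/log#>
PREFIX http: <http://www.w3.org/2011/http#>

SELECT ?ip ?path ?code
WHERE {
  ?req a http:Request ;
       :remotehostip ?ip ;
       http:absolutePath ?path ;
       http:resp [ http:statusCodeValue ?code ] .
  FILTER (?code != 200)
}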

Future Work
This analysis would be made more robust with the following:

1) RDF geo-located IP address data
2) RDF IP net block ownership data

These two datasets would make it possible to determine where in the world connections are coming from and who a particular IP belongs to, or at least which ISP is responsible for it.
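With those datasets loaded into the same quad store, a query like the following sketch could, for example, rank non-200 traffic by country.  The geo properties here are hypothetical placeholders; whichever dataset is used would supply its own vocabulary:

PREFIX :     <http://example.org/log#>
PREFIX http: <http://www.w3.org/2011/http#>
PREFIX geo:  <http://example.org/geoip#>   # hypothetical geo-IP vocabulary

SELECT ?country (COUNT(?req) AS ?hits)
WHERE {
  ?req a http:Request ;
       :remotehostip ?ip ;
       http:resp [ http:statusCodeValue ?code ] .
  FILTER (?code != 200)
  ?block geo:containsIP ?ip ;     # hypothetical: net block covering this IP
         geo:country ?country .   # hypothetical: block's registered country
}
GROUP BY ?country
ORDER BY DESC(?hits)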

Additional data could be collected by the logger program beyond the Apache log information.  Information such as CPU utilization, memory, disk I/O, etc. could also be collected as RDF and forwarded to the central quad store.  With this data, a more complete picture of what is happening to a web server can be constructed.
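For example, a periodic sample could be emitted in the same style as the request triples; the vocabulary below is again just an illustrative placeholder:

@prefix sys:  <http://example.org/sysmetrics#> .   # placeholder metrics vocabulary
@prefix time: <http://www.w3.org/2006/time#> .

[] a sys:Sample ;
   time:inXSDDateTime "2013-04-01T12:00:00-0500" ;
   sys:cpuUtilization 0.42 ;        # fraction of CPU in use
   sys:memoryFreeBytes 1073741824 ;
   sys:diskReadsPerSecond 120 .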

Happy RDF logging! :-)
