Logmagic System Administrator's Guide

Carlos Amengual, amengual at informatica.info

Revision date: Mar. 05, 2006

Latest version at the Logmagic web site.

Table of Contents

  1. Overview
  2. Main features
  3. Directory layout
  4. Program usage
  5. Configuration
  6. Supported log file formats
  7. Editing the templates
  8. Computing Visits
  9. DNS Name Resolution
  10. Dependencies on other packages

Overview

Logmagic is a web log analysis program written in the Java™ programming language. It uses Carte as its report template engine and can generate PDF and XHTML reports.

One of the main purposes of this software is the accurate estimation of the number of visitors that a website has. Most log analyzers make dubious assumptions about what a "visit" means and how to compute it, and others use undocumented procedures. In contrast, Logmagic uses accurate and documented methods to compute the number of visits.

Main features

Directory layout

When you unzip the Logmagic binary distribution onto your computer, you will find the following directories:

Logmagic's root directory

Contains the software license, and the logmagic.lcf logging configuration file.

conf

This directory contains the configuration files, including the most important one (default.properties), the default templates, etc.

doc

The documentation, including the Java API.

lib

Must contain the software libraries. The main logmagic.jar file is here, as well as all of its dependencies. For your convenience, all the required packages are shipped with this distribution, together with their license files (currently the dependent packages are licensed under the Apache, BSD, and MPL licenses).

logs

Logmagic writes its own log output here by default. A small sample web server log file is also located here.

scripts

A few example scripts to run Logmagic.

src

The source code. If you just want to run the software and get statistics, you do not need to look here.

stats

The default configuration writes the statistical reports here; you may want to change this location.

tmp

The default configuration writes the reverse DNS cache file here.

Program usage

The main program accepts the following arguments:

LogMagic [-c<stat.properties>] [-d<base_directory>] [-Dproperty1=value1 -Dproperty2=value2 ...] <logfile1.log> <logfile2.log> ... <logfileN.log>

Options:

File specification arguments: <logfile1.log> <logfile2.log> ... <logfileN.log>

Allows multiple files containing logs for the site and desired period.

Configuration

The following manuals are available, in addition to this guide:

It is recommended that you read both manuals before using the software. If you only want to see how it works, run one of the small sample scripts that come with the software (logmagic.bat for Windows, logmagic.sh for Unix). They generate a very small example report in the stats/www.example.com directory. To run the scripts successfully, you must first set the JAVA_HOME environment variable to the directory where your Java runtime environment is installed, for example C:\Program Files\Java\jre1.5.0_06.

If you want to test it on a real log file from your web server, the following quick advice should be useful:

Edit one of the above mentioned sample scripts:

Supported log file formats

The usual "common" and "extended" NCSA log file formats generated by Apache httpd and other servers are supported, both uncompressed and GZIP-compressed. To achieve the best results, each log line should end with a session ID string, like the one generated by the standard Apache module mod_usertrack. The recommended Apache log format definition is:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %{cookie}n" usertrack

You are encouraged to read your web server documentation for more information.
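
To make the field order of the recommended format concrete, here is a minimal sketch of how a line in that format could be split into its fields. It is not Logmagic's own parser: the class name UsertrackLineParser, the regular expression, and the sample log line (including the session ID value) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the "usertrack" format above; not Logmagic's parser. */
class UsertrackLineParser {
    // Field order: host ident user [time] "request" status bytes "referer" "user-agent" session-id
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"((?:[^\"\\\\]|\\\\.)*)\" (\\d{3}) (\\S+)"
        + " \"((?:[^\"\\\\]|\\\\.)*)\" \"((?:[^\"\\\\]|\\\\.)*)\"(?: (\\S+))?$");

    /** Returns the ten fields in order, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // the last field is null when no session ID is logged
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample line; the trailing token stands in for the %{cookie}n value.
        String sample = "192.0.2.10 - - [05/Mar/2006:10:15:30 +0100] "
            + "\"GET /index.html HTTP/1.1\" 200 5120 "
            + "\"http://www.example.com/\" \"Mozilla/5.0\" 192.0.2.10.1141550130123456";
        String[] f = parse(sample);
        System.out.println("host=" + f[0] + " status=" + f[5] + " session=" + f[9]);
    }
}
```

The session ID group is optional in this sketch, so lines in the plain "extended" format without the trailing field still match.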

Editing the templates

The default report templates are located in the conf directory and have names like pdf_<language>.xte or html_<language>.xte. Because the templates use standard XHTML and CSS, many layout changes can be made without any knowledge of the Carte report format, which merely annotates the XHTML through XML namespaces. You only need to take care not to modify the elements and attributes in the Carte namespace, i.e. those with the "xte:" prefix.

However, if you want to make deep changes to the templates, or create completely new ones, you may want to understand the basics of the Carte report format, explained in the Carte Report Writing Manual.
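
One way to check that a layout edit left the Carte annotations untouched is to list every "xte:"-prefixed element and attribute before and after the change and compare the two lists. The following is a minimal sketch using only the JDK's DOM parser. The class name XtePrefixLister is hypothetical, and the names xte:value and xte:id and the namespace URI in the usage example are invented placeholders, not real Carte vocabulary. It assumes the template can be parsed as standalone XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Lists "xte:"-prefixed elements and attributes in document order (illustrative only). */
class XtePrefixLister {
    static List<String> list(String xml) {
        try {
            // The default factory is not namespace-aware, so node names keep their prefix.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> found = new ArrayList<>();
            walk(doc.getDocumentElement(), found);
            return found;
        } catch (Exception e) {
            throw new IllegalArgumentException("Template is not well-formed XML", e);
        }
    }

    private static void walk(Node node, List<String> found) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text, comments, etc.
        }
        if (node.getNodeName().startsWith("xte:")) {
            found.add("<" + node.getNodeName() + ">");
        }
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.startsWith("xte:")) {
                found.add("@" + name);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), found);
        }
    }

    public static void main(String[] args) {
        // Placeholder markup: xte:id, xte:value and the URI are invented for this example.
        String xml = "<html xmlns:xte=\"urn:example:carte\"><body>"
            + "<p xte:id=\"a\">text</p><xte:value/></body></html>";
        System.out.println(list(xml));
    }
}
```

If the list produced from the edited template differs from the one produced from the original, the edit touched something in the Carte namespace.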

Computing Visits

As mentioned in the User Manual, the web protocol has no notion of distinguishing visitors or measuring sessions. Nevertheless, all log analyzers claim to count "visits" to a website, as this is the most wanted feature, and the results are generally unreliable. Typical inadequate methods used by other log analyzers to count visits are:

This software uses two methods to compute visits to a website: a basic one that does not use any cookies, and a second one that uses session-tracking cookies and falls back to the first method when it fails.

The hypothesis that this software relies on to compute visits is that the requests of each individual visit are logged monotonically, that is, in chronological order. This means that for any pair of requests made by a single client to the web server, the first one is logged before the second one, even if events from other clients are logged in between. In principle, all web servers respect this simple condition.

When the software processes a log entry, it looks at the IP address the entry comes from. After that, any other log entry coming from the same IP address, or from the same class C network (network mask 255.255.255.0) as any of the known IP addresses, is a candidate for being from the same client, i.e. it is a "session candidate". Then, the software looks at the "Referer" field of the log entry, which tells from which page the user came to the currently logged one. From this, the software builds a tree combining the requested files and the referers they come from. Session candidates are tested against the tree, and this procedure not only identifies the sessions but also stores all the session information. In the end, we know not only the number of sessions, but also when each one started and finished, which files were downloaded during it, and when.
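
The idea of testing session candidates against previously requested pages can be sketched roughly as follows. This is a deliberately simplified illustration, not Logmagic's implementation: it replaces the tree with a flat set of visited URLs per session, ignores timestamps, assumes the referer has already been reduced to a site-relative path, and handles IPv4 addresses only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified referer-based session matching (illustrative only). */
class RefererSessionSketch {
    /** One session: the /24 network it belongs to and the URLs requested so far. */
    private static final class Session {
        final String network;
        final Set<String> visitedUrls = new HashSet<>();

        Session(String network) {
            this.network = network;
        }
    }

    private final List<Session> sessions = new ArrayList<>();

    /** First three octets of an IPv4 address, i.e. its network under mask 255.255.255.0. */
    static String network24(String ip) {
        return ip.substring(0, ip.lastIndexOf('.'));
    }

    /** Assigns a request to a session and returns that session's index. */
    int add(String ip, String url, String referer) {
        String net = network24(ip);
        for (int i = 0; i < sessions.size(); i++) {
            Session s = sessions.get(i);
            // Candidate: same network, and the referer is a page this session already requested.
            if (s.network.equals(net) && s.visitedUrls.contains(referer)) {
                s.visitedUrls.add(url);
                return i;
            }
        }
        // No session explains this request, so it starts a new one.
        Session s = new Session(net);
        s.visitedUrls.add(url);
        sessions.add(s);
        return sessions.size() - 1;
    }

    int count() {
        return sessions.size();
    }
}
```

For example, a request for /products.html with referer /index.html joins an existing session from the same network that already requested /index.html, while the same request from an unrelated network opens a new session.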

The second method looks at a session-identifier cookie (see the section on log file formats) and uses it for session tracking. It is closer to the traditional methods used by advanced log analyzers, but falls back to the first method when sanity checks fail.
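
The fallback behavior can be pictured with a small sketch. The actual sanity checks are not described in this guide, so the usable() test below (cookie present and not "-") is only an assumed stand-in, and the fallbackKey parameter stands for whatever session the first method would assign. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Cookie-first session counting with a fallback key (illustrative only). */
class CookieFirstCounter {
    private final Set<String> cookieSessions = new HashSet<>();
    private final Set<String> fallbackSessions = new HashSet<>();

    /** Assumed sanity check: the cookie field is present and not the "-" placeholder. */
    static boolean usable(String cookie) {
        return cookie != null && !cookie.isEmpty() && !cookie.equals("-");
    }

    /** Counts the entry under its cookie if usable, otherwise under the fallback key. */
    void add(String cookie, String fallbackKey) {
        if (usable(cookie)) {
            cookieSessions.add(cookie);
        } else {
            fallbackSessions.add(fallbackKey);
        }
    }

    int sessions() {
        return cookieSessions.size() + fallbackSessions.size();
    }
}
```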

While session cookies are great, the accuracy of the first method should not be underestimated. A stunning example came up in 1999, when the author had to compute statistics from an SGI machine running Irix. The machine had been heavily patched to prepare for Y2K, and a bug was introduced in libc that caused all string representations of IP addresses to be 255.255.255.255. Thus, the ability to take the origin IP into account disappeared completely. The timestamp used by the session cookies still did a great job, and the number of sessions could be satisfactorily computed with it. Surprisingly, however, the first method systematically gave nearly the same figures as the cookie-based one, showing that the use of the Referer was nearly as accurate as the cookies, even in the most difficult circumstances.

As you may imagine, the computations described above are not cheap. This software requires a large amount of RAM (at least 250 MB for a medium-to-heavily used website) and is significantly slower than other log analyzers. Even on a fast machine, computing the statistics for a heavily used, large website can take a long time. Logmagic does not attempt to be the fastest analyzer, but the most accurate.

DNS Name Resolution

The default Logmagic configuration ships with the dns.lookup parameter set to true in the default.properties file, so the software will attempt to resolve every IP address it finds through DNS. This takes a lot of time (orders of magnitude more than running without it), during which Logmagic holds on to its memory resources.

Therefore, you may want to resolve the IP addresses into host names before running the log analyzer, using a post-processor like logresolve, and then run Logmagic with the dns.lookup parameter set to false.

If you still decide to use Logmagic to resolve the IP addresses, you should use a DNS cache to speed up name resolution. With a cache, each address only needs to be resolved once. There are two DNS cache choices, file and database, which are selected with the dns.cache_engine parameter. See the Configuration Guide for more details.
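
To illustrate why a cache helps, here is a minimal sketch of a caching reverse resolver. It is not Logmagic's cache engine: it keeps the cache in memory only, whereas the file and database engines persist it between runs, and the class name CachingReverseResolver is hypothetical. The lookup function is passed in so that the caching logic can be tried without network access.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** In-memory cache in front of a reverse DNS lookup (illustrative only). */
class CachingReverseResolver {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;
    private int lookups = 0;

    CachingReverseResolver(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Real reverse lookup through the JDK; yields the IP text itself when no name is found. */
    static String dnsLookup(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return ip;
        }
    }

    /** Resolves an address, performing the slow lookup only on the first request for it. */
    String resolve(String ip) {
        String name = cache.get(ip);
        if (name == null) {
            name = lookup.apply(ip);
            lookups++;
            cache.put(ip, name);
        }
        return name;
    }

    /** Number of slow lookups actually performed. */
    int lookupCount() {
        return lookups;
    }
}
```

With real DNS the resolver would be created as new CachingReverseResolver(CachingReverseResolver::dnsLookup). A log in which the same address appears thousands of times then triggers a single lookup for it.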

Dependencies on other packages

This software depends on these other open source packages, provided by different projects:

From our master site:
  Latest versions of carte.jar, css4j.jar, jclf.jar, jclf-www.jar and jclf-data.jar
From the jCharts project:
  jCharts-0.7.5.jar
From the iText project:
  itext.jar
From the DOM4J project:
  dom4j.jar
From the Jaxen project:
  jaxen.jar
From the Apache Batik project:
  batik-css.jar and batik-util.jar
From W3C's SAC:
  sac.jar
From the Apache Logging project:
  log4j.jar
From the Jakarta project:
  jakarta-oro.jar and commons-digester.jar

As mentioned above, all the required JAR files (and licenses) are included with the full distribution package, so you do not need to download each library.