Author: Wallix




LogsParser is an open-source Python library created by Wallix.
It is the core mechanism for log tagging and normalization in Wallix's LogBox.

Logs come in a variety of formats. To parse many different types of
logs, a developer used to have to write an engine based on a long list of complex
regular expressions, which can quickly become unreadable and unmaintainable.

By using LogsParser, a developer can free herself from the burden of writing a
log parsing engine, since the module comes with "batteries included".
Furthermore, this engine relies on XML definition files that can be loaded at
runtime. The definition files were designed to be easily readable and to require
very little skill in programming or regular expressions, without sacrificing
power or expressiveness.


The LogsParser module uses normalization definition files in order to tag
log entries. The definition files are written in XML.

The definition files allow anyone with a basic understanding of regular
expressions and knowledge of a specific log format to create and maintain
a customized pool of parsers.

Basically, a definition file consists of a list of log patterns, each
composed of many keywords. A keyword is a placeholder for a notable and/or
variable part of the described log line, and is therefore associated with a tag
name. It is paired with a tag type, i.e. a regular expression matching the
expected value to assign to this tag. If the raw value extracted this way needs
further processing, callback functions can be applied to it.
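
The relationship between patterns, keywords, tag types and callbacks can be
sketched in plain Python with the standard re module. This is only an
illustration of the idea, not the library's implementation or API: the names
TAG_TYPES, KEYWORDS, compile_pattern and normalize, the sample log line and the
regular expressions are all assumptions made for this example.

```python
import re

# Hypothetical tag types: a name mapped to the regular expression it stands for.
TAG_TYPES = {
    "Anything": r".*",
    "Integer": r"\d+",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}

# Hypothetical log pattern. Each upper-case keyword is a placeholder mapped to
# a (tag name, tag type, optional callback) triple.
PATTERN = "Accepted password for USER from SOURCE port PORT"
KEYWORDS = {
    "USER": ("user", "Anything", None),
    "SOURCE": ("source_ip", "IP", None),
    "PORT": ("source_port", "Integer", int),  # callback: convert to int
}


def compile_pattern(pattern, keywords):
    """Replace each keyword with a named group built from its tag type."""
    regex = re.escape(pattern)
    for keyword, (tag, tag_type, _callback) in keywords.items():
        group = "(?P<%s>%s)" % (tag, TAG_TYPES[tag_type])
        regex = regex.replace(keyword, group, 1)
    return re.compile(regex)


def normalize(line, regex, keywords):
    """Return a dict of tags for a matching line, or None if it does not match."""
    match = regex.match(line)
    if match is None:
        return None
    tags = match.groupdict()
    # Apply callbacks to the raw values that need further processing.
    for tag, _tag_type, callback in keywords.values():
        if callback is not None:
            tags[tag] = callback(tags[tag])
    return tags


REGEX = compile_pattern(PATTERN, KEYWORDS)
tags = normalize(
    "Accepted password for alice from 192.168.0.12 port 51234", REGEX, KEYWORDS
)
print(tags)
```

In a real definition file, the same information is declared in XML rather
than in Python dictionaries.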

This format also allows adding useful metadata about parsed logs, such as
extensive documentation about expected log patterns and log samples.

Format Description

A normalization definition file must strictly follow the specification
detailed in the file normalizer.dtd.

A simple template, called normalizer.template, is provided to help parser
writers get started.

Most definition files will include the following sections:

* some generic documentation about the parsed logs: emitting application,
  application version, etc. (non-mandatory)
* the definition file's author(s) (non-mandatory)
* custom tag types (non-mandatory)
* callback functions (non-mandatory)
* prerequisites on tag values prior to parsing (non-mandatory)
* log pattern(s) and how they are to be parsed
* extra tags with a fixed value that should be added once the parsing is done


The definition file's root must hold the following elements:

* the normalizer's name.
* the normalizer's version.
* the flags to apply when compiling the regular expressions associated with
  this parser: unicode support, multiple lines support, and ignore case.
* how to match the regular expression: from the beginning of the log line (match)
  or from anywhere in the targeted tag (search).
* the tag value to parse (raw, body...).
* the service taxonomy, if relevant, of the normalizer. See the end of this
  document for more details.
* the optional expandWhitespaces value. If set to "yes", spaces or carriage returns
  in a pattern will be converted to match any whitespace character (line feed, tab, etc.).
  This keeps multiline patterns readable.
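
The effect of these root-level settings can be approximated with the standard
re module. The sketch below is an assumption about how such options might map
onto Python's regular expression API, not the library's actual code: the
functions build_regex and apply_regex and their parameters are hypothetical
names chosen for this example.

```python
import re


def build_regex(pattern, use_unicode=True, multiline=False, ignorecase=False,
                expand_whitespaces=False):
    """Compile a pattern with flags resembling those of a definition file."""
    flags = 0
    if use_unicode:
        flags |= re.UNICODE
    if multiline:
        flags |= re.MULTILINE
    if ignorecase:
        flags |= re.IGNORECASE
    if expand_whitespaces:
        # Turn every run of spaces or line breaks in the pattern into "\s+",
        # so that the pattern can be laid out on several lines for readability.
        pattern = re.sub(r"[ \r\n]+", lambda m: r"\s+", pattern)
    return re.compile(pattern, flags)


def apply_regex(regex, value, mode="match"):
    """'match' anchors at the start of the value; 'search' looks anywhere."""
    if mode == "match":
        return regex.match(value)
    return regex.search(value)


# A pattern written on two lines still matches a line using tabs and
# repeated spaces once whitespace expansion is enabled.
layout = "user logged in\nfrom host"
expanded = build_regex(layout, expand_whitespaces=True)
strict = build_regex(layout)
sample = "user logged   in\tfrom host"

print(apply_regex(expanded, sample) is not None)
print(apply_regex(strict, sample) is not None)
```

The difference between the two matching modes shows up as soon as the
interesting part of the value is not at its start: a "match" on "fatal error"
with the pattern "error" finds nothing, while a "search" does.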

Default Tag Types

A few basic tag types are defined in the file common_tagTypes.xml. To use
them, this file has to be loaded when instantiating the Normalizer class; see
the class documentation for further information.

Here is a list of default tag types shipped with this library.

* Anything: any string of characters, of any length.
* Integer
* EpochTime: an epoch timestamp of arbitrary precision (to the second and below).
* syslogDate: a date as seen in syslog-formatted logs (example: Mar 12 20:13:23)
* MACAddress
* Email
* IP
* ZuluTime: a "Zulu Time"-type timestamp (example: 2012-12-21T13:45:05)
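
To give a feel for what a tag type stands for, here are rough regular
expressions for a few of the types above, checked against the examples given
in the list. They are approximations written for this illustration and are not
copied from common_tagTypes.xml; the names APPROX_TAG_TYPES and is_valid are
hypothetical.

```python
import re

# Approximate expressions only; the definitions shipped with the library
# may differ.
APPROX_TAG_TYPES = {
    "Integer": r"\d+",
    "syslogDate": r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}",
    "ZuluTime": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}",
    "MACAddress": r"[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5}",
}


def is_valid(tag_type, value):
    """Check that the whole value matches the approximate tag type."""
    return re.fullmatch(APPROX_TAG_TYPES[tag_type], value) is not None


print(is_valid("syslogDate", "Mar 12 20:13:23"))
print(is_valid("ZuluTime", "2012-12-21T13:45:05"))
```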

Custom Tag Types