Analyzing Logs

Log Processing

  1. First, it must consume the messages coming from the streaming layer
    • which requires connecting to one or several message brokers using a message-queuing protocol and continuously reading messages from them
  2. The messages must then be converted into a standard format
    • to facilitate processing
    • this standardization allows custom analyzers to work with log events more easily, instead of each having to know how to convert timestamps or IP addresses
  3. Standard messages are then forwarded to analysis plugins
    • routing and multiplexing allow several plugins to receive a copy of a given message
    • plugins run arbitrary code written to achieve specific tasks: compute statistics, flag events containing a given string, and so on
  4. Plugins produce their own outputs sent to specific destinations
    • destinations like an email recipient, a database, or a local file
    • it’s also possible to chain plugins by reinjecting processed messages into the broker to form an analysis loop
  • We could design an analysis layer as a set of individual programs where each performs all four steps, but that would duplicate a lot of code for steps 1, 2, and 4
  • Instead, we should use a tool to handle these steps for us, and focus our energy on writing the custom analysis plugins in step 3
  • A large selection of software can handle the core operations of an analysis layer (handle the processing and standardization of logs, and run custom plugins)
    • Fluentd
    • Logstash
    • Splunk
    • Hindsight (Mozilla)
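The four steps above can be sketched as a single loop. This is a hypothetical illustration of the shape of the pipeline, not the API of Fluentd, Logstash, Splunk, or Hindsight; all names are made up for the example.

```python
import json

def standardize(raw):
    """Step 2: parse a raw JSON message into a common event format."""
    event = json.loads(raw)
    event.setdefault("source_ip", "0.0.0.0")  # ensure standard fields exist
    return event

def run_pipeline(messages, plugins, outputs):
    """Steps 1-4: consume, standardize, fan out to plugins, collect outputs."""
    for raw in messages:                 # step 1: messages read from the broker
        event = standardize(raw)         # step 2: normalize into a standard dict
        for plugin in plugins:           # step 3: each plugin gets a copy
            result = plugin(event)
            if result is not None:
                outputs.append(result)   # step 4: forward to a destination
    return outputs
```

A plugin here is just a function receiving the standardized event; chaining (the analysis loop in step 4) would amount to feeding a plugin's output back into `messages`.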


String Signatures
  • the easiest way to look for patterns of bad activity is to compare logs against lists of known bad strings
  • typically implemented with regexes; signature lists are difficult to maintain as attack patterns evolve
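A minimal sketch of signature matching: the two signatures below are illustrative examples, not a real ruleset.

```python
import re

# Known-bad patterns: each regex flags one class of malicious request.
SIGNATURES = [
    re.compile(r"\.\./\.\."),              # directory traversal attempt
    re.compile(r"union\s+select", re.I),   # SQL injection probe
]

def matches_signature(log_line):
    """Return True if any known-bad pattern appears in the log line."""
    return any(sig.search(log_line) for sig in SIGNATURES)
```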
Statistical Models
  • Sliding windows
    • predefined alerting threshold
    • detecting clients that violate limits requires counting requests sent by each client over a given period of time
    • let’s say you want a sliding window that has a 1-minute granularity, and 8-minute retention
    • implementing it requires counting every request received within a given minute and storing that value so you can calculate the total for the last 8 minutes
    • as time progresses by 1 minute, you discard the oldest value and add a new value, effectively moving the window forward
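The sliding window described above can be sketched with a fixed-length deque: one counter per minute, eight minutes of retention. Bucket sizes here are illustrative.

```python
from collections import deque

class SlidingWindow:
    """1-minute granularity, 8-minute retention."""
    def __init__(self, retention=8):
        # one integer bucket per minute; maxlen drops the oldest automatically
        self.buckets = deque([0] * retention, maxlen=retention)

    def record(self, count=1):
        self.buckets[-1] += count      # count requests in the current minute

    def tick(self):
        self.buckets.append(0)         # a minute elapsed: drop oldest, open new

    def total(self):
        return sum(self.buckets)       # requests over the last 8 minutes
```

Comparing `total()` against a predefined threshold is what raises the alert.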
  • Circular buffers
    • predefined alerting threshold
    • data structure that implements a sliding window using a fixed-size buffer, where the last entry is followed by the first entry, in a loop
    • maintaining a sliding window inside a circular buffer gives you a way to flag clients who may be sending a large amount of traffic over a given period of time
      • you need to keep one circular buffer per client IP to track the count of requests sent by each client individually
      • in practice, this means maintaining a hash table where the key is the IP of the client and the value is the circular buffer
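A sketch of that structure: a fixed-size circular buffer per client, kept in a hash table keyed by client IP. Buffer size and threshold are illustrative values.

```python
class CircularBuffer:
    """Fixed-size buffer where the last entry wraps around to the first."""
    def __init__(self, size=8):
        self.slots = [0] * size
        self.pos = 0

    def increment(self, n=1):
        self.slots[self.pos] += n      # count a request in the current slot

    def advance(self):
        # move the window forward one slot, overwriting the oldest entry
        self.pos = (self.pos + 1) % len(self.slots)
        self.slots[self.pos] = 0

    def total(self):
        return sum(self.slots)

clients = {}                            # key: client IP, value: its buffer

def record_request(ip):
    buf = clients.setdefault(ip, CircularBuffer())
    buf.increment()

def flag_heavy_clients(threshold=100):
    """Return IPs whose request count over the window exceeds the threshold."""
    return [ip for ip, buf in clients.items() if buf.total() > threshold]
```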
  • Moving averages
    • let’s say you want an average amount of requests per minute sent by each client of a service
      • you want that average to move over time and cover the last 10 minutes of traffic
      • if you find any client sending two or three times more traffic than the average, you can flag it as suspicious
    • to implement this analyzer, you need two things:
      • a circular buffer to keep track of the last 10 minutes of requests received from all clients
      • a count of unique clients seen over each one-minute period
        • using a Cuckoo filter, a probabilistic data structure (similar to a Bloom filter) that provides fast membership lookups with minimal storage overhead, at the cost of a small false-positive rate
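A sketch of the moving-average analyzer: per-minute request totals and per-minute unique-client sets over the last 10 minutes. For simplicity this uses Python sets to count unique clients; a real implementation would use a Cuckoo filter to bound memory, as noted above. The suspicion factor is an illustrative value.

```python
from collections import deque

class MovingAverage:
    """Average requests per client over the last 10 one-minute periods."""
    def __init__(self, retention=10):
        self.requests = deque(maxlen=retention)   # total requests per minute
        self.clients = deque(maxlen=retention)    # unique clients per minute
        self.requests.append(0)
        self.clients.append(set())

    def record(self, ip):
        self.requests[-1] += 1
        self.clients[-1].add(ip)

    def tick(self):
        self.requests.append(0)        # minute elapsed: start fresh counters
        self.clients.append(set())

    def average_per_client(self):
        total = sum(self.requests)
        unique = len(set().union(*self.clients))
        return total / unique if unique else 0.0

    def is_suspicious(self, client_count, factor=3):
        # flag a client sending several times more than the average
        return client_count > factor * self.average_per_client()
```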
Geographic Data
  • Protect against identity theft
    • the most efficient method for protecting users is to check the origin of their connection and, if it’s too far away from the user’s regular geographical region, require additional login steps
  • Geoprofiling users
    • maintain a geographic profile of each user and store it in a database
      • the two circles around the usual connection area represent various degrees of trust
      • the smaller circle represents the usual connection area of the user
      • the larger circle represents the farthest location from the center of connection and is used as a second level of trust, indicating that it’s not completely unlikely the user may connect from within this larger circle
  • Geoprofiling algorithm
    • observes events coming from a user, obtains the latitude and longitude of the source IP of the event (called geolocating an IP), and checks its database to see if it falls within the usual connection area of the user
      • if it does, the event passes through the filter and the connection is added to the history of the user
      • if it doesn’t, an alert is raised and action is taken
    • once the latitude and longitude of an IP address are known, you need to calculate how far that location is from the normal connection area
      • this distance is computed with the haversine formula, which calculates the great-circle distance between 2 points on a sphere
    • store the latitude and longitude of the known geocenter, along with its weight (the number of connections you’ve seen for the user so far)
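The haversine formula and the weighted geocenter update can be sketched as follows. The Earth radius is in kilometers; the geocenter update naively averages latitude and longitude, which is a simplification that misbehaves near the ±180° meridian.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))   # mean Earth radius: 6371 km

def update_geocenter(center_lat, center_lon, weight, new_lat, new_lon):
    """Shift the geocenter toward a new connection, weighted by history.

    `weight` is the number of connections seen for the user so far.
    """
    lat = (center_lat * weight + new_lat) / (weight + 1)
    lon = (center_lon * weight + new_lon) / (weight + 1)
    return lat, lon, weight + 1
```

An event would be geolocated, its distance to the stored geocenter computed with `haversine_km`, and compared against the radii of the two trust circles before deciding whether to alert.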
Anomalies in Known Patterns
  • User-Agent signature
    • keep track of the browsers used regularly and compare the live traffic against the browser history
    • require additional authentication steps when a new browser is detected
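A minimal sketch of the User-Agent check: the storage layout and the step-up decision are assumptions for illustration, not a specific product's behavior.

```python
# Hypothetical per-user browser history; a real system would persist this.
known_agents = {"alice": {"Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"}}

def requires_extra_auth(user, user_agent):
    """Return True when the browser is new for this user."""
    seen = known_agents.setdefault(user, set())
    if user_agent in seen:
        return False          # known browser: normal login
    seen.add(user_agent)      # record it (after the step-up succeeds)
    return True               # new browser: require additional authentication
```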
  • Anomalous browser
    • detecting impossible, or unlikely, browsers (e.g., “Internet Explorer 6 on Linux”)
  • Interaction patterns
    • people will visit pages in the same order and at the same pace