Collecting Logs

Collection Layer

Systems
Apps
  • mozlog
    • Basic set of fields encoded in JSON format (see the sample record after this list)
  • OWASP Logging Cheat Sheet
    • High-level list of events an application should record:
      • Input validation failures; for example, protocol violations, unacceptable encodings, invalid parameter names and values
      • Output validation failures such as database-record-set mismatch, invalid data encoding
      • Authentication successes and failures
      • Authorization (access control) failures
      • Session management failures; for example, cookie session identification-value modification
      • Application errors and system events such as syntax and runtime errors, connectivity problems, performance issues, third-party service error messages, filesystem errors, file upload virus detection, configuration changes
      • Application and related systems start-ups and shut-downs, and logging initialization (starting, stopping, or pausing)
      • Use of higher-risk functionality; for example, network connections, adding or deleting users, changes to privileges, assigning users to tokens, adding or deleting tokens, use of systems administrative privileges, access by application administrators, all actions by users with administrative privileges, access to payment-cardholder data, use of data-encrypting keys, key changes, creation and deletion of system-level objects, data import and export including screen-based reports, submission of user-generated content—especially file uploads
      • Legal and other opt-ins such as permissions for mobile phone capabilities, terms of use, terms and conditions, personal data-usage consent, permission to receive marketing communications
  • OWASP AppSensor
    • Outlines a sophisticated method by which applications can detect and respond to attacks using complex logging- and event-analysis techniques
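
A minimal sketch of what a single application event might look like, combining the mozlog envelope with an OWASP-style authentication-failure record. The envelope field names follow the mozlog format as commonly documented; the logger name, severity, and Fields payload are illustrative assumptions, not taken from the notes above.

    import json
    import os
    import socket
    import time

    def mozlog_event(logger, event_type, severity, fields):
        # Build a mozlog-style record: a fixed envelope plus free-form Fields
        return {
            "Timestamp": int(time.time() * 1e9),   # nanoseconds since the epoch
            "Type": event_type,
            "Logger": logger,
            "Hostname": socket.gethostname(),
            "EnvVersion": "2.0",
            "Pid": os.getpid(),
            "Severity": severity,                  # syslog levels, 0 (emerg) to 7 (debug)
            "Fields": fields,                      # application-specific key/value pairs
        }

    # Hypothetical authentication-failure event of the kind the OWASP cheat sheet recommends logging
    event = mozlog_event(
        logger="example-webapp",
        event_type="authentication",
        severity=4,                                # warning
        fields={"action": "login", "result": "failure",
                "username": "jdoe", "source_ip": "198.51.100.23"},
    )
    print(json.dumps(event))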
Infrastructure
  • CloudTrail
  • NetFlow

Streaming Layer

Message Broker
  • Application that receives messages from publishers and routes them to consumers
  • Pipe with some smart logic to decide which consumer gets a copy of a given message
  • Different message brokers provide different reliability guarantees for published messages:
    • NSQ provides no reliability guarantee; a process crash will lose messages
    • RabbitMQ can guarantee that messages are replicated to more than one member of the message-broker cluster before acknowledging acceptance
    • Apache Kafka goes further and not only replicates messages but also keeps a history log of messages across cluster nodes for a configurable period
    • AWS Kinesis provides similar capabilities and is entirely operated by AWS
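
A brief, hedged sketch of publishing and consuming log events through Kafka with the kafka-python client; the broker address, topic name, and consumer-group name are assumptions. Setting acks="all" asks the brokers to confirm replication before acknowledging, matching the durability trade-off described above.

    import json
    from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

    # Publisher side: collectors push JSON log events into a topic
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",          # assumed broker address
        acks="all",                                  # wait until replicas hold the message
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("logs", {"type": "authentication", "result": "failure"})
    producer.flush()

    # Consumer side: each analysis worker joins a consumer group and reads events
    consumer = KafkaConsumer(
        "logs",
        bootstrap_servers="localhost:9092",
        group_id="log-analyzers",                    # assumed consumer-group name
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)                         # hand off to an analysis plugin here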

Analysis Layer

Common Pattern
  • The common pattern is to run consumers as small plugins executed on top of a log-processing system (see the sketch after this list)
    • such as Fluentd or Logstash
    • these tools provide generic features to
      1. consume logs from various message brokers
      2. pass them through custom analysis plugins defined by the operator
      3. write the output to a destination of choice
  • Consumers in a logging pipeline are primarily focused on three types of tasks:
    • Log transformation and storage
    • Metrics and stats computed to provide the DevOps team with visibility over the health of their services
    • Anomaly detection, which includes detecting attacks and fraud
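
The sketch below is a minimal, stand-alone illustration of those three tasks in one consumer, assuming JSON events arrive on stdin as a stand-in for a broker subscription; the field names, output path, and alerting threshold are made up for the example.

    import json
    import sys
    from collections import Counter

    FAILURE_THRESHOLD = 10          # hypothetical alerting threshold
    failed_logins = Counter()       # per-source-IP failure counts (the metrics task)

    # Each line on stdin stands in for a message consumed from the broker
    for line in sys.stdin:
        event = json.loads(line)

        # Task 1: transformation and storage - write every event to a destination
        with open("/tmp/processed.log", "a") as destination:
            destination.write(json.dumps(event) + "\n")

        # Task 2: metrics and stats - count authentication failures per source IP
        if event.get("type") == "authentication" and event.get("result") == "failure":
            source_ip = event.get("source_ip", "unknown")
            failed_logins[source_ip] += 1

            # Task 3: anomaly detection - flag sources with repeated failures
            if failed_logins[source_ip] > FAILURE_THRESHOLD:
                print(f"ALERT: repeated login failures from {source_ip}")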
In-memory Databases
  • Multiple consumers can share state through a dedicated database
  • In-memory databases such as memcached and Redis are commonly used
  • Consumers of the same type process messages and update a state maintained in the database
  • These databases aren’t meant for long-term data storage, so they should only hold state that is short-lived and can be lost without significantly impacting the reliability of the logging pipeline
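
A minimal sketch of sharing state through Redis with the redis-py client; the key scheme and one-hour TTL are assumptions. Because the counters expire on their own, losing the database only degrades detection briefly rather than breaking the pipeline.

    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)    # assumed Redis location

    def record_login_failure(source_ip):
        # Called by any consumer instance; the count is shared through Redis
        key = f"failed_logins:{source_ip}"          # hypothetical key scheme
        count = r.incr(key)                         # atomic increment across all consumers
        r.expire(key, 3600)                         # short-lived state: forget after one hour
        return count

    if record_login_failure("198.51.100.23") > 10:
        print("ALERT: repeated login failures from 198.51.100.23")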

Storage Layer

Storage Types
  • A logging pipeline’s storage layer should provide interfaces that the DevOps team can easily plug into for accessing their data
  • They’re usually of three types:
    • The grep server - the classic type of log storage: a server with lots of disk space where operators can use command-line tools to explore logs
    • Document databases - another popular choice, with Elasticsearch as the most common storage engine (see the sketch after this list)
    • Relational databases - and particularly data warehouses - also a popular choice, often found in business intelligence (BI) and security information and event management (SIEM) solutions
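
A hedged sketch of writing an event into a daily Elasticsearch index with the official Python client (8.x-style call; older clients use body= instead of document=); the cluster address and the logs-YYYY.MM.DD naming convention are assumptions. Daily indexes make the retention policy described below cheap to enforce, since old data is removed by dropping whole indexes.

    from datetime import datetime, timezone
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")     # assumed cluster address

    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "type": "authentication",
        "result": "failure",
        "source_ip": "198.51.100.23",
    }

    # One index per day; deleting old data means dropping entire indexes
    index_name = "logs-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
    es.index(index=index_name, document=event)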
Lifecycle
  • Raw logs are stored on the grep server for 30 days
    • Every night, a periodic job rotates log files, compresses the files of the day that ended, and publishes the compressed file to an archive
    • After 30 days, the compressed log files are deleted from disk
  • Raw logs are also written into an Elasticsearch database by a different consumer
    • Logs are stored in indexes that represent the current day
    • Indexes are kept for 15 days and deleted thereafter
  • Metrics are computed by a third consumer and stored in their own database
    • Metrics are never deleted because their volume is low enough that they can be stored forever
    • This gives engineers the ability to compare trends from year to year
  • In the archive, compressed log files are stored in folders by year and month
    • If required for cost reasons, logs can be deleted after a given period that shouldn’t be shorter than three months
    • Both AWS S3 and Glacier provide automated lifecycle-management features to delete data after a specified period of time
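
A hedged example of such a lifecycle rule applied with boto3; the bucket name, key prefix, and the 90-day/365-day periods are placeholders chosen for illustration.

    import boto3  # pip install boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="example-log-archive",                # hypothetical archive bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-expire-logs",
                "Filter": {"Prefix": "logs/"},       # assumed prefix for compressed log files
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],  # move to Glacier
                "Expiration": {"Days": 365},                               # then delete
            }]
        },
    )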

Access Layer

Tools
  • Kibana and Elasticsearch
  • Prometheus and Grafana