How to create maintainable and reusable Logstash pipelines

February 26, 2020

Introduction

Logstash is an open source data processing pipeline that ingests events from one or more inputs, transforms them, and then sends each event to one or more outputs. Some Logstash implementations include many lines of code and process events from multiple input sources. To make such implementations more maintainable, I will show how to increase code reusability by building pipelines from modular components.

Motivation

It is often necessary for Logstash to apply a common subset of logic to events from multiple input sources. This is commonly achieved in one of the following two ways:

  1. Process events from several different input sources in a single pipeline so that common logic can easily be applied to all events from all sources. In such implementations, in addition to the common logic there is usually a significant amount of conditional logic. This approach therefore may result in Logstash implementations that are complicated and difficult to understand.
  2. Execute a unique pipeline for processing events from each unique input source. This approach requires duplicating and copying common functionality into each pipeline, which makes it difficult to maintain the common portions of the code.

The technique presented in this blog post addresses the shortcomings of the above approaches by storing modular pipeline components in separate files, and then constructing pipelines by combining these components. This technique can reduce pipeline complexity and eliminate code duplication.

Modular pipeline construction

A Logstash configuration file consists of inputs, filters, and outputs, which are executed by a Logstash pipeline. In more advanced setups, it is common for a single Logstash instance to execute multiple pipelines. By default, when you start Logstash without arguments, it reads a file called pipelines.yml and instantiates the pipelines specified there.

Logstash inputs, filters, and outputs can be stored in multiple files which can be selected for inclusion into a pipeline by specifying a glob expression. The files that match a glob expression will be combined in alphabetical order. As the order of execution of filters is often important, it may be helpful to include numeric identifiers in file names to ensure that files are combined in the desired order.
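
The selection and ordering described above can be modeled in a few lines of Python. This is only a sketch of the described behavior, not Logstash's actual implementation: the helper names expand_braces and files_in_load_order and the conf/ directory are hypothetical, and the ordering is assumed to be a plain lexicographic sort of the matched file names.

```python
import re

def expand_braces(pattern):
    """Expand one {a,b,c} group: 'conf/{x,y}.cfg' -> ['conf/x.cfg', 'conf/y.cfg']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # no brace group, nothing to expand
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + option + suffix for option in match.group(1).split(",")]

def files_in_load_order(pattern):
    """Assumption: model the combining step as a lexicographic sort of the names."""
    return sorted(expand_braces(pattern))

# The order in which the names are listed inside the braces plays no role here.
for path in files_in_load_order("conf/{01_in,01_filter,02_filter,01_out}.cfg"):
    print(path)
```

Sorted this way, 01_filter.cfg comes before 01_in.cfg. That is harmless, because Logstash always runs inputs, then filters, then outputs, wherever each section appears in the combined configuration. The file order only matters among sections of the same type, and in practice mostly among filters.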

Below we define two unique pipelines, each a combination of several modular Logstash components. We store our Logstash components in the following files:

  • Input declarations: 01_in.cfg, 02_in.cfg
  • Filter declarations: 01_filter.cfg, 02_filter.cfg, 03_filter.cfg
  • Output declarations: 01_out.cfg

Using glob expressions, we then define pipelines in pipelines.yml to be composed of the desired components as follows:

- pipeline.id: my-pipeline_1   
  path.config: "<path>/{01_in,01_filter,02_filter,01_out}.cfg" 
- pipeline.id: my-pipeline_2   
  path.config: "<path>/{02_in,02_filter,03_filter,01_out}.cfg"

In the above pipeline configuration, the file 02_filter.cfg is present in both pipelines (as is the output file 01_out.cfg). This demonstrates how code that is common to both pipelines can be defined and maintained in a single file and still be executed by multiple pipelines.
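
As the number of pipelines grows, it can help to know which components are shared before changing one of them. The following Python sketch is a hypothetical helper, not part of Logstash. It takes the component lists from the pipelines.yml above, typed in by hand, and reports every component that more than one pipeline uses.

```python
from collections import defaultdict

# Components per pipeline, copied by hand from the pipelines.yml shown above.
PIPELINES = {
    "my-pipeline_1": ["01_in", "01_filter", "02_filter", "01_out"],
    "my-pipeline_2": ["02_in", "02_filter", "03_filter", "01_out"],
}

def shared_components(pipelines):
    """Return {component: [pipeline ids]} for components used by more than one pipeline."""
    users = defaultdict(list)
    for pipeline_id, components in pipelines.items():
        for component in components:
            users[component].append(pipeline_id)
    return {name: ids for name, ids in sorted(users.items()) if len(ids) > 1}

for name, ids in shared_components(PIPELINES).items():
    print(name + ".cfg is used by " + ", ".join(ids))
```

For this configuration the helper reports 01_out.cfg and 02_filter.cfg, so a change to either file affects both pipelines.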

Testing the pipelines

In this section we provide a concrete example of the files that will be combined into the unique pipelines defined in the above pipelines.yml. We then run Logstash with these files and present the generated output.

Configuration files

Input file: 01_in.cfg

This file defines an input that is a generator. Note that the generator input is designed for testing Logstash, and in this case it will generate a single event.

input {
  generator {
    lines => ["Generated line"]
    count => 1
  }
}

Input file: 02_in.cfg

This file defines a Logstash input that listens on stdin.

input {
  stdin {}
}

Filter file: 01_filter.cfg

filter {
  mutate {
    add_field => { "filter_name" => "Filter 01" }
  }
}

Filter file: 02_filter.cfg

filter {
  mutate {
    add_field => { "filter_name" => "Filter 02" }
  }
}

Filter file: 03_filter.cfg

filter {
  mutate {
    add_field => { "filter_name" => "Filter 03" }
  }
}

Output file: 01_out.cfg

output {
  stdout { codec => "rubydebug" }
}

Execute the pipeline

Starting Logstash without any options will run the pipelines defined in the pipelines.yml file that we created earlier. Run Logstash as follows:

./bin/logstash

Because the pipeline my-pipeline_1 uses a generator to simulate an input event, we should see the following output as soon as Logstash has finished initializing. This shows that the contents of 01_filter.cfg and 02_filter.cfg are executed by this pipeline as expected.

{
       "sequence" => 0,
           "host" => "alexandersmbp2.lan",
        "message" => "Generated line",
     "@timestamp" => 2020-02-05T22:10:09.495Z,
       "@version" => "1",
    "filter_name" => [
        [0] "Filter 01",
        [1] "Filter 02"
    ]
}

As the other pipeline called my-pipeline_2 is waiting for input on stdin, we have not seen any events processed by that pipeline yet. Type something into the terminal where Logstash is running, and press return to create an event for this pipeline. Once you have done this, you should see something like the following:

{
    "filter_name" => [
        [0] "Filter 02",
        [1] "Filter 03"
    ],
           "host" => "alexandersmbp2.lan",
        "message" => "I’m testing my-pipeline_2",
     "@timestamp" => 2020-02-05T22:20:43.250Z,
       "@version" => "1"
}

We can see from the above that the contents of 02_filter.cfg and 03_filter.cfg are applied as expected.  

Order of execution

Be aware that Logstash does not pay attention to the order of the files in the glob expression. It only uses the glob expression to determine which files to include, and then orders them alphabetically. That is to say, even if we were to change the definition of my-pipeline_2 so that 03_filter.cfg appears in the glob expression before 02_filter.cfg, each event would pass through the filter in 02_filter.cfg before the filter defined in 03_filter.cfg.
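
The effect can be illustrated with a toy Python model. It is a sketch under stated assumptions rather than a reproduction of Logstash internals: run_filters is a hypothetical function, each filter file is reduced to the value it adds to filter_name, and files are assumed to be applied in sorted file-name order.

```python
# Values each filter file adds to "filter_name", taken from the example files above.
FILTERS = {
    "02_filter.cfg": "Filter 02",
    "03_filter.cfg": "Filter 03",
}

def run_filters(files_as_listed):
    """Apply the filters in sorted file-name order, ignoring the listed order."""
    event = {"message": "test", "filter_name": []}
    for file_name in sorted(files_as_listed):
        event["filter_name"].append(FILTERS[file_name])
    return event

original = run_filters(["02_filter.cfg", "03_filter.cfg"])
swapped = run_filters(["03_filter.cfg", "02_filter.cfg"])

print(original["filter_name"])
print(swapped["filter_name"])
```

Both calls produce the same event, with "Filter 02" added before "Filter 03". To change the order in which filters run, rename the files rather than reordering the glob expression.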

Conclusion

Using glob expressions allows Logstash pipelines to be composed from modular components, which are stored as individual files. This can improve code maintainability, reusability, and readability. As a side note, in addition to the technique documented in this blog post, pipeline-to-pipeline communication is also worth considering as a way to improve the modularity of a Logstash implementation.