Improving the performance of Logstash persistent queues

Introduction

By default, Logstash uses in-memory bounded queues between pipeline stages (inputs → pipeline workers) to buffer events. However, to protect against data loss during abnormal termination, Logstash has a persistent queue feature that can be enabled to store the message queue on disk. The queue sits between the input and filter stages as follows:

input → queue → filter + output
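
For reference, the persistent queue is controlled by the queue.* settings in logstash.yml (or per pipeline in pipelines.yml). The following is a minimal sketch; the queue path and size limit are illustrative assumptions, not settings taken from this article:

    # logstash.yml: enable the persistent queue (the default queue.type is "memory")
    queue.type: persisted
    # Illustrative values: directory where queue pages are written, and a cap on on-disk queue size
    path.queue: /var/lib/logstash/queue
    queue.max_bytes: 8gb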

According to the following blog post, Logstash persistent queues should have only a small impact on overall throughput. While this is likely true for CPU-bound pipelines, it is not always the case.

Motivation

In a recent Logstash implementation, enabling persistent queues caused a slowdown of about 75%, from roughly 40K events/s down to roughly 10K events/s. Somewhat surprisingly, disk I/O metrics made it clear that the disks were not saturated. Additionally, standard Logstash tuning techniques, such as testing different batch sizes and adding more worker threads, were unable to remedy the slowdown.
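
For context, those tuning knobs correspond to the pipeline.workers and pipeline.batch.size settings in logstash.yml. A minimal sketch, with illustrative values rather than the exact values that were tested:

    # logstash.yml: worker and batch settings varied during tuning (illustrative values)
    pipeline.workers: 8        # filter/output worker threads; defaults to the number of CPU cores
    pipeline.batch.size: 250   # events each worker collects before flushing; the default is 125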

Why persistent queues may impact Logstash performance

Investigation showed that throughput was limited because a single Logstash pipeline writes to its persistent queue from a single thread; in other words, a single Logstash pipeline only drives data to disk from one thread. This is true even if the pipeline has multiple inputs, as additional inputs in a single pipeline do not add disk I/O threads. Furthermore, because enabling the persistent queue inserts synchronous disk I/O (wait time) into the pipeline, it reduces throughput even when none of the resources on the system are maxed out.

Solution

Given that Logstash throughput was limited by synchronous disk I/O rather than by resource constraints, more threads writing in parallel were needed to drive the disks harder and increase overall throughput. This was accomplished by running multiple identical pipelines in parallel within a single Logstash process, and then load balancing the input data stream across those pipelines. If data is driven into Logstash by Filebeat, load balancing can be done by pointing Filebeat's Logstash output at one endpoint per pipeline and enabling its load-balancing option, as sketched below.
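
As a minimal sketch of this setup, the parallel pipelines can be declared in pipelines.yml, with each pipeline loading an identical configuration that listens on its own Beats port. The pipeline IDs, config paths, and ports below are assumptions for illustration:

    # pipelines.yml: four identical pipelines, each with its own single-threaded persistent queue
    - pipeline.id: parallel_pq_1
      path.config: "/etc/logstash/parallel/pq_1.conf"   # contains a beats input on port 5044
      queue.type: persisted
    - pipeline.id: parallel_pq_2
      path.config: "/etc/logstash/parallel/pq_2.conf"   # beats input on port 5045
      queue.type: persisted
    - pipeline.id: parallel_pq_3
      path.config: "/etc/logstash/parallel/pq_3.conf"   # beats input on port 5046
      queue.type: persisted
    - pipeline.id: parallel_pq_4
      path.config: "/etc/logstash/parallel/pq_4.conf"   # beats input on port 5047
      queue.type: persisted

On the Filebeat side, the corresponding load balancing lists one host entry per pipeline port under the Logstash output and enables the loadbalance option (the host name and ports are again illustrative):

    # filebeat.yml: spread events across the four Logstash pipeline ports
    output.logstash:
      hosts: ["logstash-host:5044", "logstash-host:5045", "logstash-host:5046", "logstash-host:5047"]
      loadbalance: true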

Result

After increasing the number of pipelines to 4 and splitting the input data across these 4 pipelines, Logstash performance with persistent queues increased to about 30K events/s, or only 25% worse than without persistent queues. At that point the disks were saturated, and no further performance improvements were possible.

Feedback

As shown in the comments below, this approach has also helped other Logstash users with substantial performance gains. Did this solution help you? If so, please consider leaving a comment below!