Now posted on the Elastic blog
November 14, 2018 update: This article has been published on Elastic's website at https://www.elastic.co/blog/using-parallel-logstash-pipelines-to-improve-persistent-queue-performance.
By default, Logstash uses in-memory bounded queues between pipeline stages (inputs → pipeline workers) to buffer events. To protect against data loss during abnormal termination, however, Logstash has a persistent queue feature that can be enabled to store the message queue on disk. The queue sits between the input and filter stages as follows:
input → queue → filter + output
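For reference, the persistent queue is enabled per pipeline via settings like the following. This is a minimal sketch; the queue path and size are illustrative assumptions, not the values used in the implementation described below:

```yaml
# logstash.yml — enable the persistent queue (path and size are illustrative)
queue.type: persisted                  # default is "memory" (in-memory bounded queue)
path.queue: /var/lib/logstash/queue    # directory where queue data pages are written
queue.max_bytes: 4gb                   # total on-disk capacity of the queue
```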
According to Elastic's blog post introducing the feature, Logstash persistent queues should have only a small impact on overall throughput. While this is likely true for use cases where the pipeline is CPU bound, it is not always the case.
In a recent Logstash implementation, enabling persistent queues caused a slowdown of about 75%, from about 40K events/s down to about 10K events/s. Somewhat surprisingly, disk I/O metrics made it clear that the disks were not saturated. Additionally, standard Logstash tuning techniques, such as testing different batch sizes and adding more worker threads, were unable to remedy the slowdown.
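For context, those tuning knobs are standard Logstash settings; a minimal sketch, with illustrative values:

```yaml
# logstash.yml — standard tuning knobs that did not remedy the slowdown
# (values are illustrative)
pipeline.workers: 8        # number of filter + output worker threads
pipeline.batch.size: 250   # events each worker collects before running filters/outputs
```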
Why persistent queues may impact Logstash performance
Investigation showed that throughput was limited because a single Logstash pipeline runs a single-threaded persistent queue; in other words, a single Logstash pipeline drives data to disk from only one thread. This is true even if the pipeline has multiple inputs, as additional inputs in a single pipeline do not add disk I/O threads. Furthermore, because enabling the persistent queue inserts synchronous disk I/O (wait time) into the pipeline, it reduces throughput even when none of the system's resources are maxed out.
Given that Logstash throughput was limited by synchronous disk I/O rather than by resource constraints, more threads writing in parallel were needed to drive the disks harder and increase overall throughput. This was accomplished by running multiple identical pipelines in parallel within a single Logstash process, and then load balancing the input data stream across the pipelines. If data is driven into Logstash by Filebeat, load balancing can be done by listing multiple Logstash endpoints in Filebeat's output configuration, as sketched below.
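What such a setup might look like in pipelines.yml, assuming four identical pipelines whose beats inputs listen on consecutive ports; the pipeline IDs, config paths, and ports are illustrative:

```yaml
# pipelines.yml — four identical pipelines, each with its own persistent queue.
# Each referenced config differs only in its beats input port, e.g.
# pipeline_1.conf contains: input { beats { port => 5044 } }
- pipeline.id: pq_1
  path.config: "/etc/logstash/conf.d/pipeline_1.conf"
  queue.type: persisted
- pipeline.id: pq_2
  path.config: "/etc/logstash/conf.d/pipeline_2.conf"
  queue.type: persisted
- pipeline.id: pq_3
  path.config: "/etc/logstash/conf.d/pipeline_3.conf"
  queue.type: persisted
- pipeline.id: pq_4
  path.config: "/etc/logstash/conf.d/pipeline_4.conf"
  queue.type: persisted
```

On the Filebeat side, load balancing across those four endpoints could then be configured roughly as follows (hostname and ports again illustrative):

```yaml
# filebeat.yml — spread events across the four Logstash pipelines
output.logstash:
  hosts: ["logstash-host:5044", "logstash-host:5045", "logstash-host:5046", "logstash-host:5047"]
  loadbalance: true   # distribute batches across all listed hosts rather than picking one
```

Because each pipeline has its own persistent queue, each queue is written to its own directory on disk, so the queues are driven by four threads instead of one.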
After increasing the number of pipelines to four and splitting the input data across them, Logstash performance with persistent queues increased to about 30K events/s, only 25% worse than without persistent queues. At that point the disks were saturated, and no further performance improvement was possible.
As shown in the comments below, this approach has also delivered substantial performance gains for other Logstash users. Did this solution help you? If so, please consider leaving a comment below!