By default, Logstash uses in-memory bounded queues between pipeline stages (inputs → pipeline workers) to buffer events. However, in order to protect against data loss during abnormal termination, Logstash has a persistent queue feature which can be enabled to store the message queue on disk. The queue sits between the input and filter stages as follows:
input → queue → filter + output
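For reference, the persistent queue is enabled via the `queue.type` setting. A minimal sketch of the relevant logstash.yml settings (the size value here is illustrative, not a recommendation):

```yaml
# logstash.yml: switch from the default in-memory queue to the
# on-disk persistent queue
queue.type: persisted

# Optional: cap how much disk the queue may use (illustrative value)
queue.max_bytes: 4gb
```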
According to the following blog post, Logstash persistent queues should have only a small impact on overall throughput. While this is likely true for use cases where the pipeline is CPU bound, it does not hold in general.
In a recent Logstash implementation, enabling Logstash persistent queues caused a slowdown of about 75%, from about 40K events/s down to about 10K events/s. Somewhat surprisingly, based on disk I/O metrics it was clear that the disks were not saturated. Additionally, standard Logstash tuning techniques such as testing different batch sizes and adding more worker threads were unable to remedy this slowdown.
Why persistent queues may impact Logstash performance
Investigation showed that throughput was limited because a single Logstash pipeline runs a single-threaded persistent queue; in other words, a single Logstash pipeline drives data to disk from only a single thread. This is true even if the pipeline has multiple inputs, as additional inputs in a single pipeline do not add disk I/O threads. Furthermore, because enabling the persistent queue inserts synchronous disk I/O (wait time) into the pipeline, it reduces throughput even when none of the resources on the system are maxed out.
Given that Logstash throughput was limited by synchronous disk I/O rather than by resource constraints, more threads running in parallel were needed to drive the disks harder and increase overall throughput. This was accomplished by running multiple identical pipelines in parallel within a single Logstash process, and then load balancing the input data stream across the pipelines. If data is driven into Logstash by Filebeat, load balancing can be done by specifying multiple Logstash outputs in Filebeat.
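As a sketch of the Filebeat side of this load balancing (host name and ports are placeholders), each endpoint corresponds to one of the parallel Logstash pipelines:

```yaml
# filebeat.yml: spread events across two Logstash endpoints,
# each backed by a separate pipeline (placeholder host/ports)
output.logstash:
  hosts: ["logstash.example.com:5044", "logstash.example.com:5045"]
  loadbalance: true
```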
After increasing the number of pipelines to 4 and splitting the input data across these 4 pipelines, Logstash performance with persistent queues increased to about 30K events/s, or only 25% worse than without persistent queues. At this point the disks were saturated, and no further performance improvements were possible.
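A minimal sketch of the four-pipeline layout in pipelines.yml (the pipeline ids and config paths are illustrative, not taken from the actual deployment):

```yaml
# pipelines.yml: four identical pipelines, each with its own
# persistent queue and therefore its own disk I/O thread
- pipeline.id: ingest1
  path.config: "/etc/logstash/conf.d/ingest1.conf"
  queue.type: persisted
- pipeline.id: ingest2
  path.config: "/etc/logstash/conf.d/ingest2.conf"
  queue.type: persisted
- pipeline.id: ingest3
  path.config: "/etc/logstash/conf.d/ingest3.conf"
  queue.type: persisted
- pipeline.id: ingest4
  path.config: "/etc/logstash/conf.d/ingest4.conf"
  queue.type: persisted
```

The .conf files are identical copies except for the input port, so the split is purely about parallelizing the queue, not about routing different data differently.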
As shown in the comments below, this approach has also helped other Logstash users with substantial performance gains. Did this solution help you? If so, please consider leaving a comment below!
4 thoughts on “Improving the performance of Logstash persistent queues”
I don’t exactly understand what you mean by:
“….This was accomplished by splitting the source data into multiple streams, running multiple pipelines in parallel within a Logstash process, and targeting each stream at a different pipeline….”
Do you mean you have increased the number of workers? Or do you mean you somehow split one big stream of data into multiple streams before they enter a pipeline?
Can you provide an example of your logstash.yml, pipelines.yml and a path.config file where you ‘split’ the source data?
My situation: lots of Winlogbeat agents filling up one pipeline, with mediocre Logstash EPS performance.
“Do you mean you somehow split one big stream of data into multiple before they enter a pipeline?” – Yes.
For example, if you are using something like Filebeat you can specify multiple Logstash output destinations, which could be parallel pipelines, and Filebeat would load balance between the destinations – e.g. https://www.elastic.co/guide/en/beats/filebeat/current/load-balancing.html. If you are not using Filebeat, then you might have another way of targeting some of your data at one Logstash pipeline and other data at a different Logstash pipeline. Does that make sense?
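To make this concrete (file names and ports are hypothetical), each parallel pipeline's config file can be an identical copy in which only the Beats input port differs:

```conf
# ingest1.conf (hypothetical): listens on its own port
input {
  beats {
    port => 5044
  }
}
# ingest2.conf would be identical except for "port => 5045";
# the filter and output sections stay the same in every copy.
```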
Thanks, this helped! It almost doubled our Events Received Rate (/s) and Events Emitted Rate (/s) when we added two more pipelines doing the exact same thing.
CPU Utilization (%) went from about 40% to 50%. I think we could add even more pipelines, but for now let’s see what happens in the long run.
This should be added to the tuning guide here: https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html and here: https://www.elastic.co/guide/en/logstash/current/performance-troubleshooting.html
Just to explain for anyone reading this:
What we did was create 2 extra Beats pipelines on the same Logstash machine.
1. Add 2 additional pipelines in pipeline.yml
- pipeline.id: beats
- pipeline.id: beats2
- pipeline.id: beats3
2. Copy beats.config as beats2.config and beats3.config, where we only modified the (listening) port for each pipeline (so 3 different ports). The rest (filter, output) was untouched. Change in each beatsN.config:
port => XXXXn
3. Modify the Winlogbeat config on the machines where we collect logging, so that it outputs to two pipelines (ports) on the same machine, and enable load balancing (true). winlogbeat.yml change:
hosts: [“x.x.x.x:xxxn”, “x.x.x.x:xxxn+1”]
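Putting the Winlogbeat side of these steps together (addresses and ports are placeholders, not the commenter's actual values), the relevant winlogbeat.yml section would look roughly like:

```yaml
# winlogbeat.yml: send to multiple Logstash pipelines on the same
# host and balance load between them (placeholder addresses/ports)
output.logstash:
  hosts: ["x.x.x.x:5044", "x.x.x.x:5045", "x.x.x.x:5046"]
  loadbalance: true
```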
Eventually we ended up creating 5 parallel pipelines, which tripled our throughput.
More importantly, I would like to point people in the direction of pipeline-to-pipeline communication. Sending logs from all different sources to one loaded pipeline is a challenge, but splitting them up into different indices while keeping your conf files clean and sorted is another thing entirely. So people should definitely look into pipeline-to-pipeline communication when they connect lots of Beats agents. (Send tags with the Beats client to identify every category/type of log source.)