Using Grok with Elasticsearch to add structure to your data

July 13, 2020

Introduction

As well as being a search engine, Elasticsearch is also a powerful analytics engine. However, in order to take full advantage of the near-real-time analytics capabilities of Elasticsearch, it is often useful to add structure to your data as it is ingested into Elasticsearch. The reasons for this are explained very well in the schema on write vs. schema on read article, and for the remainder of this blog, when I talk about structuring data, I am referring to schema on write.

Because of the importance of structuring your data, in this blog I will show you how to add structure to unstructured documents by using an ingest node with the Grok Processor. Then, I will describe a simple method to construct new Grok patterns, and a method that can be used to debug errors in existing Grok patterns. Finally I will provide links to some publicly available Grok patterns and then briefly mention the Dissect Processor as a possible alternative to Grok.

As a side note, if you are going to put in the effort to structure your data, you should consider structuring your data so that it conforms to the Elastic Common Schema, which will facilitate the analysis of data from diverse sources.

An example of adding structure to unstructured data

It is not uncommon to see documents sent to Elasticsearch that are similar to the following:

{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
}

The message field in the above document contains unstructured data. It is a series of words and numbers that are not suitable for near-real-time analytics. In order to take full advantage of the powerful analytics capabilities of Elasticsearch, we should parse the message field to extract relevant data. For example, we could extract the following fields from the above message:

"host.ip": "55.3.244.1" 
"http.request.method": "GET"
"url.original": "/index.html"
"http.request.bytes": 15824
"event.duration": 0.043

Adding such a structure will allow you to unleash the full power of Elasticsearch on your data.

Using Grok to structure data

Grok is a tool that can be used to extract structured data out of a given text field within a document. You define a field to extract data from, as well as the Grok pattern for the match. Grok sits on top of regular expressions. However, unlike regular expressions, Grok patterns are made up of reusable patterns, which can themselves be composed of other Grok patterns. 

Before going into details of how to build and debug your own Grok patterns, we first give a quick overview of what a Grok pattern looks like, how it can be used in an ingest pipeline, and how it can be simulated. Don’t worry if you don’t fully understand the details of the Grok expression yet, as these details will be discussed in-depth in the following sections of this blog.

In the previous section we presented an example document that looks as follows:

{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
}


The desired structure can be extracted from this example message field by using the following Grok expression:

%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}

Next, we define an ingest pipeline that contains this Grok pattern inside a Grok processor:

PUT _ingest/pipeline/example_grok_pipeline
{
  "description": "A simple example of using Grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}"
        ]
      }
    }
  ]
}

We can then simulate the above pipeline with the following command.

POST _ingest/pipeline/example_grok_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
      }
    }
  ]
}

Which responds with a structured document that looks as follows: 

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "host" : {
            "ip" : "55.3.244.1"
          },
          "http" : {
            "request" : {
              "method" : "GET",
              "bytes" : 15824
            }
          },
          "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff",
          "event" : {
            "duration" : 0.043
          },
          "url" : {
            "original" : "/index.html"
          }
        },
        "_ingest" : {
          "timestamp" : "2020-06-24T22:41:47.153985Z"
        }
      }
    }
  ]
}


This document contains the original unstructured message field, and it also contains all of the additional fields that have been extracted from the message. We now have a document that contains structured data!

A note on ingest pipelines 

In the above example we simulated execution of an ingest pipeline that contains our Grok pattern, but didn’t actually run it on any real documents. An ingest pipeline is designed to process documents at ingest time, as described in the ingest node documentation. One way to execute an ingest pipeline is by adding the pipeline name to the PUT command as follows: 

PUT example_index/_doc/1?pipeline=example_grok_pipeline
{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
}

And the document that has been written can be seen by executing:

GET example_index/_doc/1

Which will respond with the following:

{
  "_index" : "example_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "host" : {
      "ip" : "55.3.244.1"
    },
    "http" : {
      "request" : {
        "method" : "GET",
        "bytes" : 15824
      }
    },
    "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff",
    "event" : {
      "duration" : 0.043
    },
    "url" : {
      "original" : "/index.html"
    }
  }
}

Alternatively (and likely preferably), the ingest pipeline can be applied by default to all documents that are written to a given index by adding it to the index settings:

PUT example_index/_settings
{
  "index.default_pipeline": "example_grok_pipeline"
}
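
If you prefer, the same setting can also be supplied up front when the index is created (a small sketch; the index name below is only for illustration):

PUT another_example_index
{
  "settings": {
    "index.default_pipeline": "example_grok_pipeline"
  }
}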


After adding the pipeline to the settings, any documents that are written to example_index will automatically have the example_grok_pipeline applied to them. 

This can be verified by writing a new document to example_index as follows:

PUT example_index/_doc/2
{
  "message": "66.3.244.1 GET /index.html 500 0.120 new other stuff"
}

And the document that has been written can be seen by executing:

GET example_index/_doc/2

Which, as expected, will return the document that we just wrote. This document has the new fields that were extracted from the message field:

{
  "_index" : "example_index",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 3,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "host" : {
      "ip" : "66.3.244.1"
    },
    "http" : {
      "request" : {
        "method" : "GET",
        "bytes" : 500
      }
    },
    "message" : "66.3.244.1 GET /index.html 500 0.120 new other stuff",
    "event" : {
      "duration" : 0.12
    },
    "url" : {
      "original" : "/index.html"
    }
  }
}

Understanding the Grok pattern

In the previous section, we presented an example document with the following structure:

{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
}

And we then used the following Grok pattern to extract structured data from the message field:

"%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}"

As described in the Grok Processor documentation, the syntax for Grok patterns comes in three forms: %{SYNTAX:SEMANTIC}, %{SYNTAX}, %{SYNTAX:SEMANTIC:TYPE}, all of which we can see in the above Grok pattern. 

  • The SYNTAX is the name of the pattern that will match your text. Built-in SYNTAX patterns can be seen on GitHub.
  • The SEMANTIC is the name of the field that will store the data that matches the SYNTAX pattern.
  • The TYPE is the data type to which you wish to cast your named field.
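
For example, taking one component from the Grok expression above:

%{NUMBER:http.request.bytes:int}

Here NUMBER is the SYNTAX (match a number), http.request.bytes is the SEMANTIC (the field in which the matched value is stored), and int is the TYPE (the matched text is cast to an integer rather than kept as a string).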

The first part of the Grok pattern is the following:

%{IP:host.ip}

This declaration matches an IP address (corresponding to the IP Grok pattern) and stores it in a field called host.ip. For our example data, this will extract a value of 55.3.244.1 and store it in the host.ip field.

If we want more details on the IP Grok pattern, we can look into the Grok patterns on Github, and we will see the following definition: 

IP (?:%{IPV6}|%{IPV4})

This means that the IP pattern will match one of the IPV6 or IPV4 Grok patterns. To understand what the IPV6 and IPV4 patterns are, once again we can look into the Grok patterns on Github to see their definitions, and so on. 
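
Because Grok patterns are just named building blocks, you are not limited to the built-in ones. As a small sketch (the pattern name MY_METHOD and the pipeline name below are made up for illustration), the Grok processor accepts a pattern_definitions parameter that lets you declare a custom pattern and then reference it like any built-in pattern:

PUT _ingest/pipeline/custom_pattern_example
{
  "description": "Illustrative pipeline that defines a custom Grok pattern",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{MY_METHOD:http.request.method}"
        ],
        "pattern_definitions": {
          "MY_METHOD": "GET|POST|PUT|DELETE"
        }
      }
    }
  ]
}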

The next part of the Grok pattern is a single whitespace character followed by the following expression:

%{WORD:http.request.method}

This portion of the Grok expression extracts the word GET from the message and stores it in the http.request.method field. If we want to understand the definition of the WORD pattern, we can look at the Grok patterns on GitHub.

One can do the same kind of analysis to understand the patterns that match the url.original, http.request.bytes, and event.duration fields, which we leave as an exercise for the reader.

Finally, the last statement in the Grok pattern is the following:

%{GREEDYDATA}

This expression does not have a SEMANTIC part, which means that the matching data is not stored in any field. Additionally, the GREEDYDATA Grok pattern will consume as much text as it can, which means that in our example it will match everything after the event.duration field. The GREEDYDATA expression will come in handy when debugging complex Grok patterns, as discussed in the following sections of this blog.

Incrementally constructing a new Grok pattern

When constructing a new Grok pattern, it is often easiest to construct the Grok pattern incrementally starting from the left and working towards the right side of the unstructured text that we are trying to match. 

Two tools that can be helpful for building and debugging Grok patterns are the simulate pipeline API which we used earlier in this article, and Kibana’s Grok debugger. The incremental construction method shown here will work equally well with either tool.

Let’s assume that we are told to write a Grok pattern to parse the following message (the same one as before):

"55.3.244.1 GET /index.html 15824 0.043 other stuff"

Let’s also assume that we have been told to structure the above data into ECS-compliant field names, and we have been given the following information about the above message: 

  • The first token is a host IP address.
  • The second token is an http request method.
  • The third token is a URI.
  • The fourth token is the size of the request in bytes.
  • The fifth token is the event duration.
  • The remaining text is just some additional text that we don’t care about. 

Based on these instructions, for the above message we would like to have the following ECS-compliant fields extracted, as we discussed earlier in this blog: 

"host.ip": "55.3.244.1" 
"http.request.method": "GET"
"url.original": "/index.html"
"http.request.bytes": 15824
"event.duration": 0.043

Remember that we are working incrementally to build up the Grok expression from left to right. So, let’s start by seeing if we can pull out the IP address from the message. We will use the IP Grok pattern to match the host.ip field, and the GREEDYDATA pattern to capture everything after the IP address. This would look as follows:

%{IP:host.ip}%{GREEDYDATA:my_greedy_match}

Let’s use Kibana’s Grok debugger to see if this Grok pattern is able to parse the message. This would look as follows:

Which has worked as expected: the host.ip field has been correctly extracted, and the remainder of the message has been stored in my_greedy_match.
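
If you prefer to stay in the Dev Tools console rather than the Grok Debugger, the same check can be made with the simulate pipeline API by supplying the pipeline definition inline (a quick sketch using the same field names as above):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{IP:host.ip}%{GREEDYDATA:my_greedy_match}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
      }
    }
  ]
}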

Let’s add in the next part of the Grok pattern. We know that this is the http.request.method field, which is a WORD Grok pattern. We therefore augment our Grok pattern as follows:

%{IP:host.ip}%{WORD:http.request.method}%{GREEDYDATA:my_greedy_match}

However, as shown below, testing this in Kibana’s debugger gives an empty response. This is not what we expected! 

The reason for the empty response is that the pattern didn’t match: the message has a space between the host.ip (in this example 55.3.244.1) and the http.request.method (in this example GET), but we did not include a space in the Grok pattern. Let’s fix this error and try again with the following Grok pattern:

%{IP:host.ip} %{WORD:http.request.method}%{GREEDYDATA:my_greedy_match}

And test it in Kibana as follows:

This has worked! We have now extracted both the host.ip and the http.request.method fields. But we still have work to do to parse the remaining fields. We can continue to incrementally add to our Grok pattern until we end up with the following Grok pattern:

%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:my_greedy_match}

Which we can test in Kibana as follows:

It works as expected! However, for this example we are not interested in keeping the my_greedy_match field around, so we can remove it from our Grok expression as follows:

%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}

Which will look like the following in Kibana:

This looks exactly how we want it to look! We now have a Grok pattern that we can use to structure the data that is contained in the message field.  

Using divide-and-conquer to debug a broken Grok pattern

We can also use the simulate pipeline API or Kibana’s Grok Debugger to help us debug a broken Grok pattern. The divide-and-conquer method described below will work equally well with either of these tools, and should help you quickly find the reason that a Grok pattern is not matching your data.

Let’s imagine that we are trying to parse a relatively long message, such as the following message, which is an entry from an Elasticsearch slow log:

[2020-05-14T20:39:29,644][INFO ][index.search.slowlog.fetch.S-OhddFHTc2h6w8NDzPzIw] [instance-0000000000] [kibana_sample_data_flights][0] took[3.7ms], took_millis[3], total_hits[416 hits], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[1], source[{"query":{"match":{"DestCountry":{"query":"AU","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","auto_generate_synonyms_phrase_query":true,"boost":1.0}}}}], id[],  

And let’s assume that we have found the following Grok pattern on the internet, which we have been told should parse Elasticsearch slow logs, but for some reason it isn’t working! 

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]took\[%{DATA:slowlog.took}\],\stook_millis\[%{DATA:slowlog.took_millis:float}\],\stotal_hits\[%{DATA:slowlog.total_hits:int}\shits\]\,\stypes\[%{DATA:slowlog.types}\],\sstats\[%{DATA:slowlog.stats}\],\ssearch_type\[%{DATA:slowlog.search_type}\],\stotal_shards\[%{DATA:slowlog.total_shards:int}\],\ssource\[%{GREEDYDATA:slowlog.source}\],\sid\[%{DATA:slowlog.x-opaque-id}\]

We can use Kibana’s Grok Debugger to try to figure out where the error is. We paste the data and the Grok pattern into the Grok Debugger as shown below: 

The structured data response is empty, which confirms that the Grok pattern did not match the sample data. Let’s make sure that the Grok Debugger is working by defining a pattern that we know will match anything and storing the result in a field called my_greedy_match. This can be accomplished with the Grok pattern %{GREEDYDATA:my_greedy_match}, which will result in an output that looks like the following:

For this pattern, Grok has stored the entire contents of the sample data into a field called my_greedy_match, which is what we expected for this test. 

Next, we start a divide-and-conquer approach to figure out where the error is in our Grok pattern. We can do this by copying approximately the first half of the broken Grok pattern into a new expression, and replacing the second half with the GREEDYDATA expression that we just saw. This new Grok pattern looks as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]took\[%{DATA:slowlog.took}\],\stook_millis\[%{DATA:slowlog.took_millis:float}\],%{GREEDYDATA:my_greedy_match}

After pasting this Grok Pattern into Kibana’s Grok Debugger, we see that the Structured Data response is still empty. 

This means that the error is in the first half of the Grok pattern. So let’s divide it in half again as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*%{GREEDYDATA:my_greedy_match}

Pasting this into Kibana’s debugger as follows shows that structured data has been correctly extracted: 

We now know that there is not an error in the first quarter of the Grok pattern, and that there is an error before the midpoint of the Grok pattern. So let’s put the GREEDYDATA expression at approximately the three-eighths location of the original Grok pattern, as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]%{GREEDYDATA:my_greedy_match}

This will look as follows in Kibana’s debugger, which shows a match:

So we know that the error is somewhere between the three-eighths point and the midpoint of the Grok pattern. Let’s try adding back in a bit more of the original Grok pattern as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]took%{GREEDYDATA:my_greedy_match}

Which returns an empty response as shown in Kibana’s debugger below. 

So there is something wrong after the extraction of the slowlog.shard value. If we re-examine the message that we are parsing, we will see that the took string should have a whitespace character in front of it. Let’s modify the Grok pattern to see if it works when we specify a whitespace in front of took, as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]\stook%{GREEDYDATA:my_greedy_match}

It works, as shown below.

But we still have a bunch of data stored in my_greedy_match. Let’s add back the remainder of the original Grok pattern as follows:

\[%{TIMESTAMP_ISO8601:event.end}\]\[%{LOGLEVEL:log.level}\s*\]\[%{DATA:slowlog.type}\]\s*\[%{DATA:host.name}\]\s*\[%{DATA:slowlog.index}\]\s*\[%{DATA:slowlog.shard:int}]\stook\[%{DATA:slowlog.took}\],\stook_millis\[%{DATA:slowlog.took_millis:float}\],\stotal_hits\[%{DATA:slowlog.total_hits:int}\shits\]\,\stypes\[%{DATA:slowlog.types}\],\sstats\[%{DATA:slowlog.stats}\],\ssearch_type\[%{DATA:slowlog.search_type}\],\stotal_shards\[%{DATA:slowlog.total_shards:int}\],\ssource\[%{GREEDYDATA:slowlog.source}\],\sid\[%{DATA:slowlog.x-opaque-id}\]

And then paste the Grok pattern into Kibana’s Grok Debugger as follows: 

The Grok pattern is working! We have now extracted structured data from the previously unstructured slowlog entry. 

Publicly available Grok patterns

Using basic Grok patterns, you can build up complex patterns to match your data. Furthermore, the Elastic Stack ships with many reusable grok patterns. See Ingest node grok patterns and Logstash grok patterns for the complete list of patterns.
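
For example, rather than building a pattern for Apache access logs from scratch, you can often drop in a shipped pattern such as COMBINEDAPACHELOG (a sketch; check the pattern lists linked above for the exact pattern names available in your version). A grok processor entry like the following could go in the processors array of a pipeline:

{
  "grok": {
    "field": "message",
    "patterns": [
      "%{COMBINEDAPACHELOG}"
    ]
  }
}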

Alternative to Grok

In some cases it may be possible to use the Dissect Processor to extract structured fields out of a single text field. Similar to the Grok Processor, dissect extracts structured fields out of a single text field within a document. However, unlike the Grok Processor, dissect does not use regular expressions. This keeps dissect’s syntax simple, and dissect may also be faster than the Grok Processor.
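
As a rough sketch (not a drop-in replacement for the Grok pipeline above, and the pipeline name is only for illustration), a dissect-based pipeline for our earlier example message might look like the following. Note that dissect extracts every value as a string, so numeric fields such as http.request.bytes would still need a separate convert processor if you want them stored as numbers:

PUT _ingest/pipeline/example_dissect_pipeline
{
  "description": "A simple example of using dissect",
  "processors": [
    {
      "dissect": {
        "field": "message",
        "pattern": "%{host.ip} %{http.request.method} %{url.original} %{http.request.bytes} %{event.duration} %{rest}"
      }
    }
  ]
}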

Conclusion

In this blog I showed how Grok can be used for structuring unstructured data. Next I showed a simple method for incrementally constructing Grok patterns, followed by a method for debugging errors in existing Grok patterns. Finally I provided links to some publicly available Grok patterns and briefly mentioned the Dissect Processor as an alternative to Grok. 
