Using Kibana’s Painless Lab (Beta) to test an ingest processor script

Introduction

In several previous blog posts I have shown how a Painless script can be used to process new documents as they are ingested into an Elasticsearch cluster. In each of these posts I have made use of the simulate pipeline API to test the Painless scripts.

While developing such scripts, it can be helpful to use Painless Lab (Beta) in Kibana to debug them. In this blog I will show how to use Painless Lab to develop and debug custom scripts, and how those scripts can then be easily copied into ingest pipelines.

Example

In the blog post titled Using Elasticsearch Painless scripting to recursively iterate through JSON fields, we demonstrated how to iterate over all elements in a document, and then delete each field where the value is an empty string. The code was written as a script processor in an ingest pipeline, and then simulated.

When developing this Painless script (before putting the code into an ingest pipeline), Painless Lab can be used to catch syntax errors in real time. The code from that blog can be tested in Painless Lab by entering test values in the “Parameters” tab as demonstrated in the right pane shown below.

A few modifications to the ingest pipeline code from the previous blog are required to get it to execute correctly in Painless Lab.

When used in an ingest processor (which is where this script will ultimately execute once it has been debugged), the script expects the “ctx” variable to contain the source of the document that is currently being processed. However, because Painless Lab does not (yet) provide a way of passing “ctx” directly to the script, this can be faked by setting “Parameters” to a JSON document with a field called “ctx” (line 2 in the above diagram on the right) that contains the “real” document as its value. We then declare a variable “ctx” in the script that is set to “params.ctx” (line 17 in the above diagram on the left).
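To make this concrete, here is a minimal sketch of that setup (the test document is illustrative, based on the examples later in this post). In the “Parameters” tab, wrap the test document in a field called “ctx”:

{
  "ctx": {
    "key1": "first value",
    "key2": "some other value",
    "key3": ""
  }
}

Then, at the top of the script pane, alias the wrapped document to a local variable so that the rest of the script can refer to “ctx” exactly as it would inside an ingest processor:

// Painless Lab exposes the contents of the "Parameters" tab as "params",
// so pull the wrapped document out of it.
def ctx = params.ctx;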

You can easily view the output by clicking on the “Output” tab as follows.

Notice that in the above “Output” the result is as expected: “key3” has been removed because it contained an empty string. Now that we have confirmed that the script behaves as expected, some modification is required to get it into a state that can be used in an ingest pipeline. In the above example, line 17 and line 19 would need to be removed. You will then end up with the same script as the one demonstrated and verified in Using Elasticsearch Painless scripting to recursively iterate through JSON fields. It is therefore quite straightforward to copy code built in Painless Lab with this technique into an ingest pipeline.

Conclusion

In this blog, I have shown how you can use Painless Lab to debug scripts that will be used in an ingest processor. This provides real-time syntax verification and immediate feedback on what the output document will look like.

Acknowledgement

Thanks to Honza Kral for pointing out the trick of setting ctx = params.ctx.

Using Elasticsearch Painless scripting to recursively iterate through JSON fields

Authors

  • Alexander Marquardt
  • Honza Kral

Introduction

Painless is a simple, secure scripting language designed specifically for use with Elasticsearch. It is the default scripting language for Elasticsearch and can safely be used for inline and stored scripts. In one of its many use cases, Painless can modify documents as they are ingested into your Elasticsearch cluster. In this use case, you may find that you would like to use Painless to evaluate every field in each document that is received by Elasticsearch. However, because of the hierarchical nature of JSON documents, how to iterate over all of the fields may be non-obvious.

This blog provides examples that demonstrate how Painless can iterate across all fields in each document that Elasticsearch receives, regardless of whether the fields appear directly in the top-level JSON body or are contained in sub-documents or arrays.

Example one – remove empty fields

The following Painless script, called “remove_empty_fields”, shows how to loop over all elements in a document and delete each field where the value is an empty string.

PUT _ingest/pipeline/remove_empty_fields
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // Recursively visit every field in the document and remove
          // any field whose value is an empty string.
          void iterateAllFields(def x) {
            // If this is an array, recurse into each element.
            if (x instanceof List) {
              for (def v: x) {
                iterateAllFields(v);
              }
            }
            // Anything that is not an object has no fields to remove.
            if (!(x instanceof Map)) {
              return;
            }
            // removeIf deletes entries safely while iterating.
            x.entrySet().removeIf(e -> e.getValue() == "");
            // Recurse into each remaining value, which may itself be
            // a sub-document or an array.
            for (def v: x.values()) {
              iterateAllFields(v);
            }
          }

          iterateAllFields(ctx);
        """
      }
    }
  ]
}

Notice that we use removeIf in the above code, which correctly removes fields with an empty string as their value. A more naive approach that iterates over the fields returned by “x.entrySet()” with a for loop and directly deletes elements inside that loop would throw a “ConcurrentModificationException”, as you cannot modify a Map while it is being looped over.
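For illustration, the naive version would look something like the following sketch; it compiles, but fails at runtime as soon as it removes a field and then continues iterating:

// Naive approach: do NOT do this. Removing an entry from the Map
// while iterating over its entry set throws a
// ConcurrentModificationException.
for (def e: x.entrySet()) {
  if (e.getValue() == "") {
    x.remove(e.getKey());
  }
}

removeIf avoids this problem because the removal is performed through the collection’s own iterator.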

We can test the above script with the following call to the simulate pipeline API.

POST _ingest/pipeline/remove_empty_fields/_simulate
{
  "docs": [
    {
      "_source": {
        "key1": "first value",
        "key2": "some other value",
        "key3": "",
        "subdoc": {
          "a": "abc",
          "b": ""
        }
      }
    },
    {
      "_source": {
        "key1": "",
        "key2": "some other value",
        "list_of_docs": [
          {
            "foo": "abc",
            "bar": ""
          },
          {
            "baz": "",
            "subdoc_in_list": {"child1": "xxx", "child2": ""}
          }
        ]
      }
    }
  ]
}

This will return the following results, in which each field that contained an empty string has been removed.

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "key1" : "first value",
          "key2" : "some other value",
          "subdoc" : {
            "a" : "abc"
          }
        },
        "_ingest" : {
          "timestamp" : "2020-11-06T10:59:29.105406Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "list_of_docs" : [
            {
              "foo" : "abc"
            },
            {
              "subdoc_in_list" : {
                "child1" : "xxx"
              }
            }
          ],
          "key2" : "some other value"
        },
        "_ingest" : {
          "timestamp" : "2020-11-06T10:59:29.105411Z"
        }
      }
    }
  ]
}

Example two – remove fields where the field name matches a regular expression

The following Painless script, called “remove_unwanted_keys”, shows how you can remove keys whose names match a regular expression. In this example, we delete any fields where the field name starts with “unwanted_key_”. (Strictly speaking, the “=~” find operator used below matches the pattern anywhere in the key name; for the keys in this example the effect is the same as a starts-with match.)

Note that regexes are disabled in Painless by default. To load this script you will first need to set “script.painless.regex.enabled” to “true” in “elasticsearch.yml”.
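For example, add the following line to “elasticsearch.yml” on each node (this is a static setting, so the nodes must be restarted for it to take effect):

# elasticsearch.yml
script.painless.regex.enabled: true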

PUT _ingest/pipeline/remove_unwanted_keys
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          void iterateAllFields(def x) {
            if (x instanceof List) {
              for (def v: x) {
                iterateAllFields(v);
              }
            }
            if (!(x instanceof Map)) {
              return;
            }
            x.entrySet().removeIf(e -> e.getKey() =~ /unwanted_key_.*/);
            for (def v: x.values()) {
              iterateAllFields(v);
            }
          }

          iterateAllFields(ctx);
        """
      }
    }
  ]
}

We can then test the above script with the following call to the simulate pipeline API.

POST _ingest/pipeline/remove_unwanted_keys/_simulate
{
  "docs": [
    {
      "_source": {
        "key1": "first value",
        "key2": "some other value",
        "key3": "",
        "unwanted_key_something": "get rid of this",
        "unwanted_key_2": "this too",
        "subdoc": {
          "foo": "abc",
          "bar": ""
        }
      }
    }
  ]
}

This will return the following results, in which each field whose name started with “unwanted_key_” has been removed.

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "key1" : "first value",
          "key2" : "some other value",
          "key3" : "",
          "subdoc" : {
            "bar" : "",
            "foo" : "abc"
          }
        },
        "_ingest" : {
          "timestamp" : "2020-11-06T11:19:56.839119Z"
        }
      }
    }
  ]
}

Conclusion

In this blog we have presented two examples of how all elements in a JSON document can be iterated over, regardless of whether they appear in the top-level JSON or within sub-documents or arrays.