Using Logstash to scan inside event contents to replace sensitive data with a consistent hash

Introduction

Logstash is commonly used for transforming data before it is sent to another system for storage, and so it is often well positioned for finding and replacing sensitive text, as may be required for GDPR compliance.

In this blog I show how Logstash can make use of a ruby filter to scan through the contents of an event and replace each occurrence of sensitive text with the value of its hash.

This is done with Ruby’s gsub functionality: we provide a regular expression pattern that defines the text to be replaced (in this blog, we demonstrate with an email address pattern) and a hash function that calculates each replacement value.
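
To make the mechanics concrete, here is a minimal standalone Ruby sketch (outside of Logstash) of the same idea: gsub is given a regular expression and a block, and the block’s return value replaces each match. The sample text is just for illustration; the resulting digest is the same one that appears in the example output later in this post.

require 'digest'

# The same style of email regex that is used in the Logstash filter below
email_regex = /([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})/

text = 'Someone had an email address foobar@example.com'

# Each match is passed to the block, and the block's return value (the SHA1 digest) replaces it
puts text.gsub(email_regex) { |match| Digest::SHA1.hexdigest(match) }
# Prints: Someone had an email address 6f25d1a16b65ee184e83d06a268af7f44d4e8a10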

Note that the use case addressed in this blog is different from the use cases for the fingerprint filter. The fingerprint filter can combine and hash one or more fields, but it does not analyze or replace substrings inside the field(s).

This blog also demonstrates several other Logstash concepts, including:

  1. Use of a generator in Logstash to automatically create new events to make testing of your filters quick and easy.
  2. Defining custom ruby code inside your Logstash pipeline.
  3. Use of the stdout output to easily debug your Logstash pipeline.
  4. Automatic reload functionality to allow you to immediately validate any code changes that you make in your Logstash pipeline.

Acknowledgement

Thanks to my co-worker Joao Duarte at Elastic for coming up with the custom ruby filter that is presented in this blog.

Code description

The code given below demonstrates an entire Logstash pipeline that creates a simulated input event containing a message with multiple email addresses. It then processes the event with a custom ruby filter that finds and replaces each email address in the “message” field with its SHA1 digest, and finally writes the modified event to stdout.

Logstash pipeline

input {
    generator {
        lines => [
            '{"message": "Someone had an email address foobar@example.com and sent mail to foobaz@another-example.com"}'
        ]
        count => 1
        codec => "json"
    }
}

filter {
    ruby {
        # Load the digest library and compile the email regex once, when the pipeline starts
        init => "require 'digest'; @email_regex = /([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})/"
        # Replace each regex match in the message field with its SHA1 digest
        code => "str = event.get('message'); event.set('message', str.gsub(@email_regex) {|v| Digest::SHA1.hexdigest(v) })"
    }
}

output {
    stdout { codec => "rubydebug" }
}

Executing the pipeline

The above pipeline can be executed with the following command line, which will automatically reload the pipeline each time the configuration file is modified:

./bin/logstash -f <your pipeline config file> --config.reload.automatic

This will generate the following output, which confirms that the email addresses in the message field (of the event created in the generator) have been replaced with their SHA1 hash values, as desired:

{
      "@version" => "1",
          "host" => "New2020MacBook.lan",
       "message" => "Someone had an email address 6f25d1a16b65ee184e83d06a268af7f44d4e8a10 and sent mail to f1377593215966404efb0f42c6ce48017c2c5522",
    "@timestamp" => 2022-01-20T17:50:10.598Z,
      "sequence" => 0
}

Conclusion

In this brief post, I have demonstrated the following concepts:

  1. How to use a generator to create a custom event to help easily verify the functionality of your Logstash filters and to help debug your Logstash pipeline.
  2. How to define a custom Ruby filter in your Logstash pipeline.
  3. How to make use of Ruby’s gsub functionality along with a regular expression and a call to Digest::SHA1.hexdigest to replace sensitive text.
  4. How to view the resulting modified events on stdout.
  5. How to automatically reload your pipeline as you make edits.

A summary of recent research on increasing lifespan

Authors

  • Ronald R. Marquardt
  • Suzhen Li
  • Alexander Marquardt

In association with All Natural Nutritional Products (ANNP) Inc.

Introduction

This blog provides an overview of recent research in the field of ageing.

A brief overview of the book: “Lifespan: Why we age — and why we don’t have to”

David A. Sinclair and M. D. LaPlante have written a New York Times Bestseller called Lifespan: Why we age — and why we don’t have to. David Sinclair has also given many interviews and talks which can be found online.

Below we summarize some of the highlights from this book and from his online discussions.

  • David’s theory is that ageing is quite simply a loss of information over time.
  • Our cells store both digital and analog information. The main difference between them is that analog information is continuous whilst digital is discrete.
  • Our genetic code is digital, while the epigenetic code that determines which genes are expressed in each cell is analog. For example, even though skin cells and nerve cells have the same DNA, it is the epigenetic code that causes different genes to be expressed in a skin cell than in a nerve cell.
  • Analog information is prone to the accumulation of noise and disruption over time.
  • David suggests that noise accumulation in epigenetic information disrupts gene expression and other nuclear processes within a cell, leading to malfunction and, inevitably, ageing of the organism.
  • In other words, ageing doesn’t involve a change in our DNA sequences, but rather a change in how our body reads those sequences, which, unlike genetic changes, may be reversible. Most epigenetic changes involve the reversible binding of certain compounds, such as methyl groups, to DNA.

Metabolic changes associated with ageing

  • Autophagy is an essential catabolic process that promotes the clearance of waste cellular components (cellular garbage). Loss of autophagy results in cell death and a number of human diseases. Caloric restriction or starvation promotes autophagy and therefore longevity and healthy living, whereas excess caloric intake has the opposite effect (obesity, etc.).
  • A critical metabolic compound required for the autophagic destruction of waste cellular material (garbage) is NAD+ (oxidized nicotinamide adenine dinucleotide).
  • The enzyme sirtuin uses NAD+ to activate lysosomes, which autophagically degrade waste proteins.
  • Sirtuin is a very important enzyme and is activated by resveratrol and some other natural compounds.
  • Essential requirements for autophagy and longevity:
    1. NAD+ and not NADH (oxidized vs reduced form)
    2. NAD+ is increased by fasting (caloric restriction), NAD+ precursor supplementation and activation of sirtuins by compounds such as resveratrol.
    3. The net effect is an enhancement of autophagy and life expectancy.

Suggestions for a lifestyle regime for longevity and healthy living (What does David do?)

David Sinclair does not make recommendations for any products or supplements; however, he has openly discussed his daily regime, which includes the following:

  • 1 gram of NMN: NMN is converted into NAD+, which is required for autophagy. NMN is expensive. Two new NMN forms have very recently been discovered which are 5 to 10 times more effective than NMN. They should be available in 2 to 5 years.
  • 1 gram of resveratrol. It enhances sirtuin activity which uses NAD+ to enhance autophagy. Health Canada recommends one 250 mg capsule containing 98% resveratrol, twice per day.
  • 1 gram of metformin (a prescription drug for diabetes but has autophagy benefits)
  • He takes a daily dose of vitamin D, vitamin K2 and 83 mg of aspirin (baby aspirin).
  • David has given up on desserts (he doesn’t want to consume sucrose, a metabolic poison).
  • David tries to skip a meal per day.
  • Our opinion of additional supplements that David could add to his regimen:
    1. Psyllium (Metamucil): A plant-based soluble fiber, or some other soluble fiber such as beta glucans or arabinoxylans. They reduce nutrient uptake and favorably modify the gut microflora, which produces beneficial short-chain fatty acids such as butyric acid.
    2. Grape seed extract, high in anthocyanidins (antioxidants).
    3. Daidzein, a phytoestrogen in soy isoflavones. Important for extension of life.
    4. Apigenin, luteolin or quercetin as inhibitors of the CD 38 and CD 157 NADase pathway. The net effect of these supplements is to increase NAD+ which can be used by sirtuins.

Autophagy: a requirement for healthy living and longevity

Autophagy is an essential process in which damaged cellular proteins are degraded by lysosomes. Autophagy is essential for homeostatic maintenance, and disruption of autophagy results in many metabolic diseases as well as a greatly shortened life span.

The sirtuin enzymes play an essential role in regulating autophagy, as well as in aging and various age-related diseases (including diabetes, cancer, and heart, liver, and kidney diseases). [See the review by Lee, Experimental & Molecular Medicine 51:102, 2019.]

The sirtuin enzymes react with NAD+ to remove acetyl groups from proteins, DNA, and histones, which is an essential autophagic step. Tissue concentrations of NAD+ can be increased by calorie restriction or by supplementation with NAD+ precursors such as NR or NMN. Resveratrol, a natural product, greatly increases the activity of sirtuins.

Supplements that can increase sirtuin activity and autophagy

This section is based on the review “The rise of NAD+ and sirtuin-activating compounds” [Bonkowski and Sinclair, Nature Reviews Molecular Cell Biology 17:679-690, 2016]. Important supplements include the following:

  • NAD+ precursors: NMN and NR are currently the best available supplements. NAD+ is an essential substrate for the sirtuin enzymes.
  • Sirtuin-activating compounds (STACs):
  • Resveratrol: It is the most potent natural activator of this essential enzyme. It is found in red wine and supplements.
  • Metformin: Metformin is a prescription drug for the control of type 2 diabetes. It also increases sirtuin activity.
  • Resveratrol and Metformin: A combination of the two compounds, when added to the diet, synergistically enhanced SIRT1 (sirtuin 1) activity. They have been shown to alleviate insulin resistance in rodents fed high-fat diets via amelioration of hyperlipidemia, hyperglycemia, inflammation and oxidative stress. A potentially powerful antiaging combination. [El-Agamy and Ahmed, Benha Medical Journal 37(3):561-577, 2020.]

Inhibition of enzymes that compete with sirtuin for NAD+:

  • There are three main classes of enzymes that consume NAD+: the PARP enzymes, the sirtuins, and the CD 38 and CD 157 enzymes. CD 38 and CD 157 are upregulated in ageing tissue. The age-dependent decline of NAD+ levels with increased activity of CD 38 and CD 157 ranges from 10 to 65% depending on the organ and age. CD 38 is one of the main NADases in mammals and plays a key role in the age-related decline of NAD+. Several inhibitors of CD 38 have been identified, including apigenin, quercetin and luteolin (naturally occurring flavonoids). Supplementation with these compounds inhibits NAD+ utilization by CD 38, leaving an increased amount for sirtuins and therefore an increased level of autophagy.
  • Inhibitors of CD 38: For current information on inhibition of NADases, please see the publications by Ogura et al., Aging (Albany NY) 12(12):11325-11336, 2020, and Covarrubias et al., Nature Reviews Molecular Cell Biology 22:119-141, 2021.
  • Naturally occurring flavonoids such as apigenin, quercetin and luteolin inhibit CD 38, an NADase other than the sirtuins.

Summary of NAD+ and its precursors:

NAD+ (nicotinamide adenine dinucleotide) is a coenzyme involved in the control of metabolism. It is involved in redox reactions, carrying electrons from one reaction to another. NAD+ decreases with age. The following precursors can increase NAD+ levels.

  • Niacin (vitamin B3, nicotinic acid): Niacin is a direct precursor of the coenzymes NAD+ and NADH.
  • NADH: NADH is the reduced form of NAD+ and is involved in the production of ATP, the cell’s main energy carrier.
  • NR (nicotinamide riboside): It is niacin with a riboside group and is a precursor of NAD+. As a supplement (commercially available as Niagen), it is more potent than either niacin or NAD+.
  • NMN (nicotinamide mononucleotide): It is also a direct precursor of NAD+. It can be synthesized from NR by the addition of a phosphate group. It is much more potent as a supplement than either niacin or NAD+. NMN supplements are expensive but have antiaging benefits.
  • NRH (the reduced form of NR): NRH is a powerful, newly discovered metabolite that increases intracellular NAD+ levels 50-fold more than NR (Giroud-Gerbetant et al., Molecular Metabolism 30:192-202, 2019). NRH supplements will probably prove to be highly effective antiaging compounds and should become available after approval by the FDA and Health Canada, probably in several years.
  • NMNH (the reduced form of NMN): NMNH is another newly discovered metabolite. It is a potent NAD+ precursor, and it increases NAD+ levels in the body to a much higher extent and faster than NMN or NR (Zapata-Perez et al., The FASEB Journal 35:e21456, 2021).
  • NMNH and NRH are powerful, newly discovered compounds: NMNH, like NRH, is an exciting newly discovered metabolite with considerable potential to increase cellular NAD+ levels, and probably autophagy, and therefore to prevent metabolic diseases. These compounds may prove to have powerful antiaging properties but may not be commercially available for several years. Can life expectancy be greatly increased when these compounds are consumed?

Pathways for synthesis of different precursors of NAD+

The potency of NRH and NMNH is much greater (~50-fold) than that of NR and NMN. NR and NMN are more potent than niacin and NAD+, but much less potent than NRH and NMNH. All of the precursor compounds are converted into NAD+ in the cell. Currently, NR and NMN are the commercially available supplements.

A model for the connection between NAD+, sirtuins and autophagy

Nutrient limitation increases the level of the essential cellular metabolite NAD+, which is utilized by sirtuins to increase autophagy within cells and tissues. The activity of sirtuins and autophagy decreases during normal aging as well as due to some diseases and correlates with the known reduction in levels of tissue NAD+. From Lee 2019, Experimental & Molecular Medicine.

Overview on procedures to increase NAD+ in the body

  1. Calorie restriction and exercise.
  2. Natural product supplements to increase tissue NAD+ levels:
    • NAD+ precursor supplements (NMN or NR) will increase cellular NAD+ levels.
    • Resveratrol increases sirtuin activity, thereby enabling its use of NAD+ for autophagy.
    • Apigenin, quercetin or luteolin will inhibit the activity of the CD 38 pathway, particularly in the elderly. This will result in an increased level of NAD+ which can be used by sirtuins.
    • Taking the three compounds together should provide greatly enhanced, synergistic anti-aging benefits. Unfortunately, research on the safety and benefits of taking these supplements in combination has not been reported.

A summary of therapeutic approaches to restore NAD+ levels and their impact on health


Ageing is associated with decreased nicotinamide adenine dinucleotide (NAD+) levels in tissues, which promotes or exacerbates ageing-related diseases. Thus, restoring NAD+ levels has emerged as a therapeutic approach to prevent and treat ageing-related diseases and to restore health and vigour during the ageing process. Some potential strategies that boost NAD+ levels include lifestyle changes, such as increasing exercise, reducing caloric intake, eating a healthy diet and following a consistent daily circadian rhythm pattern by conforming to healthy sleeping habits and mealtimes. Another approach is the use of small-molecule inhibitors or activators to boost NAD+ biosynthesis, and the use of dietary supplements, including NAD+ precursors such as nicotinamide mononucleotide (NMN) and nicotinamide riboside (NR). All of these approaches promote increased tissue NAD+ levels and are beneficial for health: they lead to improved tissue and organ function, protection from cognitive decline, improved metabolic health, reduced inflammation and increased physiological benefits, such as increased physical activity, which may collectively extend patient health span and potentially lifespan. [Covarrubias et al., Nature Reviews Molecular Cell Biology 22:119-141, 2021.]

Conclusion

In this blog we provided an overview of some of the recent research on ageing, as well as a list of supplements and lifestyle changes that may theoretically slow the rate of ageing.

Combining Elasticsearch stemmers and synonyms to improve search relevance

The article called The same, but different: Boosting the power of Elasticsearch with synonyms gives a great introduction to why and how you can incorporate synonyms into your Elasticsearch-powered application. Here I build upon that blog and show how you can combine stemmers and multi-word synonyms to take the quality of your search results to the next level.

Motivation

Imagine that you are using Elasticsearch to power a search application for finding books, and in this application you want to treat the following words as synonyms:

  • brainstorm
  • brainstorming
  • brainstormed
  • brain storm
  • brain storming
  • brain stormed
  • envisage
  • envisaging
  • envisaged
  • etc.

It is tedious and error-prone to explicitly define synonyms for all possible conjugations, declensions, and inflections of a word or compound word.

However, it is possible to reduce the size of the list of synonyms by making use of a stemmer to extract the stem of each word before applying synonyms. This would allow us to get the same results as the above synonym list by specifying only the following synonyms:

  • brainstorm
  • brain storm
  • envisage
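
To see why this reduced list is sufficient, you can ask Elasticsearch to run the lowercase and stemmer token filters directly against some sample text. This quick check is not part of the setup described below; it simply shows that the various inflected forms collapse to the same stems that the entries in the reduced synonym list produce:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],
  "text": "brainstorming brainstormed brain storming envisaging"
}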

Custom analyzers

In this section, I show code snippets that define custom analyzers that can be used for matching synonyms. Later on in this blog I show how to submit the analyzers to Elasticsearch.

The blog called The same, but different: Boosting the power of Elasticsearch with synonyms goes into details on the difference between index-time and search-time synonyms. In the solution presented here, I make use of search-time synonyms.

We will create a synonym graph token filter called “my_graph_synonyms”, which can match multi-word synonyms, as follows:

        "filter": {
          "my_graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "mind, brain",
              "brain storm, brainstorm, envisage"
            ]
          }
        }

Next we need to define two separate custom analyzers, one that will be applied to text at index-time, and another that will be applied to text at search-time.

We define an analyzer called “my_index_time_analyzer”, which uses the standard tokenizer along with the lowercase and stemmer token filters, as follows:

      "my_index_time_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "stemmer"
        ]
      }

We define an analyzer called “my_search_time_analyzer”, which also makes use of the standard tokenizer and the lowercase and stemmer token filters (as above). However, it additionally includes our custom token filter called “my_graph_synonyms”, which ensures that synonyms will be matched at search-time:

      "my_search_time_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "stemmer",
          "my_graph_synonyms"
        ]
      }

Mappings

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Each document is a collection of fields, which each have their own data type. In this example we define the mapping for a document with a single field called “my_new_text_field”, which we define as “text”. This field will make use of “my_index_time_analyzer” when documents are indexed, and will make use of “my_search_time_analyzer” when documents are searched. The mapping looks as follows:

  "mappings": {
    "properties": {
      "my_new_text_field": {
        "type": "text",
        "analyzer": "my_index_time_analyzer",
        "search_analyzer": "my_search_time_analyzer"
      }
    }
  }

Bringing it together

Below we bring together our custom analyzers and mappings and apply them to an index called “test_index”, as follows:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "mind, brain",
              "brain storm, brainstorm, envisage"
            ]
          }
        },
        "analyzer": {
          "my_index_time_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stemmer"
            ]
          },
          "my_search_time_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stemmer",
              "my_graph_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_new_text_field": {
        "type": "text",
        "analyzer": "my_index_time_analyzer",
        "search_analyzer": "my_search_time_analyzer"
      }
    }
  }
}

Testing our custom search-time analyzer

If we wish to see how an analyzer tokenizes and normalizes a given string, we can directly call the _analyze API as follows:

POST test_index/_analyze
{
  "text" : "Brainstorm",
  "analyzer": "my_search_time_analyzer"
}
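
For comparison, the same kind of request can be run against the index-time analyzer to see how document text is tokenized and stemmed before it is written to the index. This extra check is purely for illustration:

POST test_index/_analyze
{
  "text" : "Brainstorming sessions",
  "analyzer": "my_index_time_analyzer"
}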

Testing on real documents

We can use the _bulk API to drive several documents into Elasticsearch as follows:

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_new_text_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_new_text_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_new_text_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_new_text_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_new_text_field": "I envisaged something like that" }

After driving the sample documents into “test_index”, we can execute the following search, which correctly responds with documents #1, #2, #3 and #5:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brain storm"
    }
  }
}

We can execute the following search, which correctly returns only documents #2 and #4:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brain"
    }
  }
}

We can execute the following search, which correctly responds with documents #1, #2, #3 and #5:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brainstorming"
    }
  }
}

We can execute the following search, which correctly returns documents #2 and #4:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "mind storm"
    }
  }
}

And finally, we can execute the following search, which also correctly returns only documents #2 and #4:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": {
        "query": "storm brain"
      }
    }
  }
}

Conclusion

In this blog I demonstrated how you can combine stemmers and multi-word synonyms in Elasticsearch to improve the quality of your search results.

Driving Filebeat data into separate indices (uses legacy index templates)

Introduction

When driving data into Elasticsearch from Filebeat, the default behaviour is for all data to be sent to the same destination index regardless of its source. This may not always be desirable, since data from different sources may have different access requirements, different retention policies, or different ingest processing requirements.

In this post, we’ll use Filebeat to send data from separate sources into multiple indices, and then we’ll use index lifecycle management (ILM), legacy index templates, and a custom ingest pipeline to further control that data.

If you are interested in using the newer composable templates and data streams to control your Filebeat data, see the blog How to manage Elasticsearch data across multiple indices with Filebeat, ILM, and data streams.

Driving different data types into different destinations

You may have several different types of data that are collected by Filebeat, such as “source1”, “source2”, etc. As these may have different access/security requirements and/or different retention requirements, it may be useful to drive data from different sources into different Filebeat indices.

Keep in mind that splitting the Filebeat data into different destinations adds complexity to the deployment. For many customers, the default behaviour of driving all Filebeat data into a single destination pattern is acceptable and does not require the custom configuration that we outline below.

Step 1 – Create alias(es)

Each destination “index” that we will specify in Filebeat will actually be an alias so that index lifecycle management (ILM) will work correctly.

We can create an alias that will work with the Filebeat configuration that we give later in this blog, as follows:

PUT filebeat-7.10.2-source1-000001
{
  "aliases": {
    "filebeat-7.10.2-source1": {
      "is_write_index": true
    }
  }
}

You would want to do the same for other data sources (e.g. source2).

In the request above, because the concrete index name (filebeat-7.10.2-source1-000001) includes the version number immediately after the word filebeat, the default template that Filebeat pushes into the cluster (which matches the pattern filebeat-7.10.2-*) will be applied to the index.
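
If you want to confirm that the alias exists and points at the expected write index, one optional way to check (not required for the steps that follow) is the cat aliases API:

GET _cat/aliases/filebeat-7.10.2-*?v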

Step 2 – Define an ILM policy

You should define an index lifecycle management policy (see the Elasticsearch index lifecycle management documentation for instructions).

A single policy can be used by multiple indices, or you can define a new policy for each index. In the next section, I assume that you have created a policy called “filebeat-policy”.
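
As an illustration only, a minimal policy along the following lines could be used if you want to roll indices over based on size or age and eventually delete them. The thresholds shown here are placeholders that you should adapt to your own retention requirements:

PUT _ilm/policy/filebeat-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}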

Step 3 – Ensure that ILM will use the correct rollover alias

In order for ILM to automate the rolling over of indices, we define the rollover alias that will be used. This can be done by creating a higher-order template (which overrides lower-order templates) so that each unique data type will have a unique rollover alias. For example, the following template can be used to ensure that the source1 data rolls over correctly:

PUT _template/filebeat-7.10.2-source1-ilm
{
  "order": 50,
  "index_patterns": [
    "filebeat-7.10.2-source1-*"
  ],
  "settings": {
    "index": {
      "lifecycle": {
        "name": "filebeat-policy",
        "rollover_alias": "filebeat-7.10.2-source1"
      }
    }
  }
}
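
As an optional verification step (not part of the original instructions), you can inspect which ILM policy and rollover alias are associated with a backing index by using the ILM explain API:

GET filebeat-7.10.2-source1-000001/_ilm/explain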

Step 4 – (Optional) Define a custom ingest pipeline

Below is an example of a very simple ingest pipeline that we can use to modify documents that are ingested into Elasticsearch. As we will see in the next section, this can selectively be applied to data from different sources depending on the destination index name.

If the ingest pipeline encounters a failure, then the document that triggered the failure is rejected. This is likely undesirable, and can be improved by including on_failure error handling in the pipeline code, as shown below:

PUT _ingest/pipeline/my_custom_pipeline
{
  "description": "Put your custom ingest pipeline code here",
  "processors": [
    {
      "set": {
        "field": "my-new-field",
        "value": "some new value",
        "on_failure": [
          {
            "set": {
              "field": "error.message",
              "value": "my_custom_pipeline failed to execute set - {{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "error.message",
        "value": "Uncaught failure in my_custom_pipeline: {{ _ingest.on_failure_message }}"
      }
    }
  ]
}

Step 5 – (Optional) Execution of the custom ingest pipeline

The following template will ensure that our ingest pipeline is executed against all documents arriving at indices that match the specified index patterns.

Be sure to modify the index patterns below to include all of the index names that should have our custom pipeline applied.

PUT _template/filebeat-template-to-call-my-custom-processors
{
  "order": 51,
  "index_patterns": [
    "filebeat-*-source1-*",
    "filebeat-*-source2-*"
  ],
  "settings": {
    "index": {
      "final_pipeline": "my_custom_pipeline"
    }
  },
  "mappings": {},
  "aliases": {}
}

Notice that we have used final_pipeline rather than default_pipeline. This is to ensure that calling our custom pipeline does not override any of the default Filebeat pipelines that may be called by various Filebeat modules.

Step 6 – Filebeat code to drive data into different destination indices

The following Filebeat configuration can be used as an example of how to drive documents into different destination index aliases. Note that if the alias does not exist, then Filebeat will create an index with the specified name rather than writing into an alias with that name, which is undesirable. Therefore, the alias should be defined as shown in Step 1 of this blog before running Filebeat with this configuration.

The example below would drive data into an alias (or index) called filebeat-7.10.2-source1 (assuming we are running Filebeat version 7.10.2).

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /tmp/<path to your input>.txt
  fields_under_root: true
  fields:
    data_type: "source1"

setup.template.enabled: true
setup.template.name: "filebeat-%{[agent.version]}"
setup.template.pattern: "filebeat-%{[agent.version]}-*"
setup.template.fields: "fields.yml"
setup.template.overwrite: false
setup.ilm.enabled: false # we handle ILM in the cluster, so not defined here

output.elasticsearch:
  hosts: ["localhost:9200"]
  indices:
    - index: "filebeat-%{[agent.version]}-%{[data_type]}"

In the above example, there are several setup.template settings which ensure that the default Filebeat templates are loaded correctly into the cluster if they do not already exist. See the Filebeat documentation on configuring Elasticsearch index template loading for more information.

Upgrading filebeat

Once you have implemented the above, then when upgrading to a new version of Filebeat you will have to ensure that a new index alias points to the correct underlying indices (re-execute Step 1) and that ILM uses the correct rollover alias (re-execute Step 3). If these steps are done, then the new version of Filebeat should be able to use the same filebeat.yml that we defined above in Step 6 without modification.

Appendix 1 – code for testing the ingest pipeline

The request below will send two documents through the pipeline given in Step 4. This can be used for validating that the pipeline is behaving as expected.

POST _ingest/pipeline/my_custom_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "this is doc 1"
      }
    },
    {
      "_source": {
        "message": "this is doc 2"
      }
    }
  ]
}
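
The exact structure of the simulate response varies by version, but you should see each document as it would look after the pipeline has run, with my-new-field added by the set processor. An abridged sketch of the expected response for the first document looks roughly like this:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_id" : "_id",
        "_source" : {
          "message" : "this is doc 1",
          "my-new-field" : "some new value"
        },
        "_ingest" : {
          "timestamp" : "..."
        }
      }
    },
    ...
  ]
}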

Conclusion

In this post, I showed how to use Filebeat to send data from separate sources into multiple indices, and how to use index lifecycle management (ILM), legacy index templates, and a custom ingest pipeline to further control that data.
