How to configure PKI authentication in Elasticsearch

Introduction

In many organizations it is forbidden to store usernames and passwords in source code or configuration files, which means that the basic native realm username and password authentication in Elasticsearch cannot be used for server-to-server authentication. In such cases one alternative is to use PKI Authentication for authenticating to an Elasticsearch cluster.

Because the documentation for setting up PKI authentication can be overwhelming, in this blog post we provide a step-by-step guide on how to use client TLS/SSL certificates to authenticate to an Elasticsearch cluster. We then show how to configure Kibana to authenticate to an Elasticsearch cluster using only a client certificate, which removes the requirement for embedding a username and password in the “kibana.yml” configuration file.

Enabling security

In order to enable TLS/SSL, it is necessary to have either a Gold or Platinum subscription, or to enable a trial license either via Kibana or via the API.

Once you have enabled a license, you need to enable security. This is done in the elasticsearch.yml file with the following line:

xpack.security.enabled: true

Note that Elasticsearch has two levels of communications: the transport layer and the HTTP layer. The transport layer is used for internal communications between Elasticsearch nodes, and the HTTP layer is used for communications from clients to the Elasticsearch cluster.

Transport layer TLS/SSL encryption

The transport layer is used for communication between nodes within the Elasticsearch cluster. Because each node in an Elasticsearch cluster is both a client and a server to other nodes in the cluster, all transport certificates must be created as both client and server certificates. If TLS/SSL certificates do not have “Extended Key Usage” defined, then they are already both client and server certificates. If transport certificates do have an “Extended Key Usage” section, which is usually the case for “real” certificates used in corporate environments, then they must explicitly enable use as both client and server certificates on the transport layer.

Note that Elasticsearch comes with a utility called elasticsearch-certutil that can be used to generate self-signed certificates for encrypting internal communications within an Elasticsearch cluster.

Using elasticsearch-certutil for encrypting transport communications is a valid option even in corporate environments, because it actually restricts who can join the cluster: the corporate CA will not be able to create certificates that would allow a new node to join the Elasticsearch cluster — only someone with access to the newly created self-signed root CA’s private key will be able to create certificates that allow new nodes to join the cluster.

The following commands can be used to generate certificates for transport communications, as described in this page on Encrypting Communications in Elasticsearch. The ENTER lines below indicate that the default value is accepted at each interactive prompt:

bin/elasticsearch-certutil ca
ENTER ENTER
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
ENTER ENTER ENTER

Once the above commands have been executed, you will have certificates that can be used for encrypting communications, which will be specified in the elasticsearch.yml file as follows:

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: path/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: path/elastic-certificates.p12

HTTP layer TLS/SSL encryption

For HTTP communications, the Elasticsearch nodes will only act as servers and can therefore use server certificates – i.e. HTTP TLS/SSL certificates do not need to enable client authentication.

In many cases, certificates for HTTP communications would be signed by a corporate CA. It is worth noting that the certificates used for encrypting HTTP communications are totally independent of the certificates that are used for transport communications.

To reduce the number of steps in this blog, we use the same certificates for HTTP communications as we use for the transport communications. These are specified in the elasticsearch.yml file as follows:

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: path/elastic-certificates.p12
xpack.security.http.ssl.truststore.path: path/elastic-certificates.p12
xpack.security.http.ssl.client_authentication: optional

Enabling PKI authentication

As discussed in Configuring a PKI Realm, the following must be added to the elasticsearch.yml file to allow PKI authentication.

xpack.security.authc.realms.pki1.type: pki

Combined changes to elasticsearch.yml

Once the above steps have been followed, you should have a section in your elasticsearch.yml file that looks similar to the following:

xpack.security.enabled: true

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: path/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: path/elastic-certificates.p12

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: path/elastic-certificates.p12
xpack.security.http.ssl.truststore.path: path/elastic-certificates.p12
xpack.security.http.ssl.client_authentication: optional

xpack.security.authc.realms.pki1.type: pki
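After restarting your nodes with these settings, it is worth confirming that the cluster responds over HTTPS before moving on. Below is a minimal sketch using Python’s requests library; the password is a placeholder (set one with bin/elasticsearch-setup-passwords if you have not already done so), and certificate verification is disabled because our self-signed certificates do not contain hostnames.

import requests

# Quick HTTPS sanity check after enabling security.
# "changeme" is a placeholder - use the password you set for the elastic user.
resp = requests.get(
    "https://localhost:9200/",
    auth=("elastic", "changeme"),
    verify=False,  # self-signed certificates without hostnames
)
print(resp.status_code, resp.json())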

Creating a client certificate

Certificates that will be used for PKI authentication must be signed by the same CA as the certificates that are used for encrypting the HTTP layer communications. Normally these are signed by an official CA within an organization.

However, as we do not have an official CA, and because we used a self-signed CA to sign the certificates used in the HTTP layer SSL configuration, we also sign our client certificates with that self-signed CA. We can create such a certificate for client authentication as follows:

bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12 \
-name "CN=something,OU=Consulting Team,DC=mydomain,DC=com"
ENTER
client.p12 ENTER
ENTER

The above will create a file called client.p12, which contains all of the information required for PKI authentication to your Elasticsearch cluster. However, in order to use this certificate we need to break it into its private key, public certificate, and CA certificate. This can be done with the following commands; as before, the ENTER lines indicate that an empty password is supplied when prompted:

openssl pkcs12 -in client.p12 -nocerts -nodes | \
sed -ne '/-BEGIN PRIVATE KEY-/,/-END PRIVATE KEY-/p' > client.key
ENTER
openssl pkcs12 -in client.p12 -clcerts -nokeys | \
sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > client.cer
ENTER

openssl pkcs12 -in client.p12 -cacerts -nokeys -chain | \
sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > client-ca.cer
ENTER

This should produce three files:

  1. “client.key” – The private key.
  2. “client.cer” – The public certificate.
  3. “client-ca.cer” – The certificate of the CA that signed the public certificate.

PKI Authentication

We can now use these three files to test PKI authentication to the cluster using curl as demonstrated with the call to the authenticate API below.

curl https://localhost:9200/_xpack/security/_authenticate?pretty \
--key client.key --cert client.cer --cacert client-ca.cer -k -v

Be sure to replace “localhost” with the name of a node in your Elasticsearch cluster. Also note that the -k option is required as we did not create certificates with the hostnames specified, and therefore hostname verification must be turned off.

The above command should respond with something similar to the following:

{
  "username" : "something",
  "roles" : [ ],
  "full_name" : null,
  "email" : null,
  "metadata" : {
    "pki_dn" : "CN=something, OU=Consulting Team, DC=mydomain, DC=com"
  },
  "enabled" : true
}

Notice that “roles” is currently empty, which means that although we have authenticated to Elasticsearch, we are not authorized to perform any actions. The authentication succeeds because the client certificate that we sent to the cluster was signed by the same CA as the HTTP TLS/SSL certificates used by the Elasticsearch nodes. Now that we are authenticated, we need to authorize this user to be able to do something.
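For reference, the same authentication check can also be driven from application code rather than curl. The following minimal sketch uses Python’s requests library and assumes the three files created above are in the current directory; verify=False plays the same role as curl’s -k flag, since our certificates do not contain hostnames.

import requests

# Call the authenticate API using the client certificate and private key.
resp = requests.get(
    "https://localhost:9200/_xpack/security/_authenticate",
    cert=("client.cer", "client.key"),  # public certificate and private key
    verify=False,                       # equivalent of curl's -k flag
)
print(resp.json())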

The “pki_dn” value returned from the authenticate API can now be used to configure the roles that will be assigned to this certificate. This can be done by copying the “pki_dn” value into the “dn” field as follows. This command should be executed by a user with sufficient privileges to modify roles, such as the “elastic” user:

PUT _xpack/security/role_mapping/my_dummy_user
{
  "roles" : [ "kibana_user", "watcher_user" ],
  "rules" : { "field" : { "dn" : "CN=something, OU=Consulting Team, DC=mydomain, DC=com" } },
  "enabled": true
}

Now that we have assigned “kibana_user” and “watcher_user” roles to this certificate, we can again execute a call to the authenticate API as follows:

curl https://localhost:9200/_xpack/security/_authenticate?pretty \
--key client.key --cert client.cer --cacert client-ca.cer -k -v

And we should see the following response, which indicates that we now have the roles that we have just assigned to this certificate.

{
  "username" : "something",
  "roles" : [
    "kibana_user",
    "watcher_user"
  ],
  "full_name" : null,
  "email" : null,
  "metadata" : {
    "pki_dn" : "CN=something, OU=Consulting Team, DC=mydomain, DC=com"
  },
  "enabled" : true
}

Using PKI to secure communications from Kibana to the Elasticsearch cluster

Once you have tested your client-side certificates, you can assign the “kibana_system” role to a certificate as follows. Note that this will overwrite any previous roles assigned to this certificate:

PUT _xpack/security/role_mapping/kibana_certificate
{
  "roles" : [ "kibana_system" ],
  "rules" : { "field" : { "dn" : "CN=something, OU=Consulting Team, DC=mydomain, DC=com" } },
  "enabled": true
}

We can now remove the following lines from our “kibana.yml” file:

elasticsearch.username: "kibana"
elasticsearch.password: "XXXXXX"

And add the following lines to authenticate with the client certificate:

xpack.security.enabled: true
elasticsearch.ssl.certificate: path/client.cer
elasticsearch.ssl.key: path/client.key
elasticsearch.ssl.certificateAuthorities: path/client-ca.cer
elasticsearch.ssl.verificationMode: certificate

You can now restart Kibana, and it should authenticate to your Elasticsearch cluster without the need for an embedded username and password!

Conclusion

In this blog post, we have demonstrated how to enable PKI authentication and how it can be used instead of usernames and passwords to authenticate to an Elasticsearch cluster. We then demonstrated how PKI authentication can eliminate the need for a username and password in the “kibana.yml” file.

Using Logstash to drive filtered data from a single source into multiple output destinations

Overview

In this blog post we demonstrate how Logstash can be used to accomplish the following tasks:

  1. Create multiple copies of an input stream.
  2. Filter each unique copy of the input stream to only contain desired fields.
  3. Drive the modified copies of the input stream into different output destinations.

Note that in this blog post, we do not make use of pipeline-to-pipeline communication (beta), which could likely also achieve much of the functionality described here.

Example input file

As an input to Logstash, we use a CSV file that contains stock market trades. A few example CSV stock market trades are given below. 

1483230600,1628.75,1678.1,1772.8,2443.6
1483232400,1613.63,1688.5,1750.5,2460.2
1483234200,1606.51,1678.6,1718,2448.2
1483236000,1621.04,1684.1,1708.1,2470.4

The comma-separated values represent “time”, “DAX”, “SMI”, “CAC”, and “FTSE”. You may wish to copy and paste the above lines into a CSV file called stocks.csv in order to execute the example Logstash pipeline.

Example Logstash pipeline

Below is a Logstash pipeline that should be stored in a file called ‘clones.conf’. This pipeline does the following:

  1. Reads stock market trades as CSV-formatted input from a CSV file. Note that you should modify ‘clones.conf’ to use the correct path to your ‘stocks.csv’ file.
  2. Maps each row of the CSV input to a JSON document, where the CSV columns map to JSON fields.
  3. Parses the Unix-format time field and uses it as the event timestamp.
  4. Uses the clone filter plugin to create two copies of each document. The clone filter will add a new ‘type’ field to each copy that corresponds to the names given in the clones array. (Note that the original version of each document will still exist in addition to the copies, but will not have a ‘type’ field added to it).
  5. For each copy:
    1. Adds metadata to each document corresponding to the ‘type’ that was added by the clone function. This allows us to later remove the ‘type’ field, while retaining the information required for routing different copies to different outputs.
    2. Uses the prune filter plugin to remove all fields except those which are whitelisted for the specific output.
  6. Removes the ‘type’ field that the clone function inserted into the documents. This is not strictly necessary, but eliminates the ‘type’ data and prevents it from being written to Elasticsearch.
  7. Writes the resulting documents to different outputs, depending on the value defined in the metadata field that we added in step 5.

The pipeline is as follows:
input {
  file {
    path => "${HOME}/stocks.csv"
    start_position => "beginning"

    # The following line will ensure re-reading of input 
    # each time logstash executes.
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    columns => ["time","DAX","SMI","CAC","FTSE"]
    separator => ","
    convert => {
      'DAX'  => 'float'
      'SMI'  => 'float'
      'CAC'  => 'float'
      'FTSE' => 'float'
    }
  }
  date {
    match => ['time', 'UNIX']
  }

  # The following line will create 2 additional 
  # copies of each document (i.e. including the 
  # original, 3 in total). 
  # Each copy will have a "type" field added 
  # corresponding to the name given in the array.
  clone {
    clones => ['copy_only_SMI', 'copy_only_FTSE']
  }

  if [type] == 'copy_only_SMI' {
    mutate { 
      add_field => { "[@metadata][type]" => "copy_only_SMI" } 
    }
    # Remove everything except "SMI"
    prune {
       whitelist_names => [ "SMI"]
    }
  } 

  else if [type] == 'copy_only_FTSE' {
    mutate { 
      add_field => { "[@metadata][type]" => "copy_only_FTSE" } 
    }
    prune {
       whitelist_names => [ "FTSE"]
    }
  } 

  # Remove 'type' which was added in the clone
  mutate {
    remove_field => ['type']
  }
}

output {
  stdout { codec =>  "rubydebug" }

  if [@metadata][type] == 'copy_only_SMI' {
    elasticsearch {
      index => "smi_data"
    }
  }
  else if [@metadata][type] == 'copy_only_FTSE' {
    elasticsearch {
      index => "ftse_data"
    }
  }
  else {
    elasticsearch {
      index => "stocks_original"
    }
  }
}

Testing the Logstash pipeline

To test this pipeline with the example CSV data, you could execute something similar to the following command, modifying it to ensure that you use paths that are correct for your system. Note that specifying ‘config.reload.automatic’ is optional, but allows us to automatically reload ‘clones.conf’ without restarting Logstash. Remember that the ‘clones.conf’ used below is the file that contains the pipeline described in the previous section.

./logstash -f ./clones.conf --config.reload.automatic

Once Logstash has read the stocks.csv file, we can check the various outputs that have been written. Documents have been written to three indices called ‘smi_data’, ‘ftse_data’, and ‘stocks_original’.
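If you prefer to check the results from a script rather than from the Kibana dev console, the following minimal sketch (using the official Python Elasticsearch client against an unsecured local cluster) prints the number of documents in each of the three indices:

from elasticsearch import Elasticsearch

# Assumes an unsecured cluster running on localhost:9200.
es = Elasticsearch(["localhost:9200"])

for index in ["smi_data", "ftse_data", "stocks_original"]:
    count = es.count(index=index)["count"]
    print("%s contains %d documents" % (index, count))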

Check the SMI index

GET /smi_data/_search

This should display documents with the following structure. Notice that only the “SMI” field appears in documents in the ‘smi_data’ index.

      {
        "_index": "smi_data",
        "_type": "doc",
        "_id": "_QRskWUBsYalOV9y9hGJ",
        "_score": 1,
        "_source": {
          "SMI": 1688.5
        }
      }

Check the FTSE index

GET /ftse_data/_search

This should display documents with the following structure. Notice that only the “FTSE” field appears in documents in the ‘ftse_data’ index.

      {
        "_index": "ftse_data",
        "_type": "doc",
        "_id": "AgRskWUBsYalOV9y9hL0",
        "_score": 1,
        "_source": {
          "FTSE": 2448.2
        }
      }

Check the original documents index

GET /stocks_original/_search

This should display documents with the following structure. Notice that the entire original version of the documents appears in the ‘stocks_original’ index.

      {
        "_index": "stocks_original",
        "_type": "doc",
        "_id": "-QRskWUBsYalOV9y9hFo",
        "_score": 1,
        "_source": {
          "host": "Alexanders-MBP",
          "@timestamp": "2017-01-01T00:30:00.000Z",
          "SMI": 1678.1,
          "@version": "1",
          "message": "1483230600,1628.75,1678.1,1772.8,2443.6",
          "CAC": 1772.8,
          "DAX": 1628.75,
          "time": "1483230600",
          "path": "/Users/arm/Documents/ES6.3/datasets/stocks_for_clones.csv",
          "FTSE": 2443.6
        }
      }

Conclusion

In this blog post, we have demonstrated how to use Logstash to create multiple copies of an input stream, modify each copy as required for different outputs, and then drive the different streams into different outputs.

Using Logstash prune capabilities to whitelist sub-documents

Overview

Logstash’s prune filter plugin can make use of whitelists to ensure that only specific desired fields are output from Logstash, and that all other fields are dropped. In this blog post we demonstrate the use of Logstash to whitelist desired fields and desired sub-documents before indexing into Elasticsearch.

Example input file

As an input to Logstash, we use a CSV file that contains stock market trades. A few example CSV stock market trades are given below. 

1483230600,1628.75,1678.1,1772.8,2443.6
1483232400,1613.63,1688.5,1750.5,2460.2
1483234200,1606.51,1678.6,1718,2448.2
1483236000,1621.04,1684.1,1708.1,2470.4

The comma-separated values represent “time”, “DAX”, “SMI”, “CAC”, and “FTSE”. You may wish to copy and paste the above lines into a CSV file called stocks.csv in order to execute the example command line given later in this blog post.

Example Logstash pipeline

Below is a Logstash pipeline that can be stored in a file called ‘stocks.conf’ and that does the following:

  1. Reads stock market trades as CSV-formatted input from stdin.
  2. Maps each row of the CSV input to a JSON document, where the CSV columns map to JSON fields.
  3. Parses the Unix-format time field and uses it as the event timestamp.
  4. Moves DAX and CAC fields into a nested structure called “my_nest”.
  5. Whitelists the “my_nest” field (which contains a sub-document) and the “SMI” field so that all other (non-whitelisted) fields will be removed.
  6. Writes the resulting documents to an Elasticsearch index called “stocks_whitelist_test”.

The pipeline is as follows:
# For this simple example, pipe in data from stdin. 
input {
    stdin {}
}

filter {
    csv {
        columns => ["time","DAX","SMI","CAC","FTSE"]
        separator => ","
        convert => {
            'DAX'  => 'float'
            'SMI'  => 'float'
            'CAC'  => 'float'
            'FTSE' => 'float'
        }
    }
    date {
        match => ['time', 'UNIX']
    }
    mutate {
        # Move DAX and CAC into a sub-document 
        # called 'my_nest'
        rename => {
            "DAX" => "[my_nest][DAX]"
            "CAC" => "[my_nest][CAC]"
        }
    }
     
    # Remove everything except "SMI" and the 
    # "my_nest" sub-document 
    prune {
         whitelist_names => [ "SMI", "my_nest" ]
    }
}

output {
    stdout { codec => dots }
    elasticsearch {
        index => "stocks_whitelist_test"
    }
}

Testing the Logstash pipeline

To test this pipeline with the example CSV data, you could execute something similar to the following command, modifying it to ensure that you use paths that are correct for your system:

cat ./stocks.csv | ./logstash -f ./stocks.conf

You can then check the data that you have stored in Elasticsearch by executing the following command from Kibana’s dev console:

GET /stocks_whitelist_test/_search

Which should display documents with the following structure:

      {
        "_index": "stocks_whitelist_test",
        "_type": "doc",
        "_id": "KANygWUBsYalOV9yOKsD",
        "_score": 1,
        "_source": {
          "my_nest": {
            "CAC": 1718,
            "DAX": 1606.51
          },
          "SMI": 1678.6
        }
      }

Notice that only “my_nest” and “SMI” have been indexed, as indicated by the contents of the document’s “_source”. Also note that the “FTSE” and “time” fields have been removed, as they were not in the prune filter’s whitelist.

Conclusion

In this blog post, we have demonstrated how Logstash’s prune filter plugin can make use of whitelists to ensure that only specific desired fields are output from Logstash.

Deduplicating documents in Elasticsearch

Overview

In this blog post we cover how to detect and remove duplicate documents from Elasticsearch by using either Logstash or alternatively by using custom code written in Python.

Example document structure

For the purposes of this blog post, we assume that the documents in the Elasticsearch cluster have the following structure. This corresponds to a dataset that contains documents representing stock market trades.

    {
      "_index": "stocks",
      "_type": "doc",
      "_id": "6fo3tmMB_ieLOlkwYclP",
      "_version": 1,
      "found": true,
      "_source": {
        "CAC": 1854.6,
        "host": "Alexanders-MBP",
        "SMI": 2061.7,
        "@timestamp": "2017-01-09T02:30:00.000Z",
        "FTSE": 2827.5,
        "DAX": 1527.06,
        "time": "1483929000",
        "message": "1483929000,1527.06,2061.7,1854.6,2827.5\r",
        "@version": "1"
      }
    }

Given this example document structure, for the purposes of this blog we arbitrarily assume that if multiple documents have the same values for the [“CAC”, “FTSE”, “SMI”] fields, then they are duplicates of each other.

Using Logstash for deduplicating Elasticsearch documents

Logstash may be used for detecting and removing duplicate documents from an Elasticsearch index. This process is described in this blog post.

In the example below I have written a simple Logstash configuration that reads documents from an index on an Elasticsearch cluster, then uses the fingerprint filter to compute a unique _id value for each document based on a hash of the [“CAC”, “FTSE”, “SMI”] fields, and finally writes each document back to a new index on that same Elasticsearch cluster such that duplicate documents will be written to the same _id and therefore eliminated.

Additionally, with minor modifications, the same Logstash filter could also be applied to future documents written into the newly created index in order to ensure that duplicates are removed in near real-time. This could be accomplished by changing the input section in the example below to accept documents from your real-time input source rather than pulling documents from an existing index.

Be aware that using custom _id values (i.e. an _id that is not generated by Elasticsearch) will have some impact on the write performance of your index operations. 

Also, it is worth noting that, depending on the hash algorithm used, this approach may theoretically result in a non-zero number of hash collisions for the _id value, which could cause two non-identical documents to be mapped to the same _id. For most practical cases, the probability of a hash collision is likely very low.
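As a rough sanity check, the birthday bound gives an idea of just how unlikely such a collision is. The short calculation below is a simplified estimate that assumes a well-behaved hash, using an index of roughly 125 million documents (the example size used later in this post):

# Approximate probability of at least one collision among n random
# b-bit hashes (birthday bound): p ~= n^2 / 2^(b+1).
n = 125_000_000
for b in (128, 256):
    p = float(n) ** 2 / 2 ** (b + 1)
    print("%d-bit hash: collision probability ~= %.2g" % (b, p))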

A simple Logstash configuration to de-duplicate an existing index using the fingerprint filter is given below.

input {
  # Read all documents from Elasticsearch 
  elasticsearch {
    hosts => "localhost"
    index => "stocks"
    query => '{ "sort": [ "_doc" ] }'
  }
}

filter {
    fingerprint {
        key => "1234ABCD"
        method => "SHA256"
        source => ["CAC", "FTSE", "SMI"]
        # Hash the source fields together rather than only the last one.
        concatenate_sources => true
        target => "[@metadata][generated_id]"
    }
}

output {
    stdout { codec => dots }
    elasticsearch {
        index => "stocks_after_fingerprint"
        document_id => "%{[@metadata][generated_id]}"
    }
}

A custom Python script for deduplicating Elasticsearch documents

A memory-efficient approach

If Logstash is not used, then deduplication may be efficiently accomplished with a custom Python script. For this approach, we compute the hash of the [“CAC”, “FTSE”, “SMI”] fields that we have defined to uniquely identify a document. We then use this hash as a key in a Python dictionary, where the associated value of each dictionary entry will be an array of the _ids of the documents that map to the same hash.

If more than one document has the same hash, then the duplicate documents that map to the same hash can be deleted.  Alternatively, if you are concerned about the possibility of hash collisions, then the contents of documents that map to the same hash can be examined to see if the documents are really identical, and if so, then duplicates can be eliminated.

Detection algorithm analysis

For a 50GB index, if we assume that the index contains 0.4 kB documents, then there would be 125 million documents in the index. In this case, the amount of memory required for storing the deduplication data structures in memory when using a 128-bit md5 hash would be on the order of 128 bits x 125 million = 2GB of memory, plus the 160-bit _ids would require another 160 bits x 125 million = 2.5GB of memory. This algorithm will therefore require on the order of 4.5GB of RAM to keep all relevant data structures in memory. This memory footprint can be dramatically reduced if the approach discussed in the following section can be applied.
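The arithmetic behind this estimate is simple to verify; the short sketch below reproduces the calculation using the assumptions stated above (document count, hash size, and _id size):

# Back-of-the-envelope memory estimate for the in-memory deduplication structures.
index_size_bytes = 50e9       # 50 GB index
avg_doc_size_bytes = 0.4e3    # 0.4 kB per document
num_docs = index_size_bytes / avg_doc_size_bytes   # 125 million documents

md5_bytes_per_doc = 128 / 8   # 128-bit md5 hash
id_bytes_per_doc = 160 / 8    # assumed 160-bit _id

print("hashes: %.1f GB" % (num_docs * md5_bytes_per_doc / 1e9))  # ~2.0 GB
print("_ids:   %.1f GB" % (num_docs * id_bytes_per_doc / 1e9))   # ~2.5 GB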

Potential enhancement to the algorithm to reduce memory usage, as well as to continuously remove new duplicate documents

If you are storing time-series data, and you know that duplicate documents will only occur within some small amount of time of each other, then you may be able to improve this algorithm’s memory footprint by repeatedly executing the algorithm on a subset of the documents in the index, with each subset corresponding to a different time window. For example, if you have a year’s worth of data, then you could use range queries on your datetime field (inside a filter context for best performance) to step through your data set one week at a time. This would require that the algorithm is executed 52 times (once for each week) — and in this case, this approach would reduce the worst-case memory footprint by a factor of 52.

In the above example, you may be concerned about not detecting duplicate documents that span week boundaries. Let’s assume that you know that duplicate documents cannot occur more than 2 hours apart. Then you would need to ensure that each execution of the algorithm includes documents that overlap by 2 hours with the last set of documents analyzed by the previous execution of the algorithm. For the weekly example, you would therefore need to query 170 hours (1 week + 2 hours) worth of time-series documents to ensure that no duplicates are missed.

If you wish to periodically clear out duplicate documents from your indices on an on-going basis, you can execute this algorithm on recently received documents. The same logic applies as above — ensure that recently received documents are included in the analysis along with enough of an overlap with slightly older documents to ensure that duplicates are not inadvertently missed.
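To make the windowing idea concrete, the sketch below steps through the index one week at a time with a two-hour overlap, using a range query on the @timestamp field inside a filter context. The field name, date range, and index handling are assumptions for illustration only; the per-window processing would reuse the scroll-and-hash logic shown in the next section.

from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

window = timedelta(weeks=1)
overlap = timedelta(hours=2)
start = datetime(2017, 1, 1)
end_of_data = datetime(2018, 1, 1)

while start < end_of_data:
    # Each window overlaps the previous one by two hours so that duplicates
    # spanning a window boundary are not missed.
    query = {
        "query": {
            "bool": {
                "filter": {
                    "range": {
                        "@timestamp": {
                            "gte": (start - overlap).isoformat(),
                            "lt": (start + window).isoformat()
                        }
                    }
                }
            }
        },
        "sort": ["_doc"]
    }
    # ... run the scroll + hashing logic from the next section using this query ...
    start += window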

Python code to detect duplicate documents

The following code demonstrates how documents can be efficiently evaluated to see if they are identical, and then eliminated if desired. However, in order to prevent accidental deletion of documents, in this example we do not actually execute a delete operation. Such functionality would be straightforward to include.

The code to deduplicate documents from Elasticsearch can also be found on GitHub.

#!/usr/local/bin/python3
import hashlib
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
dict_of_duplicate_docs = {}

# The following line defines the fields that will be
# used to determine if a document is a duplicate
keys_to_include_in_hash = ["CAC", "FTSE", "SMI"]


# Process documents returned by the current search/scroll
def populate_dict_of_duplicate_docs(hits):
    for item in hits:
        combined_key = ""
        for mykey in keys_to_include_in_hash:
            combined_key += str(item['_source'][mykey])

        _id = item["_id"]

        hashval = hashlib.md5(combined_key.encode('utf-8')).digest()

        # If the hashval is new, then we will create a new key
        # in the dict_of_duplicate_docs, which will be
        # assigned a value of an empty array.
        # We then immediately push the _id onto the array.
        # If hashval already exists, then
        # we will just push the new _id onto the existing array
        dict_of_duplicate_docs.setdefault(hashval, []).append(_id)


# Loop over all documents in the index, and populate the
# dict_of_duplicate_docs data structure.
def scroll_over_all_docs():
    data = es.search(index="stocks", scroll='1m',  body={"query": {"match_all": {}}})

    # Get the scroll ID
    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])

    # Before scroll, process current batch of hits
    populate_dict_of_duplicate_docs(data['hits']['hits'])

    while scroll_size > 0:
        data = es.scroll(scroll_id=sid, scroll='2m')

        # Process current batch of hits
        populate_dict_of_duplicate_docs(data['hits']['hits'])

        # Update the scroll ID
        sid = data['_scroll_id']

        # Get the number of results that returned in the last scroll
        scroll_size = len(data['hits']['hits'])


def loop_over_hashes_and_remove_duplicates():
    # Search through the hash of doc values to see if any
    # duplicate hashes have been found
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
      if len(array_of_ids) > 1:
        print("********** Duplicate docs hash=%s **********" % hashval)
        # Get the documents that have mapped to the current hashval
        matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
        for doc in matching_docs['docs']:
            # In order to remove the possibility of hash collisions,
            # write code here to check all fields in the docs to
            # see if they are truly identical - if so, then execute a
            # DELETE operation on all except one.
            # In this example, we just print the docs.
            print("doc=%s\n" % doc)



def main():
    scroll_over_all_docs()
    loop_over_hashes_and_remove_duplicates()


main()
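If you do decide to act on the detected groups, a small helper along the following lines could replace the print(doc) call above. It is a hypothetical sketch: it compares each document in a group against the first one and returns only the _ids of documents whose _source contents are truly identical, so that hash collisions are never deleted by mistake.

# Hypothetical helper: given the result of the es.mget() call above, return
# the _ids of documents that are identical to the first document in the group
# (these are the safe candidates for deletion).
def find_true_duplicate_ids(matching_docs):
    docs = matching_docs['docs']
    reference_source = docs[0]['_source']
    ids_to_delete = []
    for doc in docs[1:]:
        if doc['_source'] == reference_source:
            ids_to_delete.append(doc['_id'])
    return ids_to_delete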

 

Alexander Marquardt – Patents

Method and system for estimating the reliability of blacklists of botnet-infected computers

Versatile logic element and logic array block

  • US patent 7,218,133
  • Issued May 15, 2007 
  • See patent.

Routing Architecture For A Programmable Logic Device

  • US patent 6,970,014 
  • Issued Nov 29, 2005
  • See patent.

Alexander Marquardt Publications

Book

Architecture and CAD for Deep Sub-Micron FPGAs

Thesis

Cluster-Based Architecture, Timing-Driven Packing, and Timing-Driven Placement for FPGAs

Scientific Papers

The Stratix II Logic and Routing Architecture

The Stratix Routing and Logic Architecture

Speed and Area Trade-Offs in Cluster-Based FPGA Architectures

Timing-Driven Placement for FPGAs

Using Cluster Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density

 

Trade-offs to consider when storing binary data in MongoDB

In this blog post we discuss several methods for storing binary data in MongoDB, and the trade-offs associated with each method.

Introduction

When using MongoDB there are several approaches that make it easy to store and retrieve binary data, but it is not always clear which approach is the most appropriate for a given application. Therefore, in this blog post we discuss several methods for storing binary data when using MongoDB and the trade-offs associated with each method. Many of the trade-offs discussed here would likely apply to most other databases as well.

Overview

In this blog post, we cover the following topics:

  • Definition of binary data
  • Methods for storing binary data with MongoDB
  • General trade-offs to consider when storing binary data in MongoDB
  • Considerations for embedding binary data in MongoDB documents
  • Considerations for storing binary data in MongoDB with GridFS
  • Considerations for storing binary data outside of MongoDB
  • Mitigating some of the drawbacks of storing binary data in MongoDB
  • Conclusion

Definition of binary data

While everything in a computer is technically stored as binary data, in the context of this post when we refer to binary data we are referring to unstructured BLOB data such as PDF documents or JPEG images, in contrast to structured data such as text or integers.

When considering the trade-offs of storing binary data in MongoDB, we assume that the size of the binary data is hundreds of kilobytes or larger (e.g. PDFs, JPEGs, etc.). If smaller binary objects are stored (e.g. small thumbnails), then the drawbacks of storing binary data in MongoDB will be minimal and would be outweighed by the benefits.

Methods for storing binary data with MongoDB

Embed binary data in MongoDB documents using the BSON BinData type

MongoDB enforces a limit of 16MB per document, so if the binary data plus the other fields in a document are guaranteed to be less than 16MB, then the binary data can be embedded into documents by using the BinData BSON type. For more information about using the BinData type, see the documentation for the appropriate driver.
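As an illustration, a minimal sketch of this approach using the Python driver (PyMongo) might look like the following; the database, collection, and file names are placeholders.

from pymongo import MongoClient
from bson.binary import Binary

client = MongoClient("mongodb://localhost:27017")
attachments = client["mydb"]["attachments"]

# Read a (small) binary file and embed it directly in the document
# as a BSON BinData value.
with open("thumbnail.jpg", "rb") as f:
    payload = Binary(f.read())

attachments.insert_one({
    "filename": "thumbnail.jpg",
    "content_type": "image/jpeg",
    "data": payload,
})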

Store binary data in MongoDB using GridFS

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16MB. Instead of storing a file in a single document, GridFS divides the file into chunks, and stores each chunk as a separate document. GridFS uses two collections to store files, where one collection stores the file chunks, and the other stores file metadata as well as optional application-specific fields.

An excellent two-part blog post (part 1 and part 2) gives a good overview of how GridFS works, and examples of when GridFS would be an appropriate storage choice.
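For reference, a minimal PyMongo/GridFS sketch is shown below (the database and file names are placeholders); GridFS transparently handles the chunking described above.

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["mydb"])

# Store a file that may be larger than 16MB; GridFS splits it into chunks.
with open("report.pdf", "rb") as f:
    file_id = fs.put(f, filename="report.pdf")

# Read it back (or stream it in pieces with read(size)).
with open("report_copy.pdf", "wb") as out:
    out.write(fs.get(file_id).read())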

Store binary data in an external system, and use MongoDB to store the location of the binary data

Data may be stored outside of the database, for example in a file system. The document in the database could then contain all the required information to know where to look for the associated binary data (a key, a file name, etc.), and it could also store metadata associated with the binary data.
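A sketch of this pattern with PyMongo is shown below; the path and field names are purely illustrative, and the binary file itself is assumed to be written to the external store by some other part of the application.

import os
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
file_metadata = client["mydb"]["file_metadata"]

# The binary payload lives outside the database (e.g. on a shared filesystem
# or in an object store); MongoDB stores only its location plus metadata.
path = "/mnt/blobstore/invoices/2018/invoice-42.pdf"
file_metadata.insert_one({
    "filename": os.path.basename(path),
    "location": path,
    "content_type": "application/pdf",
})

# Later, an application looks up the document and fetches the payload itself.
doc = file_metadata.find_one({"filename": "invoice-42.pdf"})
# binary_data = open(doc["location"], "rb").read()   # or an S3/HTTP fetch, etc.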

General trade-offs to consider when storing binary data in MongoDB

Benefits of storing binary data in MongoDB

  • MongoDB provides high availability and replication of data.
  • A single system for all types of data results in a simpler application architecture.
  • When using geographically distributed replica sets, MongoDB will automatically distribute data to geographically distinct data centres.
  • Storing data in the database takes advantage of MongoDB authentication and security mechanisms.

Drawbacks of storing binary data in MongoDB

  • In a replica set deployment, data that is stored in the database will be replicated to multiple servers, which uses more storage space and bandwidth than a single copy of the data would require.
  • If the binary data is large, then loading the binary data into memory may cause frequently accessed text (structured data) documents to be pushed out of memory, or more generally, the working set might not fit into RAM. This can negatively impact the performance of the database.
  • If binary data is stored in the database, then most backup methods will back up the binary data at the same frequency as all of the other data in the database. However, if binary data rarely changes then it might be desirable to back it up at a lower frequency than the rest of the data.
  • Storing binary data in the database will make the database larger than it would otherwise be, and if a sharded environment is deployed, this may cause balancing between shards to take more time and to be more resource intensive.
  • Backups of a database containing binary data will require more resources and storage than backups of a database without binary data.
  • Binary data is likely already compressed, and therefore will not gain much benefit from WiredTiger’s compression algorithms.

Considerations for embedding binary data in MongoDB documents

Benefits of embedding binary data in MongoDB documents

  • All of the benefits listed in the General trade-offs section.
  • If binary data that is stored in a document is always used at the same time as the other fields in that document, then all relevant information can be retrieved from the database in a single call, which should provide good performance.
  • By embedding binary data in a document, it is guaranteed that the binary data along with the rest of the document is written atomically.
  • It is simple to embed binary data into documents.

Drawbacks of embedding binary data in MongoDB documents

  • All of the drawbacks listed in the General trade-offs section.
  • There is a limit of 16MB per document. Embedding large binary objects in documents may risk the documents growing beyond this size limit.
  • If the structured metadata in a document is frequently accessed, but the associated embedded binary data is rarely required, then the binary data will be loaded into memory much more often than it is really needed. This is because, in order to read the structured metadata, the entire document, including the embedded binary data, must be loaded into RAM. This needlessly wastes valuable memory resources and may cause the working set to be pushed out of RAM, which may result in increased disk IO as well as slower database response times. This drawback can be addressed by separating the binary data and its associated metadata, as we discuss later in this post.

Considerations for storing binary data in MongoDB with GridFS

Benefits of storing binary data in GridFS

  • All of the benefits listed in the General trade-offs section.
  • If the binary data is large and/or arbitrary in size, then GridFS may be used to overcome the 16MB limit.
  • GridFS doesn’t have the limitations of some filesystems, such as restrictions on the number of files per directory or on file naming.
  • GridFS can be used to recall sections of large files without reading the entire file into memory.
  • GridFS can efficiently stream binary data by loading only a portion of a given binary file into memory at any point in time.

Drawbacks of storing binary data in GridFS

  • All of the drawbacks listed in the General trade-offs section.
  • Versus embedding binary data directly into documents, GridFS will have some additional overhead for splitting, tracking, and stitching together the binary data.
  • Binary data in GridFS is immutable.

Considerations for storing binary data outside of MongoDB

Benefits of storing binary data outside of MongoDB

  • Storing data outside of MongoDB may reduce storage costs by eliminating MongoDB’s replication of data.
  • Binary data outside of MongoDB can be backed up at a different frequency than the database.
  • If binary data is stored in a separate system, then binary data will not be loaded into the same RAM as the database. This removes the risk of binary data pushing the working set out of RAM, and ensures that performance will not be negatively impacted.

Drawbacks of storing binary data outside of MongoDB

  • Adding an additional system for the storage of binary data will add significant operational overhead and system complexity versus using MongoDB for this purpose.
  • Alternative binary data storage systems will not take advantage of MongoDB’s high availability and replication of data.
  • Binary data will not be automatically replicated to geographically distinct data centres.
  • Binary data will not be written atomically with respect to its associated document.
  • Alternative binary data storage systems will not use MongoDB’s authentication and security mechanisms.

Mitigating some of the drawbacks of storing binary data in MongoDB

Separating metadata and embedded binary data into different collections

Instead of embedding binary data and metadata together, it may make sense for binary data to be stored in a collection that is separate from the metadata. This would be useful if metadata is accessed frequently but the associated binary data is infrequently required.

With this approach, instead of embedding the binary data in the metadata collection, the metadata collection would instead contain pointers to the relevant document in a separate binary data collection, and the desired binary data documents would be loaded on-demand. This would allow the documents from the metadata collection to be frequently accessed without wasting memory for the rarely used binary data.
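A minimal PyMongo sketch of this separation is given below (names are placeholders). The metadata document carries only a reference to the payload, so frequent metadata queries never pull the binary data into memory.

from pymongo import MongoClient
from bson.binary import Binary

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Write the rarely used payload to its own collection ...
payload_id = db.binary_payloads.insert_one({
    "data": Binary(open("scan.jpg", "rb").read()),
}).inserted_id

# ... and keep only a reference to it in the frequently accessed metadata.
db.metadata.insert_one({
    "filename": "scan.jpg",
    "content_type": "image/jpeg",
    "payload_id": payload_id,
})

# Frequent path: metadata only - the binary payload is never loaded.
meta = db.metadata.find_one({"filename": "scan.jpg"})

# Rare path: load the payload on demand.
payload = db.binary_payloads.find_one({"_id": meta["payload_id"]})["data"]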

Separating metadata and binary data into different MongoDB deployments

With this approach as in the previous approach, metadata and binary data are separated into two collections, but additionally these collections are stored on separate MongoDB deployments.

This would allow the binary data deployment to be scaled differently than the metadata deployment. Another benefit of this separation would be that the binary data can be backed up with a different frequency than the metadata.

Note that if metadata is separated into a separate deployment from the associated binary data then special care will have to be taken to ensure the consistency of backups.

Conclusion

There are several trade-offs to consider when deciding on the best approach for an application to store and retrieve binary data. In this blog post we have discussed several advantages and disadvantages of the most common approaches for storing binary data when working with MongoDB, as well as some techniques for minimising any drawbacks.