Using Logstash to drive filtered data from a single source into multiple output destinations

Overview

In this blog post we demonstrate how Logstash can be used to accomplish the following tasks:

  1. Create multiple copies of an input stream.
  2. Filter each unique copy of the input stream to only contain desired fields.
  3. Drive the modified copies of the input stream into different output destinations.

Note that in this blog post we do not make use of pipeline-to-pipeline communication (beta), which could likely also achieve much of the functionality described here.

Example input file

As an input to Logstash, we use a CSV file that contains stock market trades. A few example CSV stock market trades are given below. 

1483230600,1628.75,1678.1,1772.8,2443.6
1483232400,1613.63,1688.5,1750.5,2460.2
1483234200,1606.51,1678.6,1718,2448.2
1483236000,1621.04,1684.1,1708.1,2470.4

The comma-separated values represent “time”, “DAX”, “SMI”, “CAC”, and “FTSE”. You may wish to copy and paste the above lines into a CSV file called stocks.csv in order to execute the example Logstash pipeline.

Example Logstash pipeline

Below is a Logstash pipeline that should be stored in a file called ‘clones.conf’. This pipeline does the following:

  1. Reads stock market trades as CSV-formatted input from a CSV file. Note that you should modify ‘clones.conf’ to use the correct path to your ‘stocks.csv’ file.
  2. Maps each row of the CSV input to a JSON document, where the CSV columns map to JSON fields.
  3. Uses the date filter to parse the Unix-format time field and set the document’s @timestamp.
  4. Uses the clone filter plugin to create two copies of each document. The clone filter will add a new ‘type’ field to each copy that corresponds to the names given in the clones array. (Note that the original version of each document will still exist in addition to the copies, but will not have a ‘type’ field added to it).
  5. For each copy:
    1. Adds metadata to each document corresponding to the ‘type’ that was added by the clone filter. This allows us to later remove the ‘type’ field, while retaining the information required for routing different copies to different outputs.
    2. Uses the prune filter plugin to remove all fields except those which are whitelisted for the specific output.
  6. Removes the ‘type’ field that the clone filter inserted into the documents. This is not strictly necessary, but eliminates the ‘type’ data and prevents it from being written to Elasticsearch.
  7. Writes the resulting documents to different outputs, depending on the value defined in the metadata field that we added in step 5.
input {
  file {
    path => "${HOME}/stocks.csv"
    start_position => "beginning"

    # The following line will ensure re-reading of input 
    # each time logstash executes.
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    columns => ["time","DAX","SMI","CAC","FTSE"]
    separator => ","
    convert => {
      'DAX' => 'float'
      'SMI' => 'float'
      'CAC' => 'float'
      'FTSE' => 'float'
    }
  }
  date {
    match => ['time', 'UNIX']
  }

  # The following line will create 2 additional 
  # copies of each document (i.e. including the 
  # original, 3 in total). 
  # Each copy will have a "type" field added 
  # corresponding to the name given in the array.
  clone {
    clones => ['copy_only_SMI', 'copy_only_FTSE']
  }

  if [type] == 'copy_only_SMI' {
    mutate { 
      add_field => { "[@metadata][type]" => "copy_only_SMI" } 
    }
    # Remove everything except "SMI"
    prune {
       whitelist_names => [ "SMI"]
    }
  } 

  else if [type] == 'copy_only_FTSE' {
    mutate { 
      add_field => { "[@metadata][type]" => "copy_only_FTSE" } 
    }
    prune {
       whitelist_names => [ "FTSE"]
    }
  } 

  # Remove 'type' which was added in the clone
  mutate {
    remove_field => ['type']
  }
}

output {
  stdout { codec =>  "rubydebug" }

  if [@metadata][type] == 'copy_only_SMI' {
    elasticsearch {
      index => "smi_data"
    }
  }
  else if [@metadata][type] == 'copy_only_FTSE' {
    elasticsearch {
      index => "ftse_data"
    }
  }
  else {
    elasticsearch {
      index => "stocks_original"
    }
  }
}

Testing the Logstash pipeline

To test this pipeline with the example CSV data, you could execute something similar to the following command, modifying it to ensure that you use paths that are correct for your system. Note that specifying ‘config.reload.automatic’ is optional, but it allows ‘clones.conf’ to be reloaded automatically without restarting Logstash. Remember that the ‘clones.conf’ used below is the file that contains the pipeline described in the previous section.

./logstash -f ./clones.conf --config.reload.automatic

Once Logstash has read the stocks.csv file, we can check the various outputs that have been written. The pipeline writes to three indices called ‘smi_data’, ‘ftse_data’, and ‘stocks_original’.

Check the SMI index

GET /smi_data/_search

This should display documents with the following structure. Notice that only the “SMI” field appears in documents in the ‘smi_data’ index.

      {
        "_index": "smi_data",
        "_type": "doc",
        "_id": "_QRskWUBsYalOV9y9hGJ",
        "_score": 1,
        "_source": {
          "SMI": 1688.5
        }
      }

Check the FTSE index

GET /ftse_data/_search

This should display documents with the following structure. Notice that only the “FTSE” field appears in documents in the ‘ftse_data’ index.

      {
        "_index": "ftse_data",
        "_type": "doc",
        "_id": "AgRskWUBsYalOV9y9hL0",
        "_score": 1,
        "_source": {
          "FTSE": 2448.2
        }
      }

Check the original documents index

GET /stocks_original/_search

This should display documents with the following structure. Notice that the entire original version of each document appears in the ‘stocks_original’ index.

      {
        "_index": "stocks_original",
        "_type": "doc",
        "_id": "-QRskWUBsYalOV9y9hFo",
        "_score": 1,
        "_source": {
          "host": "Alexanders-MBP",
          "@timestamp": "2017-01-01T00:30:00.000Z",
          "SMI": 1678.1,
          "@version": "1",
          "message": "1483230600,1628.75,1678.1,1772.8,2443.6",
          "CAC": 1772.8,
          "DAX": 1628.75,
          "time": "1483230600",
          "path": "/Users/arm/Documents/ES6.3/datasets/stocks_for_clones.csv",
          "FTSE": 2443.6
        }
      }

Conclusion

In this blog post, we have demonstrated how to use Logstash to create multiple copies of an input stream, modify the documents in each stream as required for different outputs, and then drive the different streams into different outputs.

Using Logstash prune capabilities to whitelist sub-documents

Overview

Logstash’s prune filter plugin can make use of whitelists to ensure that only specific desired fields are output from Logstash, and that all other fields are dropped. In this blog post we demonstrate the use of Logstash to whitelist desired fields and desired sub-documents before indexing into Elasticsearch.

Example input file

As an input to Logstash, we use a CSV file that contains stock market trades. A few example CSV stock market trades are given below. 

1483230600,1628.75,1678.1,1772.8,2443.6
1483232400,1613.63,1688.5,1750.5,2460.2
1483234200,1606.51,1678.6,1718,2448.2
1483236000,1621.04,1684.1,1708.1,2470.4

The comma-separated values represent “time”, “DAX”, “SMI”, “CAC”, and “FTSE”. You may wish to copy and paste the above lines into a CSV file called stocks.csv in order to execute the example command line given later in this blog post.

Example Logstash pipeline

Below is a Logstash pipeline that can be stored in a file called ‘stocks.conf’ and that does the following:

  1. Reads stock market trades as CSV-formatted input from stdin.
  2. Maps each row of the CSV input to a JSON document, where the CSV columns map to JSON fields.
  3. Uses the date filter to parse the Unix-format time field and set the document’s @timestamp.
  4. Moves DAX and CAC fields into a nested structure called “my_nest”.
  5. Whitelists the “my_nest” field (which contains a sub-document) and the “SMI” field so that all other (non-whitelisted) fields will be removed.
  6. Writes the resulting documents to an Elasticsearch index called “stocks_whitelist_test”.
# For this simple example, pipe in data from stdin. 
input {
    stdin {}
}

filter {
    csv {
        columns => ["time","DAX","SMI","CAC","FTSE"]
        separator => ","
        convert => {
            'DAX' => 'float'
            'SMI' => 'float'
            'CAC' => 'float'
            'FTSE' => 'float'
        }
    }
    date {
        match => ['time', 'UNIX']
    }
    mutate {
        # Move DAX and CAC into a sub-document 
        # called 'my_nest'
        rename => {
            "DAX" => "[my_nest][DAX]"
            "CAC" => "[my_nest][CAC]"
        }
    }
     
    # Remove everything except "SMI" and the 
    # "my_nest" sub-document 
    prune {
         whitelist_names => [ "SMI", "my_nest" ]
    }
}

output {
    stdout { codec => dots }
    elasticsearch {
        index => "stocks_whitelist_test"
    }
}

Testing the Logstash pipeline

To test this pipeline with the example CSV data, you could execute something similar to the following command, modifying it to ensure that you use paths that are correct for your system:

cat ./stocks.csv | ./logstash -f ./stocks.conf

You can then check the data that you have stored in Elasticsearch by executing the following command from Kibana’s dev console:

GET /stocks_whitelist_test/_search

This should display documents with the following structure:

      {
        "_index": "stocks_whitelist_test",
        "_type": "doc",
        "_id": "KANygWUBsYalOV9yOKsD",
        "_score": 1,
        "_source": {
          "my_nest": {
            "CAC": 1718,
            "DAX": 1606.51
          },
          "SMI": 1678.6
        }
      }

Notice that only “my_nest” and “SMI” have been indexed, as indicated by the contents of the document’s “_source”. Also note that the “FTSE” and “time” fields have been removed, as they were not in the prune filter’s whitelist.

Conclusion

In this blog post, we have demonstrated how Logstash’s prune filter plugin can make use of whitelists to ensure that only specific desired fields are output from Logstash.

Deduplicating documents in Elasticsearch

Overview

In this blog post we cover how to detect and remove duplicate documents from Elasticsearch by using either Logstash or custom code written in Python.

Example document structure

For the purposes of this blog post, we assume that the documents in the Elasticsearch cluster have the following structure. This corresponds to a dataset that contains documents representing stock market trades.

    {
      "_index": "stocks",
      "_type": "doc",
      "_id": "6fo3tmMB_ieLOlkwYclP",
      "_version": 1,
      "found": true,
      "_source": {
        "CAC": 1854.6,
        "host": "Alexanders-MBP",
        "SMI": 2061.7,
        "@timestamp": "2017-01-09T02:30:00.000Z",
        "FTSE": 2827.5,
        "DAX": 1527.06,
        "time": "1483929000",
        "message": "1483929000,1527.06,2061.7,1854.6,2827.5\r",
        "@version": "1"
      }
    }

Given this example document structure, for the purposes of this blog post we arbitrarily assume that if multiple documents have the same values for the [“CAC”, “FTSE”, “SMI”] fields, then they are duplicates of each other.

Using Logstash for deduplicating Elasticsearch documents

Logstash may be used for detecting and removing duplicate documents from an Elasticsearch index. This process is described below.

In the example below I have written a simple Logstash configuration that reads documents from an index on an Elasticsearch cluster, then uses the fingerprint filter to compute a unique _id value for each document based on a hash of the [“CAC”, “FTSE”, “SMI”] fields, and finally writes each document back to a new index on that same Elasticsearch cluster, so that duplicate documents are written to the same _id and are therefore eliminated.

Additionally, with minor modifications, the same Logstash filter could also be applied to future documents written into the newly created index in order to ensure that duplicates are removed in near real-time. This could be accomplished by changing the input section in the example below to accept documents from your real-time input source rather than pulling documents from an existing index.

Be aware that using custom _id values (i.e. an _id that is not generated by Elasticsearch) will have some impact on the write performance of your index operations. 

Also, it is worth noting that, depending on the hash algorithm used, this approach may theoretically result in a non-zero number of hash collisions for the _id value, which could cause two non-identical documents to be mapped to the same _id. For most practical cases, the probability of a hash collision is very low.

A simple Logstash configuration to de-duplicate an existing index using the fingerprint filter is given below.

input {
  # Read all documents from Elasticsearch 
  elasticsearch {
    hosts => "localhost"
    index => "stocks"
    query => '{ "sort": [ "_doc" ] }'
  }
}

filter {
    fingerprint {
        key => "1234ABCD"
        method => "SHA256"
        source => ["CAC", "FTSE", "SMI"]
        target => "[@metadata][generated_id]"
    }
}

output {
    stdout { codec => dots }
    elasticsearch {
        index => "stocks_after_fingerprint"
        document_id => "%{[@metadata][generated_id]}"
    }
}

A custom Python script for deduplicating Elasticsearch documents

A memory-efficient approach

If Logstash is not used, then deduplication may be efficiently accomplished with a custom Python script. For this approach, we compute the hash of the [“CAC”, “FTSE”, “SMI”] fields that we have defined to uniquely identify a document. We then use this hash as a key in a Python dictionary, where the associated value of each dictionary entry is an array of the _ids of the documents that map to the same hash.

If more than one document has the same hash, then the duplicate documents that map to the same hash can be deleted.  Alternatively, if you are concerned about the possibility of hash collisions, then the contents of documents that map to the same hash can be examined to see if the documents are really identical, and if so, then duplicates can be eliminated.

Detection algorithm analysis

For a 50GB index, if we assume that the index contains 0.4 kB documents, then there would be 125 million documents in the index. In this case, the amount of memory required for storing the deduplication data structures when using a 128-bit md5 hash would be on the order of 128 bits x 125 million = 2GB, plus the 160-bit _ids would require another 160 bits x 125 million = 2.5GB. This algorithm will therefore require on the order of 4.5GB of RAM to keep all relevant data structures in memory. This memory footprint can be dramatically reduced if the approach discussed in the following section can be applied.
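
The arithmetic behind these figures can be checked with a few lines of Python. Note that this back-of-the-envelope calculation ignores the overhead of the Python dictionary, keys, and list objects themselves, so the real footprint will be somewhat higher.

# Back-of-the-envelope check of the memory estimate above.
INDEX_BYTES = 50e9                      # 50 GB index (assumed)
DOC_BYTES = 400                         # 0.4 kB average document size (assumed)
num_docs = INDEX_BYTES / DOC_BYTES      # 125 million documents

md5_bytes = 128 // 8                    # 16 bytes per md5 hash
id_bytes = 160 // 8                     # 20 bytes per _id (as assumed above)

hash_gb = num_docs * md5_bytes / 1e9    # ~2.0 GB
id_gb = num_docs * id_bytes / 1e9       # ~2.5 GB
print(num_docs, hash_gb, id_gb, hash_gb + id_gb)  # 125000000.0 2.0 2.5 4.5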

Potential enhancement to the algorithm to reduce memory usage, as well as to continuously remove new duplicate documents

If you are storing time-series data, and you know that duplicate documents will only occur within some small amount of time of each other, then you may be able to improve this algorithm’s memory footprint by repeatedly executing the algorithm on a subset of the documents in the index, with each subset corresponding to a different time window. For example, if you have a year's worth of data, then you could use range queries on your datetime field (inside a filter context for best performance) to step through your data set one week at a time. This would require that the algorithm is executed 52 times (once for each week), and this approach would reduce the worst-case memory footprint by a factor of 52.

In the above example, you may be concerned about not detecting duplicate documents that span week boundaries. Let's assume that you know that duplicate documents cannot occur more than 2 hours apart. In that case, you would need to ensure that each execution of the algorithm includes documents that overlap by 2 hours with the last set of documents analyzed by the previous execution. For the weekly example, you would therefore need to query 170 hours (1 week + 2 hours) worth of time-series documents to ensure that no duplicates are missed.

If you wish to periodically clear out duplicate documents from your indices on an on-going basis, you can execute this algorithm on recently received documents. The same logic applies as above — ensure that recently received documents are included in the analysis along with enough of an overlap with slightly older documents to ensure that duplicates are not inadvertently missed.
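
As a rough sketch of how such a time window could be expressed (this is not part of the original script, and the field name @timestamp, the one-week window, and the two-hour overlap are assumptions based on the example above), the match_all query used in the script below could be replaced with a range query inside a filter context:

from datetime import datetime, timedelta

def build_windowed_query(window_start, window_length=timedelta(days=7),
                         overlap=timedelta(hours=2)):
    # Query one window of time-series data, extended backwards by the
    # overlap so that duplicates spanning the window boundary are caught.
    gte = window_start - overlap
    lt = window_start + window_length
    return {
        "query": {
            "bool": {
                "filter": {
                    "range": {
                        "@timestamp": {
                            "gte": gte.isoformat(),
                            "lt": lt.isoformat()
                        }
                    }
                }
            }
        }
    }

# Example: query the week starting 2017-01-01
# body = build_windowed_query(datetime(2017, 1, 1))

This query body could then be passed to es.search() once per window, clearing dict_of_duplicate_docs between windows.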

Python code to detect duplicate documents

The following code demonstrates how documents can be efficiently evaluated to see if they are identical, and then eliminated if desired. However, in order to prevent accidental deletion of documents, in this example we do not actually execute a delete operation. Such functionality would be straightforward to include, and a possible approach is sketched after the script.

The code to deduplicate documents from Elasticsearch can also be found on GitHub.

#!/usr/local/bin/python3
import hashlib
from elasticsearch import Elasticsearch
es = Elasticsearch(["localhost:9200"])
dict_of_duplicate_docs = {}

# The following line defines the fields that will be
# used to determine if a document is a duplicate
keys_to_include_in_hash = ["CAC", "FTSE", "SMI"]


# Process documents returned by the current search/scroll
def populate_dict_of_duplicate_docs(hits):
    for item in hits:
        combined_key = ""
        for mykey in keys_to_include_in_hash:
            combined_key += str(item['_source'][mykey])

        _id = item["_id"]

        hashval = hashlib.md5(combined_key.encode('utf-8')).digest()

        # If the hashval is new, then we will create a new key
        # in the dict_of_duplicate_docs, which will be
        # assigned a value of an empty array.
        # We then immediately push the _id onto the array.
        # If hashval already exists, then
        # we will just push the new _id onto the existing array
        dict_of_duplicate_docs.setdefault(hashval, []).append(_id)


# Loop over all documents in the index, and populate the
# dict_of_duplicate_docs data structure.
def scroll_over_all_docs():
    data = es.search(index="stocks", scroll='1m',  body={"query": {"match_all": {}}})

    # Get the scroll ID
    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])

    # Before scroll, process current batch of hits
    populate_dict_of_duplicate_docs(data['hits']['hits'])

    while scroll_size > 0:
        data = es.scroll(scroll_id=sid, scroll='2m')

        # Process current batch of hits
        populate_dict_of_duplicate_docs(data['hits']['hits'])

        # Update the scroll ID
        sid = data['_scroll_id']

        # Get the number of results that returned in the last scroll
        scroll_size = len(data['hits']['hits'])


def loop_over_hashes_and_remove_duplicates():
    # Search through the hash of doc values to see if any
    # duplicate hashes have been found
    for hashval, array_of_ids in dict_of_duplicate_docs.items():
      if len(array_of_ids) > 1:
        print("********** Duplicate docs hash=%s **********" % hashval)
        # Get the documents that have mapped to the current hashval
        matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
        for doc in matching_docs['docs']:
            # In order to remove the possibility of hash collisions,
            # write code here to check all fields in the docs to
            # see if they are truly identical - if so, then execute a
            # DELETE operation on all except one.
            # In this example, we just print the docs.
            print("doc=%s\n" % doc)



def main():
    scroll_over_all_docs()
    loop_over_hashes_and_remove_duplicates()


main()
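
As a sketch of what the deletion step might look like (not included in the original script, and assuming that the documents sharing a hash have already been confirmed to be truly identical), all but one of the _ids that share a hash could be removed as follows:

def delete_confirmed_duplicates(array_of_ids):
    # Keep the first document and delete the rest.
    # 'es' is the Elasticsearch client created at the top of the script above,
    # and the index/doc_type values match the example data set.
    for duplicate_id in array_of_ids[1:]:
        es.delete(index="stocks", doc_type="doc", id=duplicate_id)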

 

Alexander Marquardt – Patents

Method and system for estimating the reliability of blacklists of botnet-infected computers

Versatile logic element and logic array block

  • US patent 7,218,133
  • Issued May 15, 2007 
  • See patent.

Routing Architecture For A Programmable Logic Device

  • US patent 6,970,014 
  • Issued Nov 29, 2005
  • See patent.

Alexander Marquardt Publications

Book

Architecture and CAD for Deep Sub-Micron FPGAs

Thesis

Cluster-Based Architecture, Timing-Driven Packing, and Timing-Driven Placement for FPGAs

Scientific Papers

The Stratix II Logic and Routing Architecture

The Stratix Routing and Logic Architecture

Speed and Area Trade-Offs in Cluster-Based FPGA Architectures

Timing-Driven Placement for FPGAs

Using Cluster Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density

 

Trade-offs to consider when storing binary data in MongoDB

In this blog post we discuss several methods for storing binary data in MongoDB, and the trade-offs associated with each method.

Introduction

When using MongoDB there are several approaches that make it easy to store and retrieve binary data, but it is not always clear which approach is the most appropriate for a given application. Therefore, in this blog post we discuss several methods for storing binary data when using MongoDB and the trade-offs associated with each method. Many of the trade-offs discussed here would likely apply to most other databases as well.

Overview

In this blog post, we cover the following topics:

  • Definition of binary data
  • Methods for storing binary data with MongoDB
  • General trade-offs to consider when storing binary data in MongoDB
  • Considerations for embedding binary data in MongoDB documents
  • Considerations for storing binary data in MongoDB with GridFS
  • Considerations for storing binary data outside of MongoDB
  • Mitigating some of the drawbacks of storing binary data in MongoDB
  • Conclusion

Definition of binary data

While everything in a computer is technically stored as binary data, in the context of this post, when we refer to binary data we are referring to unstructured BLOB data such as PDF documents or JPEG images. This is in contrast to structured data such as text or integers.

When considering the trade-offs of storing binary data in MongoDB, we assume that the size of the binary data is hundreds of kilobytes or larger (e.g. PDFs, JPEGs). If smaller binary objects are stored (e.g. small thumbnails), then the drawbacks of storing binary data in MongoDB will be minimal and will be outweighed by the benefits.

Methods for storing binary data with MongoDB

Embed binary data in MongoDB documents using the BSON BinData type

MongoDB enforces a limit of 16MB per document, so if the binary data plus the other fields in a document are guaranteed to be less than 16MB, then the binary data can be embedded into documents by using the BinData BSON type. For more information about using the BinData type, see the documentation for the appropriate driver.
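
As a minimal illustration (not from the original post), a small binary payload could be embedded with PyMongo's Binary type roughly as follows; the database, collection, and file names are made up for the example:

from bson.binary import Binary
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mediadb"]["thumbnails"]  # hypothetical

with open("thumbnail.jpg", "rb") as f:  # hypothetical file, well under 16MB
    image_bytes = f.read()

# Embed the raw bytes directly in the document using the BSON binary type.
collection.insert_one({
    "filename": "thumbnail.jpg",
    "content_type": "image/jpeg",
    "payload": Binary(image_bytes)
})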

Store binary data in MongoDB using GridFS

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16MB. Instead of storing a file in a single document, GridFS divides the file into chunks, and stores each chunk as a separate document. GridFS uses two collections to store files, where one collection stores the file chunks, and the other stores file metadata as well as optional application-specific fields.

An excellent two-part blog post (part 1 and part 2) gives a good overview of how GridFS works, and examples of when GridFS would be an appropriate storage choice.
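
A minimal PyMongo sketch of writing a file through GridFS and reading it back is shown below; the database name and file path are assumptions for the example:

import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mediadb"]  # hypothetical database
fs = gridfs.GridFS(db)

# Store a file: GridFS splits it into chunk documents behind the scenes.
with open("report.pdf", "rb") as f:  # hypothetical file
    file_id = fs.put(f, filename="report.pdf")

# Read the file back by its _id.
pdf_bytes = fs.get(file_id).read()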

Store binary data in an external system, and use MongoDB to store the location of the binary data

Data may be stored outside of the database, for example in a file system. The document in the database could then contain all the required information to know where to look for the associated binary data (a key, a file name, etc.), and it could also store metadata associated with the binary data.
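
As a rough sketch of this pattern (the storage directory, database, and field names are illustrative assumptions), the application might write the binary data to a file system and record only its location and metadata in MongoDB:

import os
from pymongo import MongoClient

file_refs = MongoClient("mongodb://localhost:27017")["mediadb"]["file_refs"]  # hypothetical

def store_externally(raw_bytes, filename, storage_dir="/data/blobs"):  # hypothetical directory
    # Write the binary data outside of MongoDB...
    path = os.path.join(storage_dir, filename)
    with open(path, "wb") as f:
        f.write(raw_bytes)
    # ...and store only its location plus metadata in the database.
    file_refs.insert_one({
        "filename": filename,
        "path": path,
        "size_bytes": len(raw_bytes)
    })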

General trade-offs to consider when storing binary data in MongoDB

Benefits of storing binary data in MongoDB

  • MongoDB provides high availability and replication of data.
  • A single system for all types of data results in a simpler application architecture.
  • When using geographically distributed replica sets, MongoDB will automatically distribute data to geographically distinct data centres.
  • Storing data in the database takes advantage of MongoDB authentication and security mechanisms.

Drawbacks of storing binary data in MongoDB

  • In a replica set deployment, data that is stored in the database will be replicated to multiple servers, which uses more storage space and bandwidth than a single copy of the data would require.
  • If the binary data is large, then loading the binary data into memory may cause frequently accessed text (structured data) documents to be pushed out of memory, or more generally, the working set might not fit into RAM. This can negatively impact the performance of the database.
  • If binary data is stored in the database, then most backup methods will back up the binary data at the same frequency as all of the other data in the database. However, if binary data rarely changes then it might be desirable to back it up at a lower frequency than the rest of the data.
  • Storing binary data in the database will make the database larger than it would otherwise be, and if a sharded environment is deployed, this may cause balancing between shards to take more time and to be more resource intensive.
  • Backups of a database containing binary data will require more resources and storage than backups of a database without binary data.
  • Binary data is likely already compressed, and therefore will not gain much benefit from WiredTiger’s compression algorithms.

Considerations for embedding binary data in MongoDB documents

Benefits of embedding binary data in MongoDB documents

  • All of the benefits listed in the General trade-offs section.
  • If binary data that is stored in a document is always used at the same time as the other fields in that document, then all relevant information can be retrieved from the database in a single call, which should provide good performance.
  • By embedding binary data in a document, it is guaranteed that the binary data along with the rest of the document is written atomically.
  • It is simple to embed binary data into documents.

Drawbacks of embedding binary data in MongoDB documents

  • All of the drawbacks listed in the General trade-offs section.
  • There is a limit of 16MB per document. Embedding large binary objects in documents may risk the documents growing beyond this size limit.
  • If the structured metadata in a document is frequently accessed, but the associated embedded binary data is rarely required, then the binary data will be loaded into memory much more often than it is really needed. This is because, in order to read the structured metadata, the entire document, including the embedded binary data, must be loaded into RAM. This needlessly wastes valuable memory resources and may cause the working set to be pushed out of RAM, which may result in increased disk IO as well as slower database response times. This drawback can be addressed by separating the binary data and the associated metadata, as we discuss later in this document.

Considerations for storing binary data in MongoDB with GridFS

Benefits of storing binary data in GridFS

  • All of the benefits listed in the General trade-offs section.
  • If the binary data is large and/or arbitrary in size, then GridFS may be used to overcome the 16MB limit.
  • GridFS doesn’t have the limitations of some filesystems, such as limits on the number of files per directory or file-naming rules.
  • GridFS can be used to recall sections of large files without reading the entire file into memory.
  • GridFS can efficiently stream binary data by loading only a portion of a given binary file into memory at any point in time.

Drawbacks of storing binary data in GridFS

  • All of the drawbacks listed in the General trade-offs section.
  • Versus embedding binary data directly into documents, GridFS will have some additional overhead for splitting, tracking, and stitching together the binary data.
  • Binary data in GridFS is immutable.

Considerations for storing binary data outside of MongoDB

Benefits of storing binary data outside of MongoDB

  • Storing data outside of MongoDB may reduce storage costs by eliminating MongoDB’s replication of data.
  • Binary data outside of MongoDB can be backed up at a different frequency than the database.
  • If binary data is stored in a separate system, then the binary data will not be loaded into the same RAM as the database. This removes the risk of binary data pushing the working set out of RAM and negatively impacting the performance of the database.

Drawbacks of storing binary data outside of MongoDB

  • Adding an additional system for the storage of binary data will add significant operational overhead and system complexity versus using MongoDB for this purpose.
  • Alternative binary data storage systems will not take advantage of MongoDB’s high availability and replication of data.
  • Binary data will not be automatically replicated to geographically distinct data centres.
  • Binary data will not be written atomically with respect to its associated document.
  • Alternative binary data storage systems will not use MongoDB’s authentication and security mechanisms.

Mitigating some of the drawbacks of storing binary data in MongoDB

Separating metadata and embedded binary data into different collections

Instead of embedding binary data and metadata together, it may make sense for binary data to be stored in a collection that is separate from the metadata. This would be useful if metadata is accessed frequently but the associated binary data is infrequently required.

With this approach, instead of embedding the binary data in the metadata collection, the metadata collection would instead contain pointers to the relevant document in a separate binary data collection, and the desired binary data documents would be loaded on-demand. This would allow the documents from the metadata collection to be frequently accessed without wasting memory for the rarely used binary data.
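
A rough PyMongo sketch of this separation is shown below; the collection names and the binary_ref field are illustrative assumptions:

from bson.binary import Binary
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mediadb"]  # hypothetical database
metadata_coll = db["file_metadata"]
binary_coll = db["file_blobs"]

def insert_with_separated_binary(filename, raw_bytes):
    # Store the rarely used binary payload in its own collection...
    blob_id = binary_coll.insert_one({"payload": Binary(raw_bytes)}).inserted_id
    # ...and keep only a reference to it in the frequently accessed metadata.
    metadata_coll.insert_one({"filename": filename, "binary_ref": blob_id})

def load_binary_on_demand(filename):
    # Fetch the metadata first, then follow the reference only when needed.
    meta = metadata_coll.find_one({"filename": filename})
    return binary_coll.find_one({"_id": meta["binary_ref"]})["payload"]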

Separating metadata and binary data into different MongoDB deployments

With this approach, as in the previous one, metadata and binary data are separated into two collections, but these collections are additionally stored on separate MongoDB deployments.

This would allow the binary data deployment to be scaled differently than the metadata deployment. Another benefit of this separation would be that the binary data can be backed up with a different frequency than the metadata.

Note that if metadata is separated into a separate deployment from the associated binary data then special care will have to be taken to ensure the consistency of backups.

Conclusion

There are several trade-offs to consider when deciding on the best approach for an application to store and retrieve binary data. In this blog post we have discussed several advantages and disadvantages of the most common approaches for storing binary data when working with MongoDB, as well as some techniques for minimising any drawbacks.

How to generate unique identifiers for use with MongoDB

Various techniques for generating custom unique identifiers in MongoDB


This blog article has been re-published with my permission by MongoDB at https://www.mongodb.com/blog/post/generating-globally-unique-identifiers-for-use-with-mongodb

Motivation

By default, MongoDB generates a unique ObjectID identifier that is assigned to the _id field in a new document before writing that document to the database. In many cases the default unique identifiers assigned by MongoDB will meet application requirements. However, in some cases an application may need to create custom unique identifiers, such as:

  • The application may require unique identifiers with a precise number of digits. For example, unique 12 digit identifiers might be required for bank account numbers.
  • Unique identifiers may need to be generated in a monotonically increasing and continuous sequential order.
  • Unique identifiers may need to be independent of a specific database vendor.

Due to the multi-threaded and distributed nature of modern applications, it is not always a straightforward task to generate unique identifiers that satisfy application requirements.

Overview

This posting covers the following topics:

  • How to guarantee identifier uniqueness at the database level
  • Use ObjectID as a unique identifier
  • Use a single counter document for generating unique identifiers
  • Use a single counter document that allocates batches of unique identifiers
  • Use multiple counter documents that allocate batches of unique identifiers
  • Randomly generate a unique identifier and retry if it is already assigned
  • Use a standard UUID algorithm for application-level unique identifier generation

How to guarantee identifier uniqueness at the database level

All of the approaches that we propose here generate identifiers that are globally unique for all practical purposes; however, depending on the chosen approach, there may be remote edge cases where the generated identifier is not absolutely globally unique.

If the risk of a clash of unique identifiers is deemed to be sufficiently remote or the consequences of such a clash are minor, then one may consider writing to the database without additional uniqueness checking. This is a valid approach that is likely to be adopted by many applications.

However, if it is absolutely essential to enforce that the generated identifier is globally unique, code may be written defensively to guarantee that the database will never write the same unique identifier twice. This is relatively easy to implement on a non-sharded collection by specifying a unique index on a particular field, which prevents a duplicate value for that field from ever being written to the collection. Note that by default the _id field always obeys the unique index constraint.

If a write fails due to a collision on a field that has a unique index, then the application should catch this error and generate a new unique identifier and try to write the document to the database again.

If a collection is sharded, there are restrictions on the use of unique indexes. If the restrictions for using a unique index on a sharded collection cannot be satisfied, then a proxy collection may be used to guarantee uniqueness before writing data to the database.

Use ObjectID as a unique identifier

Description

MongoDB database drivers by default generate an ObjectID identifier that is assigned to the _id field of each document. If a document arrives to the database without an _id value, then the database itself will assign an ObjectID to the _id field. In many cases the ObjectID may be used as a unique identifier in an application.

ObjectID is a 96-bit number which is composed as follows:

  • a 4-byte value representing the seconds since the Unix epoch (which will not run out of seconds until the year 2106),
  • a 3-byte machine identifier (usually derived from the MAC address),
  • a 2-byte process id, and
  • a 3-byte counter, starting with a random value.

Benefits

  • ObjectID is automatically generated by the database drivers or in some cases by the database, and will be assigned to the _id field of each document.
  • ObjectID can be considered globally unique for most practical purposes.
  • ObjectID encodes the timestamp of its creation time, which may be used for queries or to sort by creation time.
  • ObjectID is mostly monotonically increasing.
  • ObjectID is 96 bits, which is smaller than some other UUID implementations (for example, 128-bit RFC 4122 UUIDs), and will therefore result in slightly less disk and RAM usage than those alternatives.

Drawbacks

  • At 96 bits, the ObjectId is longer than some of the alternative solutions, which means that it may require slightly more disk space and RAM than these alternatives.
  • The ObjectID is generated by MongoDB drivers or by the database itself. Some businesses may be reluctant to link their application logic to an identifier that is generated by the database.
  • ObjectID can be considered globally unique for most practical purposes, but there are edge cases where it may not be truly globally unique.

Single counter document solution

Disclaimer

The approach described in this section is generally not recommended due to the potential for the single counter document to become a bottleneck in the application.

Description

A unique identifier may be required in a monotonically increasing and continuous sequential order, which is similar to the sequence functionality that is implemented by some RDBMSs.

This may be achieved by implementing a solution that is based on a centralised counter document that resides in a collection that we have called uniqueIdentifierCounter. This centralised counter document is the only document in the uniqueIdentifierCounter collection, and its COUNT field will track the current unique identifier value.

Each time a new unique identifier is needed, findAndModify is used to atomically increment the COUNT field of the counter document and return the pre-increment (original) document, whose COUNT value is then unique. The COUNT value can then be used by the application as a unique identifier. For example, the application could assign the COUNT value to the _id field of a document that will be written to a given collection.

Example implementation

A counter document for unique identifier generation could look as follows:

{
    "_id"   : "UNIQUE COUNT DOCUMENT IDENTIFIER",
    "COUNT" : 0,
    "NOTES" : "Increment COUNT using findAndModify to ensure that the COUNT field will be incremented atomically with the fetch of this document"
}

And the unique identifier-generation document could be atomically requested and incremented as follows. Note that by default the document returned from findAndModify is the pre-modification document:

db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: { COUNT: 1 },
    },
    writeConcern: 'majority'
})
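
For reference, a hedged PyMongo equivalent of the shell command above might look as follows; find_one_and_update returns the pre-update document by default, and the database name appdb is an assumption:

from pymongo import MongoClient, ReturnDocument

counters = MongoClient("mongodb://localhost:27017")["appdb"]["uniqueIdentifierCounter"]

def next_unique_identifier():
    # Atomically increment COUNT and return the document as it was before
    # the increment; its COUNT value becomes the new unique identifier.
    doc = counters.find_one_and_update(
        {"_id": "UNIQUE COUNT DOCUMENT IDENTIFIER"},
        {"$inc": {"COUNT": 1}},
        return_document=ReturnDocument.BEFORE
    )
    return doc["COUNT"]

To mirror the ‘majority’ write concern used in the shell example, the collection could be obtained with something like db.get_collection('uniqueIdentifierCounter', write_concern=WriteConcern('majority')).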

Benefits

  • Easy to implement
  • Unique identifiers are generated in a continuous and monotonically increasing manner.

Drawbacks

  • This approach will likely generate a serious bottleneck in the system, as there will be contention caused by many threads simultaneously accessing the single counter document.
  • Depending on replication lag and the time to flush the counter document to disk, this technique will limit the speed of unique identifier generation. If we assume that it takes 25ms for the counter document to be persisted and replicated, then this method would only be able to generate 40 new unique identifiers per second. If the application is waiting for unique identifier values before new documents can be inserted into a given collection, then these inserts will have a maximum write speed of 40 documents per second. Without such a bottleneck, we would expect a well-functioning database to be able to write tens of thousands of documents per second.

Single counter document with range allocation

Description

This approach is similar to the previous approach, with the difference being that instead of incrementing the COUNT value by 1, we may wish to increment it by a larger number that will represent a batch of unique identifiers that will be allocated by the database to the application.

For example, if the application knows that it needs 1000 new unique identifiers, then the application would use findAndModify() to atomically get the current COUNT and increment the COUNT value by 1000. The document returned from the findAndModify command would contain the starting value for the batch of unique identifiers, and the application would loop over 1000 values from that starting point.

Note that with this approach an application may pass in whatever value it wishes for incrementing the COUNT value, and therefore this approach can be made to be flexible. For large batches this increment would be a large number, and for a single identifier this would be set to 1.

Example implementation

The following demonstrates the JavaScript shell commands that would atomically increment the COUNT by 1000 and return the previous (pre-increment) counter document:

var seq_increment = 1000;
db.uniqueIdentifierCounter.findAndModify({
    query: { _id: "UNIQUE COUNT DOCUMENT IDENTIFIER" },
    update: {
        $inc: {COUNT: seq_increment },
    },
    writeConcern: 'majority'
})
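
On the application side, the COUNT value returned by this command is only the start of a reserved range. A rough PyMongo sketch of allocating and then consuming a batch (the database name and batch size are assumptions) might look like this:

from pymongo import MongoClient, ReturnDocument

counters = MongoClient("mongodb://localhost:27017")["appdb"]["uniqueIdentifierCounter"]

def allocate_identifier_batch(batch_size=1000):
    # Atomically reserve a contiguous range of identifiers.
    doc = counters.find_one_and_update(
        {"_id": "UNIQUE COUNT DOCUMENT IDENTIFIER"},
        {"$inc": {"COUNT": batch_size}},
        return_document=ReturnDocument.BEFORE
    )
    start = doc["COUNT"]
    return range(start, start + batch_size)  # start .. start + batch_size - 1

# The application then iterates over the reserved range:
for unique_id in allocate_identifier_batch(1000):
    pass  # e.g. assign unique_id to the _id field of a new document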

Benefits

  • Relatively easy to implement.
  • Unique identifiers are generated by the database in a monotonically increasing manner, and will likely be used by the application in a mostly monotonically increasing manner.
  • This approach has the potential to dramatically reduce the number of requests and updates to the counter document, which may eliminate the bottleneck described in the previous approach.

Drawbacks

  • There is potential for bottlenecks if the application requests small unique identifier ranges with each request.
  • The application must understand that the number received from the database is meant as a range of values.
  • Multiple threads requesting batches of unique identifiers at a similar moment in time could cause the allocated identifiers to be used by the application in an order that is not strictly monotonically increasing, as each thread takes time to work through its allocated batch.
  • If the application misbehaves, then it could burn through a large number of unique identifiers without actually using them.

Multiple range allocator counter documents

Description

This approach is similar to the previous approach, but instead of having a single counter document, we could have many counter documents stored in the uniqueIdentifierCounter collection.

For example, there may be 1000 counter documents (numbered 0 to 999) each responsible for allocating 1 billion unique numbers that are taken from a specific range that has been allocated to each counter. In this case, counter 499 would be responsible for allocating values from 499000000000 to 499999999999.

Note that this particular example results in unique numbers ranging from 0 to 999,999,999,999 which is a 12 digit number.

Example implementation

Below we show the format and initial values assigned to the counter documents in the uniqueIdentifierCounter collection:

/* The following would be the initial state of the 0th counter document, which is responsible for the range of unique identifiers from 0 to 999,999,999 */
{
    "_id"  : "COUNTER DOCUMENT NUMBER 0",
    "MAX_VALUE": 999999999,
    "COUNT"  : 0,
    "NOTES" : "Increment COUNT using findAndModify to ensure that the COUNT field will be incremented atomically with the fetch of this document",
}
 
/* The following would be the initial state of 499th counter document, which is responsible for the range of unique identifiers from 499,000,000,000 to 499,999,999,999 */
{
    "_id"  : "COUNTER DOCUMENT NUMBER 499"),
    "MAX_VALUE": 499999999999,
    "COUNT"  : 499000000000,
    "NOTES" : "Increment COUNT using findAndModify to ensure that the COUNT field will be incremented atomically with the fetch of this document",
}
/* Etc… */

With this approach, each time the application needs a new unique number or a range of unique numbers, the application would randomly generate a number between 0 and 999 which it would use to perform a query against the _id attribute in the uniqueIdentifierCounter collection. This would select a particular counter document from which to retrieve the unique identifier(s).

This is demonstrated in the following example, in which we randomly select one of the counter documents, and request a batch of 100 unique numbers from that counter:

var which_counter_to_query = Math.floor((Math.random()*1000));
var seq_increment = 100;
db.uniqueIdentifierCounter.findAndModify({
    query: { _id:  "COUNTER DOCUMENT NUMBER " + which_counter_to_query},
    update: {
        $inc: {COUNT: seq_increment },
    },
    writeConcern: 'majority'
})

Benefits

  • Compared to the previous approaches, this approach will reduce contention by having relatively fewer threads simultaneously accessing each counter document.

Drawbacks

  • This is more complicated than the previous implementations.
  • There is still the potential for bottlenecks if this approach is used for generating small batches and has a small number of counter documents.
  • The number of counter documents must be predefined and the range of unique identifiers assigned to each counter document needs to be allocated in advance.
  • Care must be taken to ensure that the pre-defined range that is assigned to each counter document does not roll-over.

Randomly choose a unique identifier in the application and retry if it is already used

Description

This approach relies on the fact that if a unique index is defined on a field in a collection, then any document written to that collection must have a unique value in that field in order for the write to succeed. Therefore, we can randomly generate an identifier number in the application, assign it to the indexed field in a document, and try to write that document to the collection. If the write succeeds, then we know that the value we assigned is unique. If the write fails, then the application must catch the failure, randomly generate a new identifier, and try to write the document again. If the number of collisions is low, then this can be an efficient way to write documents to the database with each document having a guaranteed unique identifier.

Example implementation

If we know that we will have a maximum of one billion records in our database, then in order to have a low probability of selecting an identifier that has already been assigned (i.e. a collision), we could randomly choose a number between 0 and 999,999,999,999 (from zero to one trillion minus one). For this example, the range from which we select the random number is 1000 times bigger than the number of documents that we expect to write to this collection, which results in a worst-case expected 0.1% chance of a collision.

We would then assign the randomly generated number to a field that has a unique index, and write that document to the database. If the write succeeds, then we know that the identifier is unique. If the write fails, then the application must randomly choose another identifier and try again until the write succeeds. Note that there are some restrictions on the use of unique indexes in sharded clusters. If the restrictions for using a unique index on a sharded collection cannot be satisfied, then a proxy collection may be used to guarantee uniqueness.
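
A rough PyMongo sketch of this generate-and-retry loop is given below; the collection name, the account_number field, the identifier range, and the pre-existing unique index on that field are all assumptions for illustration:

import random
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

accounts = MongoClient("mongodb://localhost:27017")["appdb"]["accounts"]  # hypothetical
# Assumes a unique index already exists, e.g.:
# accounts.create_index("account_number", unique=True)

def insert_with_random_identifier(document, max_retries=5):
    for _ in range(max_retries):
        document["account_number"] = random.randrange(0, 1_000_000_000_000)
        try:
            accounts.insert_one(document)
            return document["account_number"]
        except DuplicateKeyError:
            # Collision with an existing identifier: pick a new one and retry.
            document.pop("_id", None)  # insert_one adds _id; clear it before retrying
    raise RuntimeError("could not generate a unique identifier")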

Benefits

  • This approach is not difficult to implement.

Drawbacks

  • Some businesses do not like the random nature of this approach to unique identifier generation.
  • Testing unique identifier collisions may be tricky since it relies on random numbers.
  • If the random number generator is not good then there may be a potential for multiple collisions and multiple retries required before successfully writing a document.

Generate unique identifiers in the application

Description

RFC 4122 defines a standard for generating 128-bit UUIDs. The RFC specifies different algorithms for generating UUIDs, which are called UUID versions.

Standard libraries exist that will generate 128-bit UUIDs at the application level. The BSON specification, which can be found at http://bsonspec.org/spec.html, defines UUID as a valid binary subtype.
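
For example (a sketch only), Python's standard uuid module can generate a version 4 (random) RFC 4122 UUID entirely in application code; storing it as a string, as shown below, is just one option, and the collection name is made up for the example:

import uuid
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["appdb"]["orders"]  # hypothetical

# Generate a 128-bit version 4 UUID in application code (no database round trip).
new_id = uuid.uuid4()

# One simple option is to store its canonical string form as the _id.
orders.insert_one({"_id": str(new_id), "status": "created"})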

Benefits

  • This approach does not suffer from the counter-document bottlenecks described in the previous approaches.
  • UUIDs are fully generated in application code, which saves a round-trip to the database that is required with some of the alternative approaches.
  • This is a standard implementation that relies on existing libraries.
  • This does not use an internally generated database identifier for business functions.
  • For all practical purposes, UUIDs generated with this approach are globally unique.
  • If RFC 4122 version 1 or version 2 is used and is properly implemented, then the generated UUIDs are guaranteed to be globally unique.

Drawbacks

  • These unique identifiers are 128-bits, which is longer than some alternative approaches. This results in slightly larger documents, and therefore uses slightly more disk space and memory.
  • For RFC 4122 algorithm versions other than version 1 and version 2, it is theoretically possible to generate a UUID that is not absolutely globally unique. However, for all practical purposes the resulting UUIDs are globally unique. More information can be found in the Wikipedia article about UUID generation.