Combining Elasticsearch stemmers and synonyms to improve search relevance

The article The same, but different: Boosting the power of Elasticsearch with synonyms gives a great introduction to why and how you can incorporate synonyms into your Elasticsearch-powered application. Here I build on that article and show how combining stemmers with multi-word synonyms can take the quality of your search results to the next level.

Motivation

Imagine that you are using Elasticsearch to power a search application for finding books, and in this application you want to treat the following words as synonyms:

  • brainstorm
  • brainstorming
  • brainstormed
  • brain storm
  • brain storming
  • brain stormed
  • envisage
  • envisaging
  • envisaged
  • etc.

Explicitly listing every conjugation, declension, and inflection of a word (or of a compound word) as a synonym is tedious and error-prone.

However, we can shrink the synonym list by using a stemmer to reduce each word to its stem before synonyms are applied. This lets us achieve the same results as the list above by specifying only the following synonyms:

  • brainstorm
  • brain storm
  • envisage

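To see this stemming behavior in isolation, we can run Elasticsearch's built-in “english” analyzer (which includes an English stemmer) over a few of the variants; all three tokens should be reduced to the same stem, “brainstorm”:

POST _analyze
{
  "analyzer": "english",
  "text": "brainstorm brainstorming brainstormed"
}
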
Custom analyzers

In this section, I show code snippets that define custom analyzers that can be used for matching synonyms. Later on in this blog I show how to submit the analyzers to Elasticsearch.

The blog called The same, but different: Boosting the power of Elasticsearch with synonyms goes into detail on the differences between index-time and search-time synonyms. In the solution presented here, I make use of search-time synonyms.

First we create a synonym graph token filter called “my_graph_synonyms”, which is able to match multi-word synonyms:

        "filter": {
          "my_graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "mind, brain",
              "brain storm, brainstorm, envisage"
            ]
          }
        }

Next we define two separate custom analyzers: one that will be applied to text at index time, and another that will be applied to text at search time.

We define an analyzer called “my_index_time_analyzer”, which uses the standard tokenizer together with the lowercase token filter and the stemmer token filter (which defaults to English stemming):

      "my_index_time_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "stemmer"
        ]
      }

We define an analyzer called “my_search_time_analyzer”, which also uses the standard tokenizer with the lowercase and stemmer token filters (as above). However, this analyzer additionally includes our custom token filter “my_graph_synonyms”, which ensures that synonyms are matched at search time. Because the synonym filter sits after the stemmer in the chain, Elasticsearch also applies the stemmer to the synonym rules themselves, so a rule entry such as “envisage” matches the stemmed token “envisag”:

      "my_search_time_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "stemmer",
          "my_graph_synonyms"
        ]
      }

Mappings

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Each document is a collection of fields, each of which has its own data type. In this example we define the mapping for a document with a single field called “my_new_text_field” of type “text”. This field uses “my_index_time_analyzer” when documents are indexed, and “my_search_time_analyzer” when documents are searched. The mapping looks as follows:

  "mappings": {
    "properties": {
      "my_new_text_field": {
        "type": "text",
        "analyzer": "my_index_time_analyzer",
        "search_analyzer": "my_search_time_analyzer"
      }
    }
  }

Bringing it together

Below we bring together our custom analyzers and mappings and apply them to an index called “test_index”:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "mind, brain",
              "brain storm, brainstorm, envisage"
            ]
          }
        },
        "analyzer": {
          "my_index_time_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stemmer"
            ]
          },
          "my_search_time_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "stemmer",
              "my_graph_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_new_text_field": {
        "type": "text",
        "analyzer": "my_index_time_analyzer",
        "search_analyzer": "my_search_time_analyzer"
      }
    }
  }
}
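
Once the index has been created, we can confirm that the analysis chain was registered as expected by retrieving the index settings:

GET test_index/_settings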

Testing our custom search-time analyzer

If we wish to see how an analyzer tokenizes and normalizes a given string, we can call the _analyze API directly:

POST test_index/_analyze
{
  "text" : "Brainstorm",
  "analyzer": "my_search_time_analyzer"
}
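
For comparison, we can run the index-time analyzer over one of the multi-word variants. Because “my_index_time_analyzer” contains no synonym filter, the text is only tokenized, lowercased, and stemmed, yielding the two tokens “brain” and “storm”:

POST test_index/_analyze
{
  "text" : "Brain storming",
  "analyzer": "my_index_time_analyzer"
}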

Testing on real documents

We can use the _bulk API to index several documents into Elasticsearch as follows:

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_new_text_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_new_text_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_new_text_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_new_text_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_new_text_field": "I envisaged something like that" }

After indexing the sample documents into “test_index”, we can execute a search for “brain storm”, which correctly responds with documents #1, #2, #3 and #5. Document #4 is not returned because the multi-word synonym is matched as a phrase, and in document #4 the tokens “brain” and “storm” are not adjacent:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brain storm"
    }
  }
}

We can execute the following search for “brain”, which correctly returns only documents #2 and #4. Documents #1 and #3 are not returned because the standalone token “brain” does not match the single indexed token “brainstorm”:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brain"
    }
  }
}

We can execute the following search for “brainstorming”, which the search-time analyzer stems to “brainstorm” before applying synonyms; it correctly responds with documents #1, #2, #3 and #5:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "brainstorming"
    }
  }
}

We can execute the following search for “mind storm”, which correctly returns documents #2 and #4 because the synonym filter expands “mind” to also match “brain”:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": "mind storm"
    }
  }
}

And finally, we can execute the following search for “storm brain”. Because the tokens appear in the wrong order for the multi-word synonym “brain storm” to trigger, no synonym expansion takes place, and the query correctly returns only documents #2 and #4, which contain the individual tokens:

GET test_index/_search
{
  "query": {
    "match": {
      "my_new_text_field": {
        "query": "storm brain"
      }
    }
  }
}

Conclusion

In this blog I demonstrated how you can combine stemmers and multi-word synonyms in Elasticsearch to keep your synonym lists short and maintainable while improving the quality of your search results.