Elasticsearch Analysis: Diving Deep into Text Processing
Published on: 28/08/2025
Elasticsearch analysis is the process of converting text into a searchable and indexable format. This involves breaking down text into individual terms, normalizing those terms, and removing irrelevant information. Understanding analysis is crucial for building effective Elasticsearch search applications. This article explores the fundamental concepts and implementation of Elasticsearch analysis.
Fundamental Concepts / Prerequisites
Before diving into Elasticsearch analysis, it's important to understand the following concepts:
- Documents and Fields: Elasticsearch stores data in JSON documents, which contain fields. Analysis is primarily applied to text fields.
- Indexing: The process of creating a searchable index from documents. Analysis occurs during indexing.
- Querying: The process of searching the index for documents that match a query. Analysis also occurs during querying.
- Analyzers: The core components responsible for performing analysis. An analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters.
- Character Filters: Pre-processing steps that modify the text before tokenization. Examples include HTML stripping and character replacement.
- Tokenizers: Break the text into individual tokens (usually words). Examples include the standard, whitespace, and keyword tokenizers.
- Token Filters: Post-processing steps that modify the tokens after tokenization. Examples include lowercasing, stop word removal, and stemming. (A short `_analyze` example combining these components follows this list.)
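As a quick illustration of how these components combine, the `_analyze` API can run an ad hoc pipeline without creating an index. The request below is just one possible combination, using the built-in `html_strip` character filter, the `standard` tokenizer, and the `lowercase` token filter:

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK brown fox</p>"
}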
Configuring an Analyzer
This section demonstrates how to define and use a custom analyzer in Elasticsearch. We'll create an analyzer that lowercases text, removes stop words, and applies stemming.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text_field": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "my_text_field": "The QUICK brown FOX jumped over the lazy DOGS."
}

GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "The QUICK brown FOX jumped over the lazy DOGS."
}
Code Explanation
The first part of the code defines an index named `my_index` with custom settings for analysis.
The `"analysis"` section configures the analyzer. We define `my_custom_analyzer` with the type `"custom"`. This tells Elasticsearch that we want to build our analyzer from components. We specify the `"tokenizer"` as `"standard"`, which splits the text into words based on whitespace and punctuation.
The `"filter"` array defines the token filters to apply. `"lowercase"` converts all tokens to lowercase. `"stop"` removes common words like "the" and "a" (stop words). `"porter_stem"` applies the Porter stemming algorithm, which reduces words to their root form (e.g., "jumped" becomes "jump").
The `"mappings"` section defines the structure of documents within the index. The `"my_text_field"` field is of type `"text"` and is associated with the `my_custom_analyzer`. This means that when documents are indexed, the text in this field will be analyzed using our custom analyzer.
The second part indexes a sample document. The `"my_text_field"` contains the sentence "The QUICK brown FOX jumped over the lazy DOGS."
The third part uses the `_analyze` endpoint to analyze the provided text with our custom analyzer. The response will show the tokens generated after the analysis process, which are "quick", "brown", "fox", "jump", "over", "lazi", "dog".
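For reference, the `_analyze` response returns each token together with its character offsets and position. An abridged response for the request above looks roughly like this (positions start at 1 because the stop word "the" occupied position 0 before it was removed):

{
  "tokens": [
    { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 },
    { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 },
    ...
  ]
}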
Complexity Analysis
The complexity of Elasticsearch analysis depends on the complexity of the individual components (character filters, tokenizer, and token filters) used in the analyzer.
Time Complexity: The time complexity is generally O(n), where n is the length of the input text. Individual token filters add a per-token cost that depends on the algorithm; Porter stemming, for example, runs in time linear in the length of each token.
Space Complexity: The space complexity is primarily determined by the tokens generated and is also generally O(n), where n is the length of the input text, since the number of tokens is roughly proportional to the input size. Filters such as stop word removal reduce the number of tokens kept, and the size of the resulting index ultimately depends on the number of unique terms that remain after processing.
Alternative Approaches
While custom analyzers provide fine-grained control over analysis, Elasticsearch offers several built-in analyzers that can be used directly. For example, the `standard` analyzer provides reasonable defaults for general-purpose text analysis, while the `whitespace` analyzer simply splits text on whitespace. Built-in analyzers are simpler to configure but offer less flexibility in customizing the analysis process. Another approach is to use language-specific analyzers, such as the `english` analyzer, which handles English stemming and stop words.
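For comparison, a built-in or language-specific analyzer needs no `settings` block at all; the sketch below (the index name is an illustrative choice) simply names the `english` analyzer in the mapping:

PUT /my_english_index
{
  "mappings": {
    "properties": {
      "my_text_field": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}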
Conclusion
Elasticsearch analysis is a fundamental aspect of building effective search applications. Understanding the components of an analyzer (character filters, tokenizers, and token filters) allows you to customize the analysis process to meet the specific needs of your application. By carefully configuring analyzers, you can improve search relevance and ensure that users find the information they are looking for. This article showed how to configure a custom analyzer and examined the trade-offs of the pre-built alternatives.