Index Mappings
An index mapping defines how documents and their fields are stored and indexed in Elasticsearch. Maestro is responsible for taking published Song metadata and translating it into Elasticsearch documents. Arranger then uses the index mapping to generate our GraphQL server over these documents, enabling fast and flexible queries.
Depending on how Maestro is configured, it can index data into documents in one of two ways:
- File Centric Indexing: Each document indexed in Elasticsearch describes all information central to a specific file. Click here to see an example of a file centric JSON document.
- Analysis Centric Indexing: Each document indexed in Elasticsearch describes all information central to a specific analysis. Click here to see an example of an analysis centric JSON document.
File or Analysis Centric Indexing: If your queries focus on individual files and their attributes, choose file-centric indexing. If your queries center on analyses or participants and their associated data, choose analysis-centric indexing.
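For illustration, an abridged file-centric document might look like the following. This is only a sketch: it uses a handful of fields drawn from the mapping discussed later on this page, the field values are made up, and real documents produced by Maestro contain considerably more metadata.
{
  "object_id": "example-object-id",
  "study_id": "EXAMPLE-STUDY",
  "data_type": "Aligned Reads",
  "file_type": "BAM",
  "file_access": "controlled",
  "analysis": {
    "analysisStateHistory": [
      {
        "initialState": "UNPUBLISHED",
        "updatedState": "PUBLISHED",
        "updatedAt": "2024-01-01T00:00:00.000Z"
      }
    ]
  }
}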
The index mapping can be defined within an index template supplied to Elasticsearch on startup. In the next section we will break down the structure of an index template.
Our search platform is built on and compatible with version 7.x of Elasticsearch. Applications and queries need to follow Elasticsearch 7 syntax and conventions.
Breaking Down Index Templates
When broken down, the index template has four components: index_patterns, aliases, mappings, and settings.
{
"index_patterns": ["overture-*"],
"aliases": {
},
"mappings": {
},
"settings": {
}
}
The above code snippet references the Overture Quickstart index template. Feel free to keep it open as a reference; we will refer to it wherever appropriate.
Index Patterns
- In Elasticsearch, the index_patterns setting specifies which indices the index template should apply to.
- "index_patterns": ["overture-*"] means that the index template applies to any index whose name starts with overture-, as shown in the example below.
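To see the pattern in action, creating any index whose name matches overture-* will automatically pick up the template's mappings and settings. A minimal sketch, assuming the Quickstart credentials and host used elsewhere on this page, and a made-up index name of overture-pattern-demo:

# Create an empty index whose name matches the "overture-*" pattern
curl -u elastic:myelasticpassword -X PUT 'http://elasticsearch:9200/overture-pattern-demo'

# Confirm that the template's mappings were applied automatically
curl -u elastic:myelasticpassword 'http://elasticsearch:9200/overture-pattern-demo/_mapping?pretty'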
Aliases
Here we can define an alias for indices that use this template. An alias is a secondary, more generalized index name, typically used to group related indices.
- Here we define our alias as file_centric, providing some context on the method of indexing configured with Maestro:

"aliases": {
  "file_centric": {}
},
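Because the alias points at whichever index (or indices) Maestro writes to, applications can query file_centric without knowing the underlying index name. A minimal sketch, assuming the Quickstart credentials and host used elsewhere on this page:

# Search through the alias rather than a specific index name
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/file_centric/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match_all": {} }, "size": 1 }'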
Settings
The settings section configures index behavior. Each setting plays a role in defining how data is indexed, stored, and queried in Elasticsearch, and can be tuned for performance and scalability based on your specific use cases and requirements.
Most of these will be automatically configured by Maestro. However, we will outline the additional settings used in our index template, including analyzers, filters, and tokenizers: essential components in Elasticsearch that contribute to how text data is indexed, analyzed, and searched.
"settings": {
"analysis": {
"analyzer": {
"autocomplete_analyzed": {
"filter": ["lowercase", "edge_ngram"],
"tokenizer": "standard"
},
"autocomplete_prefix": {
"filter": ["lowercase", "edge_ngram"],
"tokenizer": "keyword"
},
"lowercase_keyword": {
"filter": ["lowercase"],
"tokenizer": "keyword"
}
},
"filter": {
"edge_ngram": {
"max_gram": "20",
"min_gram": "1",
"side": "front",
"type": "edge_ngram"
}
}
},
"index.max_result_window": 300000,
"index.number_of_shards": 3
}
- Analyzers are a combination of a tokenizer and optional filters that process text during indexing and searching in Elasticsearch.
- Filters in Elasticsearch are specific processing steps, typically used for text transformations within an analyzer or applied independently during indexing or querying. Filters improve search relevance and efficiency by modifying or discarding certain tokens.
- Tokens and Tokenizers are fundamental concepts in text processing and search indexing. Tokens are the individual units of text generated during tokenization. Tokenizers are the components responsible for breaking text down into tokens (tokenization); they define how text is segmented based on rules like whitespace, punctuation, or specific character patterns.
For more information on tokenizers, analyzers, and filters, refer to Elasticsearch's documentation on the Anatomy of an analyzer.
With this information, let's break down some of the settings found in our index template:
"lowercase_keyword": {
"filter": ["lowercase"],
"tokenizer": "keyword"
}
The above section defines an analyzer named lowercase_keyword. It uses the "keyword" tokenizer, which indexes the entire input as a single token. The "lowercase" filter is then applied to convert all tokens to lowercase, enabling case-insensitive search.
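You can observe this behavior directly with the _analyze API. The request below builds an equivalent transient analyzer from the same built-in tokenizer and filter, so it works without referencing a specific index (the credentials and host are assumed to match the Quickstart):

# The whole input is emitted as a single lowercased token: "tp53_mutation_report.pdf"
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/_analyze?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "tokenizer": "keyword", "filter": ["lowercase"], "text": "TP53_Mutation_Report.PDF" }'

Next, let's look at the custom edge_ngram filter defined in the settings: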
"filter": {
"edge_ngram": {
"max_gram": "20",
"min_gram": "1",
"side": "front",
"type": "edge_ngram"
  }
}
The above filter configuration is independent of any analyzer and defines an "edge_ngram" filter. This filter generates edge n-grams (substrings of specified lengths) from the beginning ("side": "front") of tokens during indexing. It facilitates partial matching and autocomplete functionality by creating prefixes of varying lengths ("min_gram" to "max_gram").
For more details on the edge_ngram filter and its parameters, refer to the Elasticsearch documentation on Edge NGram Token Filters.
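To get a feel for what the filter produces, the _analyze request below pairs the keyword tokenizer with an inline edge_ngram definition using the same min_gram and max_gram values as above. This is a sketch using the transient-analyzer form of the API, so it does not depend on the template being installed:

# Emits the prefixes "N", "NA", "NA1", "NA12", ... up to 20 characters
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/_analyze?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "tokenizer": "keyword",
    "filter": [ { "type": "edge_ngram", "min_gram": 1, "max_gram": 20 } ],
    "text": "NA12878.bam"
  }'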
The edge_ngram filter defined above is used in the analyzer shown below:
"analyzer": {
"autocomplete_analyzed": {
"filter": ["lowercase", "edge_ngram"],
"tokenizer": "standard"
},
Here, autocomplete_analyzed is defined with the standard tokenizer, which divides text into tokens based on word boundaries. The filters then make all tokens lowercase and apply the "edge_ngram" filter. This analyzer configuration is particularly useful for implementing autocomplete features, allowing users to search with partial terms and match results based on the generated n-grams.
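Once an index has been created from this template, the analyzer can also be exercised by name. A minimal sketch, assuming the Quickstart index name overture-quickstart-index; note how the standard tokenizer first splits the text into words, and each word is then lowercased and expanded into prefixes:

# "Aligned Reads" -> "a", "al", "ali", ... and "r", "re", "rea", ...
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/overture-quickstart-index/_analyze?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "analyzer": "autocomplete_analyzed", "text": "Aligned Reads" }'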
Mappings
The mappings section defines the structure of the documents in our index. Each field should relate to the data collected by Song and indexed by Maestro; this mapping is required for Elasticsearch and Arranger to make use of our data.
The following is a summary of the basic units of an index mapping:
"mappings": {
"properties": {
"analysis": {
"properties": {
"analysisStateHistory": {
"type": "nested",
"properties": {
"initialState": { "type": "keyword" },
"updatedState": { "type": "keyword" },
"updatedAt": { "type": "date" }
}
}
}
}
  }
}
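Because analysisStateHistory is mapped as a nested field (explained further below), its objects are indexed as separate hidden documents and must be searched with Elasticsearch's nested query. A hedged sketch, assuming the file_centric alias defined earlier and an illustrative state value of PUBLISHED:

# Find documents whose analysis was at some point updated to the PUBLISHED state
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/file_centric/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "nested": {
        "path": "analysis.analysisStateHistory",
        "query": { "term": { "analysis.analysisStateHistory.updatedState": "PUBLISHED" } }
      }
    }
  }'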
- Properties: Each field within the mappings section defines a specific attribute or property of the documents. These properties describe the structure and type of data that can be indexed.
- Field Names: Field names such as analysisStateHistory denote individual attributes within documents. They represent specific data points that help organize and categorize information.
- Types: Each field in your documents should have a defined data type to ensure Elasticsearch understands how to index and query your data effectively. A summary of common types is provided below:

| Type | Description |
| --- | --- |
| keyword | Exact value fields not intended for full-text search. Ideal for fields like IDs, keywords, and enums. |
| text | Full-text fields used for search and indexing. Analyzed by default to support partial matches. |
| integer | Integer numeric fields for storing whole numbers. Useful for numerical operations and aggregations. |
| date | Date/time fields that support date formats and time zones. Allows for date-based querying and sorting. |

- Nested Types are used to model hierarchical data structures within documents. They allow for the nesting of objects or arrays within a single field. This is particularly useful when dealing with entities that have multiple properties or attributes, such as the analysisStateHistory field shown above.
- copy_to & file_autocomplete: The copy_to parameter in Elasticsearch allows you to copy the values of one or more fields into a designated target field. This feature is useful when you want to create a single field that aggregates content from multiple other fields:
"mappings": {
"properties": {
"object_id": { "type": "keyword", "copy_to": ["file_autocomplete"] },
"study_id": { "type": "keyword" },
"data_type": { "type": "keyword" },
"file_type": { "type": "keyword" },
"file_access": { "type": "keyword" },
"file_autocomplete": {
"type": "keyword",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "autocomplete_analyzed",
"search_analyzer": "lowercase_keyword"
},
"lowercase": {
"type": "text",
"analyzer": "lowercase_keyword"
},
"prefix": {
"type": "text",
"analyzer": "autocomplete_prefix",
"search_analyzer": "lowercase_keyword"
}
}
}
}
}
The object_id field here is defined as "type": "keyword" and includes a copy_to parameter pointing to file_autocomplete.
- This means that the value of object_id will be copied into the file_autocomplete field.
- The file_autocomplete field is itself defined as "type": "keyword" and includes the sub-fields (analyzed, lowercase, and prefix) using the three text analyzers (autocomplete_analyzed, autocomplete_prefix, and lowercase_keyword) configured in our settings section.
- These sub-fields enable different types of searches and queries on the content copied from object_id.
By aggregating relevant metadata into file_autocomplete, this setup enables users to search across multiple fields related to any given object_id.
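Putting this together, an autocomplete-style search can target the analyzed sub-field so that partial input matches the values copied into file_autocomplete. A minimal sketch, assuming the file_centric alias and a made-up partial identifier:

# The partial input "a1b2" matches object_id values whose copied tokens start with it
curl -u elastic:myelasticpassword -X POST 'http://elasticsearch:9200/file_centric/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "file_autocomplete.analyzed": "a1b2" } } }'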
Updating an Index Template
- Prepare your Template File: Start by preparing your Elasticsearch index template file with your desired mappings and settings. Ensure this reflects the structure and configuration of the data model being used in Song.
- Update the Index Template: Use the following curl command to update Elasticsearch with your index template. The values here reflect those of the Overture Quickstart.

curl -u elastic:myelasticpassword -X PUT 'http://elasticsearch:9200/_template/index_template' -H 'Content-Type: application/json' -d @/directory/to/your/index_template.json

  - Replace elastic:myelasticpassword with your Elasticsearch username and password combination
  - Adjust the URL (http://elasticsearch:9200/_template/index_template) to match your Elasticsearch host
  - Ensure that -d @/directory/to/your/index_template.json points to the correct location of your template file
- Create a new Alias: Use the following curl command to create an alias for your index:

curl -u elastic:myelasticpassword -X PUT 'http://elasticsearch:9200/{index-name}/_alias/file_centric' -H 'Content-Type: application/json' -d '{"is_write_index": true}'

  - Replace elastic:myelasticpassword with your Elasticsearch username and password combination
  - Adjust the URL (http://elasticsearch:9200/{index-name}/_alias/file_centric) to match your Elasticsearch host, index name, and alias name (file_centric)
  - {"is_write_index": true} sets the alias as the write index, meaning new documents submitted to this alias through indexing operations (Maestro) will be routed to this index
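After both requests succeed, it is worth confirming that the template and alias are registered before pointing Maestro at the cluster. A minimal sketch using the same credentials and host as above:

# Confirm the template is installed
curl -u elastic:myelasticpassword 'http://elasticsearch:9200/_template/index_template?pretty'

# Confirm which index (or indices) the alias points to
curl -u elastic:myelasticpassword 'http://elasticsearch:9200/_alias/file_centric?pretty'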
Quickstart Index
The index mapping template used within the Quickstart can be found in the directory linked here.
Upon initial deployment, this index template is uploaded to Elasticsearch, and a new index, overture-quickstart-index, is created in alignment with the defined "index_patterns": ["overture-*"]. The Docker container and associated script that automate this process are located in the docker-compose.yml here.
If you change this template to match your own custom schema, and wish to populate Elasticsearch with your new index template, ensure you check the following:
- Stage Arranger Variables: Ensure the environment variables in Stage are adjusted to specify the updated documentType and index name according to your modified Elasticsearch mapping.
- Maestro Elasticsearch Variables: Maestro requires specific information about the alias and centrality of the index. Make sure these variables are updated.
- Arranger Configuration Files: Update Arranger configuration files to reflect any changes made to your index mapping. This ensures that Arranger, the front-end search interface, correctly interprets and interacts with your Elasticsearch data.
Updating Arranger Configuration Files: For detailed instructions on customizing the search interface in Arranger to align with your updated mapping, refer to the next section on search interface customization.