N-gram Substring Matching in Tantivy
By Thomas Sileo • 5 min read
I am using tantivy to power bigbrowser’s search, and I wanted substring matching over indexed URLs.
A search for github should match https://github.com/tsileo/repo.
I ended up using Tantivy’s n-gram tokenizer but ran into several subtle issues that I think are worth sharing.
N-gram 101
Most tokenizers split text on word boundaries (optionally stemming the resulting words). The URL https://github.com/tantivy-search/tantivy is turned into tokens like https, github, com, tantivy, search.
N-gram tokenization breaks text into overlapping character sequences. With min_gram=3 and max_gram=8:
Input: "github"
Output: "git", "gith", "githu", "github", "ith", "ithu", "ithub", "thu", "thub", "hub"
With this solution, searching for hub will match, as the query term hub exists as an indexed token.
Here’s how to set it up in Tantivy:
use tantivy::tokenizer::{NgramTokenizer, TextAnalyzer, LowerCaser};

// Character n-grams between 3 and 8 characters, lowercased before indexing.
let ngram_tokenizer = TextAnalyzer::builder(NgramTokenizer::new(3, 8, false)?)
    .filter(LowerCaser)
    .build();
index.tokenizers().register("ngram", ngram_tokenizer);
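Registering the tokenizer only covers the analysis side; the field also has to reference it in the schema. Here's a minimal sketch, assuming the URL field is named url_ngram (the field name and options here are mine, not necessarily the exact bigbrowser schema):

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

let mut schema_builder = Schema::builder();
// Point the field at the "ngram" tokenizer registered above.
let url_ngram_options = TextOptions::default().set_indexing_options(
    TextFieldIndexing::default()
        .set_tokenizer("ngram")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions),
);
let url_ngram_field = schema_builder.add_text_field("url_ngram", url_ngram_options);
let schema = schema_builder.build();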
Gotcha #1: Double n-gramming
After adding an n-gram field for URLs, I realized almost all queries ended up with some URL substring matches, generating a lot of noise.
// Building a query for the url_ngram field
let mut tokenizer = index.tokenizer_for_field(url_ngram_field)?;
let mut stream = tokenizer.token_stream(query_text);
// Collect tokens and build term queries for each...
For most tokenizers, you want to run the query through the same tokenizer so the query terms match what's indexed.
For instance, with a stemmer that converts running to run, you do want to stem query terms with the same process before searching.
But n-gram indexes are a different beast: re-using the n-gram tokenizer on the query ends up generating far too many matches.
Note that I am building my own query parser, but using the default QueryParser would result in the same behavior, as it automatically runs query terms through the field tokenizer under the hood.
With my first implementation, searching for elastic would match any URL containing plastic because they share trigrams like las, ast, sti, tic.
Now this behavior may be useful for other use cases, but for powering substring matching, it results in a lot of noise and false positives.
The Fix
Don’t n-gram the query. Use the full search term as a single token:
use tantivy::{
    query::{BooleanQuery, Occur, Query, TermQuery},
    schema::{Field, IndexRecordOption},
    Index, Term,
};

// WRONG: Tokenizing the query with the field's n-gram tokenizer
fn search_ngram_wrong(index: &Index, field: Field, query_text: &str) -> Box<dyn Query> {
    let mut tokenizer = index.tokenizer_for_field(field).unwrap();
    let mut token_stream = tokenizer.token_stream(query_text);
    let mut term_queries: Vec<(Occur, Box<dyn Query>)> = Vec::new();
    while token_stream.advance() {
        let token = token_stream.token().text.to_string();
        let term = Term::from_field_text(field, &token);
        let query = TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions);
        term_queries.push((Occur::Should, Box::new(query)));
    }
    // "elastic" becomes ["ela", "elas", "las", "ast", "sti", "tic", ...]
    // Matches "plastic" because it shares "las", "ast", "sti", "tic"
    Box::new(BooleanQuery::new(term_queries))
}
// CORRECT: Use the query term as-is (single term lookup)
fn search_ngram_correct(field: Field, query_text: &str) -> Box<dyn Query> {
    // Don't tokenize! The index already contains all substrings.
    // Just look up the query term directly.
    let term = Term::from_field_text(field, &query_text.to_lowercase());
    Box::new(TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions))
    // "elastic" stays as "elastic", only matches if "elastic" is an indexed n-gram
}
The query term elastic is now searched as a single token against the indexed n-grams. It matches only if elastic was indexed—meaning the original text contained elastic.
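A quick usage sketch, assuming the url_ngram_field from the schema above and an index with documents already committed (not the exact bigbrowser code):

use tantivy::collector::TopDocs;

let reader = index.reader()?;
let searcher = reader.searcher();
let query = search_ngram_correct(url_ngram_field, "elastic");
// Only documents whose URL actually contains "elastic" match now.
let top_docs = searcher.search(&*query, &TopDocs::with_limit(10))?;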
Gotcha #2: Queries Longer Than max_gram
With max_gram=8, what happens when I search for elasticsearch (13 characters)?
The term elasticsearch won’t be in the index, as the longest indexed n-grams are only 8 characters: for elasticsearch, the longest ones are elastics, lasticse, asticsea, sticsear, ticsearc and icsearch, so the full 13-character term never appears.
For example, this edge case is documented for Elasticsearch itself:
“Search terms longer than the max_gram length may not match any indexed terms.”
Solutions
Option 1: Increase max_gram
Simple, but it bloats the index considerably. Each additional character of max_gram adds roughly N new tokens per indexed value, where N is the average text length: for a 50-character URL, raising max_gram from 8 to 13 grows the per-document token count from 273 to 473.
Option 2: Truncate the query
Cut the query down to max_gram characters. That’s one of the solutions suggested by Elasticsearch, but it would not help with false positives.
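A minimal sketch of the truncation, assuming query and max_gram are in scope (chars() keeps the cut on a character boundary):

let truncated: String = query.chars().take(max_gram).collect();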
Option 3: Overlapping chunks
Split the query into overlapping chunks of max_gram size and combine them into a boolean query:
// Sliding window of max_gram characters over the query.
// Note: byte-based slicing, which is fine for ASCII queries/URLs but would
// panic if a multi-byte character falls on a chunk boundary.
let chunks: Vec<String> = query
    .char_indices()
    .filter_map(|(i, _)| {
        let end = i + max_gram;
        if end <= query.len() {
            Some(query[i..end].to_string())
        } else {
            None
        }
    })
    .collect();
If all chunks match, the full substring likely exists. This is more complex but more precise.
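For completeness, here’s a sketch of the combination step, reusing the names from the snippets above (field, chunks) and requiring every chunk to match via Occur::Must:

// Every chunk must be present for a document to match.
let sub_queries: Vec<(Occur, Box<dyn Query>)> = chunks
    .iter()
    .map(|chunk| {
        let term = Term::from_field_text(field, chunk);
        let term_query: Box<dyn Query> =
            Box::new(TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions));
        (Occur::Must, term_query)
    })
    .collect();
let query = BooleanQuery::new(sub_queries);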
While doing some research, I found that another database, Milvus, also advertises this behavior in its documentation.
I picked option 3 as it seems the “smartest” way to handle the query > max_gram case:
- It keeps the index small with the current min/max gram values
- It keeps precision: requiring all overlapping chunks to match prevents false positives
