N-gram Substring Matching in Tantivy
By Thomas Sileo • 5 min read
I am using tantivy to power bigbrowser’s search, and I wanted substring matching over indexed URLs.
A search for github should match https://github.com/tsileo/repo.
I ended up using Tantivy’s n-gram tokenizer but ran into several subtle issues that I think are worth sharing.
N-gram 101
Most tokenizers split text on word boundaries (optionally stemming the resulting words). The URL https://github.com/tantivy-search/tantivy is turned into tokens like https, github, com, tantivy, search.
N-gram tokenization breaks text into overlapping character sequences. With min_gram=3 and max_gram=8:
Input: "github"
Output: "git", "gith", "githu", "github", "ith", "ithu", "ithub", "thu", "thub", "hub"
With this solution, searching for hub will match, as the query term hub exists as an indexed token.
Here’s how to set it up in Tantivy:
use tantivy::tokenizer::{NgramTokenizer, TextAnalyzer, LowerCaser};

// Character n-grams between 3 and 8 characters, lowercased before indexing.
let ngram_tokenizer = TextAnalyzer::builder(NgramTokenizer::new(3, 8, false)?)
    .filter(LowerCaser)
    .build();
index.tokenizers().register("ngram", ngram_tokenizer);
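Registering the tokenizer only covers the analysis side; the field also has to reference it in the schema. Here's a minimal sketch, assuming the URL field is named url_ngram (the field name and options here are mine, not necessarily the exact bigbrowser schema):

use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};

let mut schema_builder = Schema::builder();
// Point the field at the "ngram" tokenizer registered above.
let url_ngram_options = TextOptions::default().set_indexing_options(
    TextFieldIndexing::default()
        .set_tokenizer("ngram")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions),
);
let url_ngram_field = schema_builder.add_text_field("url_ngram", url_ngram_options);
let schema = schema_builder.build();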
Gotcha #1: Double n-gramming
After adding an n-gram field for URLs, I realized almost all queries ended up with some URL substring matches, generating a lot of noise.
// Building a query for the url_ngram field
let mut tokenizer = index.tokenizer_for_field(url_ngram_field)?;
let mut stream = tokenizer.token_stream(query_text);
// Collect tokens and build term queries for each...
For most tokenizers, you want to run the query through the same tokenizer so the query terms match what's indexed.
For instance, with a stemmer that converts running to run, you do want to stem query terms with the same process before searching.
But n-gram indexes are a different beast: re-using the n-gram tokenizer on the query ends up generating far too many matches.
Note that I am building my own query parser, but using the default QueryParser would result in the same behavior, as it automatically runs query terms through the field tokenizer under the hood.
With my first implementation, searching for elastic would match any URL containing plastic because they share trigrams like las, ast, sti, tic.
Now this behavior may be useful for other use cases, but for powering substring matching, it results in a lot of noise and false positives.
The Fix
Don’t n-gram the query. Use the full search term as a single token:
use tantivy::{
    query::{BooleanQuery, Occur, Query, TermQuery},
    schema::{Field, IndexRecordOption},
    Index, Term,
};

// WRONG: Tokenizing the query with the field's n-gram tokenizer
fn search_ngram_wrong(index: &Index, field: Field, query_text: &str) -> Box<dyn Query> {
    let mut tokenizer = index.tokenizer_for_field(field).unwrap();
    let mut token_stream = tokenizer.token_stream(query_text);
    let mut term_queries: Vec<(Occur, Box<dyn Query>)> = Vec::new();
    while token_stream.advance() {
        let token = token_stream.token().text.to_string();
        let term = Term::from_field_text(field, &token);
        let query = TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions);
        term_queries.push((Occur::Should, Box::new(query)));
    }
    // "elastic" becomes ["ela", "elas", "las", "ast", "sti", "tic", ...]
    // Matches "plastic" because it shares "las", "ast", "sti", "tic"
    Box::new(BooleanQuery::new(term_queries))
}
// CORRECT: Use the query term as-is (single term lookup)
fn search_ngram_correct(field: Field, query_text: &str) -> Box<dyn Query> {
    // Don't tokenize! The index already contains all substrings.
    // Just look up the query term directly.
    let term = Term::from_field_text(field, &query_text.to_lowercase());
    Box::new(TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions))
    // "elastic" stays as "elastic", only matches if "elastic" is an indexed n-gram
}
The query term elastic is now searched as a single token against the indexed n-grams. It matches only if elastic was indexed—meaning the original text contained elastic.
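A quick usage sketch, assuming the url_ngram_field from the schema above and an index with documents already committed (not the exact bigbrowser code):

use tantivy::collector::TopDocs;

let reader = index.reader()?;
let searcher = reader.searcher();
let query = search_ngram_correct(url_ngram_field, "elastic");
// Only documents whose URL actually contains "elastic" match now.
let top_docs = searcher.search(&*query, &TopDocs::with_limit(10))?;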
Gotcha #2: Queries Longer Than max_gram
With max_gram=8, what happens when I search for elasticsearch (13 characters)?
The term elasticsearch won’t be in the index, as the longest indexed n-grams are only 8 characters: for elasticsearch, the longest ones are elastics, lasticse, asticsea, sticsear, ticsearc and icsearch, so the full 13-character term never appears.
For example, this edge case is documented for Elasticsearch itself:
“Search terms longer than the max_gram length may not match any indexed terms.”
Solutions
Option 1: Increase max_gram
Simple, but it bloats the index considerably. Each additional character of max_gram adds roughly N new tokens per indexed value, where N is the average text length: for a 50-character URL, raising max_gram from 8 to 13 grows the per-document token count from 273 to 473.
Option 2: Truncate the query
Cut the query down to max_gram characters. That’s one of the solutions suggested by Elasticsearch, but it would not help with false positives.
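A minimal sketch of the truncation, assuming query and max_gram are in scope (chars() keeps the cut on a character boundary):

let truncated: String = query.chars().take(max_gram).collect();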
Option 3: Overlapping chunks
Split the query into overlapping chunks of max_gram size and combine them into a boolean query:
// Sliding window of max_gram characters over the query.
// Note: byte-based slicing, which is fine for ASCII queries/URLs but would
// panic if a multi-byte character falls on a chunk boundary.
let chunks: Vec<String> = query
    .char_indices()
    .filter_map(|(i, _)| {
        let end = i + max_gram;
        if end <= query.len() {
            Some(query[i..end].to_string())
        } else {
            None
        }
    })
    .collect();
If all chunks match, the full substring likely exists. This is more complex but more precise.
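For completeness, here’s a sketch of the combination step, reusing the names from the snippets above (field, chunks) and requiring every chunk to match via Occur::Must:

// Every chunk must be present for a document to match.
let sub_queries: Vec<(Occur, Box<dyn Query>)> = chunks
    .iter()
    .map(|chunk| {
        let term = Term::from_field_text(field, chunk);
        let term_query: Box<dyn Query> =
            Box::new(TermQuery::new(term, IndexRecordOption::WithFreqsAndPositions));
        (Occur::Must, term_query)
    })
    .collect();
let query = BooleanQuery::new(sub_queries);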
While doing some research, I found that another database, Milvus, also advertises this behavior in its documentation.
I picked option 3 as it seems the “smartest” way to handle the query > max_gram case:
- It keeps the index small with the current min/max gram values
- It keeps precision: requiring all overlapping chunks to match prevents false positives
