What are Inverted Indices?
Back when Laith and I were building the Nera search engine, we waded into technical territory neither of us had touched in our years of college. There were many backend pieces to the puzzle, and LLMs were certainly one of them. But we found that building great search usually means choosing technologies that complement LLMs; they aren't a one-and-done solution.
One of these puzzle pieces was the inverted index.
What is an Inverted Index?
An inverted index is a data structure that maps each unique term—think words or phrases—to the specific documents (and often positions within those documents) where it appears. By organizing data this way, we could dramatically speed up full-text searches, making it possible to quickly retrieve relevant results from a massive collection of documents.
Think of it like the index at the back of a textbook:
- Instead of reading every page to find mentions of "neural networks"
- You flip to the index, find "neural networks", and it tells you: pages 42, 87, 103
An inverted index does this for search engines.
How We Built It for Nera
Step 1: Data Collection
We started by scraping the internet, focusing primarily on local websites for cities like Waco to gather information about things to do. We collected this raw data—think event listings, attraction descriptions, or activity guides—and converted it into individual documents.
Step 2: Text Preprocessing
Then, to prepare the text for indexing, we ran it through a stopword filter, stripping out common words like 'the,' 'and,' or 'is' that don't carry much meaning for search purposes.
For example:
"The museum is open on Saturday"
→ ["museum", "open", "Saturday"]
Step 3: Building the Index
Next, for each unique term that was left after filtering (say, 'museum' or 'hiking' from a Waco activities page), the inverted index records where that term appears across all our documents.
We implemented ours as a hash table, with:
- Keys: Filtered terms
- Values: Lists containing document IDs and positions within the documents
Visual Example
Here's a simplified view:
Term | Document IDs (with positions)
-------------|--------------------------------
"museum" | [doc1: [5, 23], doc5: [12], doc8: [3]]
"hiking" | [doc2: [7], doc4: [15, 22]]
"restaurant" | [doc3: [9], doc5: [18], doc7: [2]]
"coffee" | [doc1: [31], doc3: [14], doc6: [8]]
How Search Works
When a user searches for "museum coffee":
- Tokenize the query: ["museum", "coffee"]
- Look up each term in the inverted index
- Find documents containing both terms:
  - "museum" → docs 1, 5, 8
  - "coffee" → docs 1, 3, 6
  - Intersection → doc 1 (contains both!)
- Rank results (by relevance, position, frequency, etc.)
- Return top results
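In code, those steps boil down to a couple of set operations. A quick sketch using the toy index above, assuming each posting list has already been reduced to a set of document IDs:

postings = {
    "museum": {1, 5, 8},
    "coffee": {1, 3, 6},
}

def matching_docs(query_terms):
    # Start from the first term's documents and intersect with the rest
    result = set(postings.get(query_terms[0], set()))
    for term in query_terms[1:]:
        result &= postings.get(term, set())
    return result

matching_docs(["museum", "coffee"])  # {1}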
Looking up the query terms is O(k), where k is the number of terms in the query (plus the cost of intersecting their posting lists), not O(n) over all n documents. Massive speedup!
Why It's Powerful
Speed
- Without an inverted index: O(n × k), scanning all n documents for each of k query terms
- With an inverted index: O(k), just a hash-table lookup per query term
For thousands of documents, this is the difference between seconds and milliseconds.
Flexibility
You can extend inverted indices with:
- Term frequency (TF): How often does "museum" appear in doc 1?
- Inverse document frequency (IDF): How rare is "museum" across all docs? (TF-IDF, which combines these two, is sketched after this list)
- Position information: For phrase search ("museum of art")
- Metadata: Category, date, location tags
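The first two of those signals combine into the classic TF-IDF score. Here's a rough sketch over the hash-table layout described above (term → list of (doc_id, position) pairs); real implementations differ in their smoothing details:

import math

def tf_idf(term, doc_id, index, total_docs):
    postings = index.get(term, [])
    # Term frequency: how often the term appears in this document
    tf = sum(1 for d, _ in postings if d == doc_id)
    # Document frequency: how many distinct documents contain the term
    df = len({d for d, _ in postings})
    if tf == 0 or df == 0:
        return 0.0
    # Rare terms across the collection get a larger IDF weight
    idf = math.log(total_docs / df)
    return tf * idf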
Foundation of Search Engines
Almost every search engine uses inverted indices:
- Elasticsearch: Built on Apache Lucene's inverted index
- Google: Massive distributed inverted indices
- Database full-text search: PostgreSQL's GIN index, MongoDB text indexes
Implementation in Nera
Here's a simplified version of how we built it:
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> list of (doc_id, position) postings
        self.index = defaultdict(list)
        self.stopwords = {"the", "is", "and", "or", "in", "on"}

    def tokenize(self, text):
        # Lowercase, split on whitespace, and drop stopwords
        words = text.lower().split()
        return [w for w in words if w not in self.stopwords]

    def add_document(self, doc_id, text):
        tokens = self.tokenize(text)
        for position, token in enumerate(tokens):
            self.index[token].append((doc_id, position))

    def search(self, query):
        tokens = self.tokenize(query)
        if not tokens:
            return []
        # Find documents containing all query terms
        result_docs = set()
        for i, token in enumerate(tokens):
            docs = {doc_id for doc_id, _ in self.index.get(token, [])}
            if i == 0:
                result_docs = docs
            else:
                result_docs &= docs  # Intersection
        return list(result_docs)

# Usage
index = InvertedIndex()
index.add_document(1, "The museum is open on Saturday")
index.add_document(2, "Great hiking trails near the park")
index.add_document(3, "Coffee shop with museum views")

results = index.search("museum coffee")  # Returns [3], the only doc containing both terms
Challenges We Faced
1. Scale
As we added more cities, the index grew huge. We needed to:
- Shard the index across multiple servers
- Compress posting lists (document ID lists)
- Use efficient encoding (variable-byte encoding, sketched below)
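Variable-byte encoding stores small integers in fewer bytes, which pays off because posting lists are usually stored as gaps between sorted document IDs rather than the IDs themselves. A simplified encoder/decoder (production systems add block-level tricks on top):

def vb_encode(numbers):
    # Each number becomes 7-bit chunks; the high bit flags the final byte
    out = bytearray()
    for n in numbers:
        chunks = [n & 0x7F]
        n >>= 7
        while n:
            chunks.append(n & 0x7F)
            n >>= 7
        chunks.reverse()
        chunks[-1] |= 0x80  # mark the last byte of this number
        out.extend(chunks)
    return bytes(out)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte reached, emit the number
            numbers.append(n)
            n = 0
    return numbers

# Store gaps, not raw IDs: doc IDs [3, 14, 159] become gaps [3, 11, 145]
vb_decode(vb_encode([3, 11, 145]))  # [3, 11, 145]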
2. Updates
Adding new documents meant rebuilding parts of the index. Solutions:
- Segment-based approach (like Lucene)
- Periodic batch updates
- Delta indices for recent changes (sketched below)
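One minimal version of the delta-index idea, reusing the InvertedIndex class shown earlier: new documents land in a small secondary index, searches consult both, and the delta gets folded into the main index during quiet periods. (This is a sketch; Lucene's segment model is considerably more sophisticated.)

class SegmentedIndex:
    def __init__(self):
        self.main = InvertedIndex()   # large, rarely rewritten
        self.delta = InvertedIndex()  # small, absorbs recent additions

    def add_document(self, doc_id, text):
        self.delta.add_document(doc_id, text)

    def search(self, query):
        # A document may match via either index
        return list(set(self.main.search(query)) | set(self.delta.search(query)))

    def merge(self):
        # Periodically fold the delta's postings into the main index
        for term, postings in self.delta.index.items():
            self.main.index[term].extend(postings)
        self.delta = InvertedIndex()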
3. Relevance
Not all matches are equal. We added:
- TF-IDF scoring: Rare terms matter more
- BM25 ranking: Modern relevance algorithm (sketched after this list)
- Geo-proximity: Boost nearby results
- Freshness: Recent events ranked higher
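BM25 deserves a closer look, since it's the default ranking function in Lucene and Elasticsearch. A rough sketch over the same posting-list layout; k1 and b are the usual tuning constants, and this leaves out the location and freshness boosts layered on top:

import math

def bm25(query_terms, doc_id, index, doc_lengths, k1=1.5, b=0.75):
    # doc_lengths: {doc_id: token count}; index: term -> [(doc_id, position), ...]
    N = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / N
    score = 0.0
    for term in query_terms:
        postings = index.get(term, [])
        tf = sum(1 for d, _ in postings if d == doc_id)  # term frequency in this doc
        df = len({d for d, _ in postings})               # docs containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lengths[doc_id] / avgdl))
        score += idf * norm
    return score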
And That's It!
Inverted indices are one of those "simple but profound" data structures. Once you understand them, you realize they're everywhere in search, databases, and information retrieval.
Building Nera taught me that great search isn't just about LLMs and embeddings—it's about choosing the right combination of classical CS data structures and modern ML techniques.
If you're building any kind of search or text analysis system, understanding inverted indices is essential. They're fast, flexible, and battle-tested at scale.
Cool, right?
Want to dive deeper? Check out: