倒排索引 (Inverted Index)

#database #indexing #inverted-index #full-text-search #system-design

為什麼需要 Inverted Index

「我有一百萬篇文件，要找所有包含單字 apple 的文件。」

如果沒索引：逐篇掃描 → O(n × 文件大小)，完全不可行。

Inverted Index 的核心概念：預先建立「單字 → 出現在哪些文件」的映射表，查詢時直接查對應的 posting list。

Forward vs Inverted

Forward Index（正排索引：文件 → 單字）

Doc1 → {apple, banana, cat}
Doc2 → {banana, dog}
Doc3 → {apple, dog, elephant}

Inverted Index（倒排索引：單字 → 文件）

apple    → [Doc1, Doc3]
banana   → [Doc1, Doc2]
cat      → [Doc1]
dog      → [Doc2, Doc3]
elephant → [Doc3]

查詢 "apple AND dog"：

apple 的 posting list = [Doc1, Doc3]
dog 的 posting list = [Doc2, Doc3]
交集 = [Doc3] ✓

資料結構兩要素

1. Term Dictionary（詞典）

所有出現過的 term 的集合，通常用 B+Tree 或 FST（Finite State Transducer）儲存以支援快速查詢。

2. Posting List（倒排表）

每個 term 對應一條 posting list，列出所有出現該 term 的文件。

每個 posting（entry）可能包含：

欄位	用途
Document ID	文件識別（必要）
Term Frequency (TF)	該 term 在文件中出現幾次，用於相關性排名（BM25、TF-IDF）
Positions	term 在文件中的位置，用於 phrase query
Payload	附加資訊（罕用）

範例：

apple → [(Doc1, TF=2, pos=[2,7]), (Doc3, TF=1, pos=[5])]

查詢類型

單詞查詢

查詢：apple
→ 直接取 apple 的 posting list
→ [Doc1, Doc3]

布林查詢（AND / OR / NOT）

查詢：apple AND dog
  apple → [Doc1, Doc3]
  dog   → [Doc2, Doc3]
  intersection → [Doc3]

查詢：apple OR banana
  apple  → [Doc1, Doc3]
  banana → [Doc1, Doc2]
  union → [Doc1, Doc2, Doc3]

Phrase Query（短語查詢）

需要 positions 資訊。

查詢："apple pie"

步驟：
1. 取 apple 的 posting list：含 positions
   apple → (Doc1, pos=[2])
           (Doc3, pos=[5,9])
2. 取 pie 的 posting list：
   pie   → (Doc1, pos=[3])
           (Doc3, pos=[12])
3. 只有當 pie 的位置 = apple 位置 + 1 才匹配
   Doc1: apple@2, pie@3 → 3 = 2+1 ✓ 匹配
   Doc3: apple@5, pie@12 → ✗；apple@9, pie@12 → ✗

結果：Doc1

模糊查詢 / 前綴查詢

透過 Term Dictionary 的 B+Tree 或 FST 結構，支援 app*、ap?le、編輯距離模糊比對。

建立流程

原始文件
    │
    ▼ Tokenization 斷詞
"Apple pies are tasty" → ["Apple", "pies", "are", "tasty"]
    │
    ▼ Normalization
- 轉小寫：Apple → apple
- 去除 stop words：are (通常濾掉)
- Stemming / Lemmatization：pies → pie
    │
    ▼
["apple", "pie", "tasty"]
    │
    ▼ 建立 posting list
apple → [(Doc1, pos=[0])]
pie   → [(Doc1, pos=[1])]
tasty → [(Doc1, pos=[3])]

優缺點

優點

優點	說明
全文檢索極快	term → posting list 是 O(1)；掃 list 線性
支援布林、短語、模糊查詢	核心資料結構就支援，不用額外設計
可做相關性排名	TF-IDF、BM25 都建立在 term frequency 上

缺點

缺點	說明
建立成本高	要做 tokenize、normalize、stemming、index 建構
儲存成本高	Posting list 可能很長，常配合壓縮（VarInt、Roaring Bitmap）
更新昂貴	新增 / 刪除文件要更新對應的所有 term 的 posting list
不適合精確結構化查詢	`WHERE age = 30` 這種用 B-Tree 更合適

Simplification-with-exceptions

邊界情況

中文分詞需要特殊處理 — 英文空白分詞，中文要用 jieba、IK Analyzer 等斷詞工具。
Near-real-time indexing — Elasticsearch 採 segment 結構（類似 LSM），新文件
寫到記憶體 segment，定期 flush + merge → 平衡寫入效能與查詢延遲（通常 refresh
interval 1 秒）。
Index 可以很大 — 大型搜尋引擎的索引常比原始資料還大幾倍，需要 sharding 到
多機（Elasticsearch 的 shard 機制）。
壓縮很重要 — Lucene 用 Frame of Reference + VarInt 壓縮 posting list；Roaring
Bitmap 用於 bitset 運算（高速交集 / 聯集）。

實際應用

系統 / 類型	用法
Elasticsearch / OpenSearch	底層 Lucene，是搜尋引擎產業標準
Apache Solr	同樣基於 Lucene
PostgreSQL FTS	`to_tsvector()` + GIN 索引（GIN = Generalized Inverted Index）
MySQL FULLTEXT	`MATCH ... AGAINST` 語法
IDE / Code Search	VSCode、Sourcegraph 找 symbol / 字串
Spam / 內容過濾	關鍵字索引

Inverted Index vs 其他索引

需求	首選
精確欄位等值 / 範圍	04-Database-Indexing/02-B-Tree-Index
等值閃電查詢	04-Database-Indexing/04-Hash-Index
地理鄰近	04-Database-Indexing/05-Geospatial-Index
關鍵字 / 全文搜尋	Inverted Index ✓
超高寫入吞吐	04-Database-Indexing/03-LSM-Tree

04-Database-Indexing/01-Indexing-Overview
04-Database-Indexing/02-B-Tree-Index — Term Dictionary 底層也常用 B+Tree
04-Database-Indexing/03-LSM-Tree — Elasticsearch 的 segment 結構類似 LSM
04-Database-Indexing/Practice-Database-Indexing