Document Retrieval Interactive Demo
Build a miniature search engine in the browser. Edit the document collection, ask a query, and see how retrieval ranks the most relevant documents using TF-IDF and cosine similarity.
Top retrieved document: Retrieval Augmented Generation (27% similarity for the current query).
Collection summary: 4 documents, 65 terms, 6 query tokens.
1. Enter the query
The query is converted into a vector and compared with each document.
Query words: 6. Corpus words: 75. Vocabulary: 65.
2. Edit the document collection
Each document becomes a searchable vector.
Ranked retrieval results
A higher cosine similarity means the query vector points in a direction closer to that of the document vector.
Retrieval Augmented Generation
Retrieval augmented generation first retrieves relevant context from a document collection. A language model can then use that context to answer questions with better grounding.
Document Retrieval
Document retrieval is the process of finding the most relevant documents for a user query. A retrieval system preprocesses text, builds an index, scores query-document similarity, and ranks the results.
Transformer Search
Modern AI search systems may use transformer embeddings to represent queries and documents. Similar meanings can be retrieved even when the exact words are different.
Text Classification
Text classification assigns labels to text, such as topic, sentiment, intent, or toxicity. It predicts a category rather than returning a ranked list of documents.
Document term table
The highest-weighted terms in each document indicate what that document is mostly about.
Document Retrieval
retrieval: TF 0.1 x IDF 1.511
builds: TF 0.05 x IDF 1.916
finding: TF 0.05 x IDF 1.916
index: TF 0.05 x IDF 1.916
most: TF 0.05 x IDF 1.916
preprocesses: TF 0.05 x IDF 1.916
Transformer Search
ai: TF 0.053 x IDF 1.916
different: TF 0.053 x IDF 1.916
embeddings: TF 0.053 x IDF 1.916
even: TF 0.053 x IDF 1.916
exact: TF 0.053 x IDF 1.916
may: TF 0.053 x IDF 1.916
Text Classification
text: TF 0.111 x IDF 1.511
assigns: TF 0.056 x IDF 1.916
category: TF 0.056 x IDF 1.916
classification: TF 0.056 x IDF 1.916
intent: TF 0.056 x IDF 1.916
labels: TF 0.056 x IDF 1.916
Retrieval Augmented Generation
context: TF 0.111 x IDF 1.916
answer: TF 0.056 x IDF 1.916
augmented: TF 0.056 x IDF 1.916
better: TF 0.056 x IDF 1.916
collection: TF 0.056 x IDF 1.916
first: TF 0.056 x IDF 1.916
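The two IDF values in the table, 1.916 and 1.511, are consistent with a smoothed formula, idf = ln((1 + N) / (1 + df)) + 1, with N = 4 documents and df equal to 1 or 2. This is the same form as the smoothed IDF used by scikit-learn's TfidfTransformer. The demo does not state its formula, so the sketch below is an inference from the displayed numbers, and the function name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Assumed formula, inferred from the table: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# N = 4 documents, matching the size of the demo collection.
print(round(smoothed_idf(4, 1), 3))  # 1.916: term found in one document
print(round(smoothed_idf(4, 2), 3))  # 1.511: term found in two documents
```

Under this reading, "retrieval" and "text" each occur in two of the four documents, which is why their IDF is lower than that of the single-document terms.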
Text input
A small document collection and a user query are entered.
Tokenisation
Text is split into searchable terms, and common stop words can be removed.
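A minimal sketch of this step, assuming lowercase alphanumeric tokens and a small stop-word list. Both the `tokenize` function and the `STOP_WORDS` set are hypothetical; the demo's actual rules and list are not shown.

```python
import re

# Hypothetical stop-word list; the demo's actual list is not shown.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "for", "to", "in"}

def tokenize(text: str, remove_stop_words: bool = True) -> list[str]:
    """Lowercase the text, split it into alphanumeric runs, optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(tokenize("Document retrieval is the process of finding relevant documents."))
# ['document', 'retrieval', 'process', 'finding', 'relevant', 'documents']
```

Note that "document" and "documents" stay separate terms here because the sketch does no stemming.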
Indexing
The system builds a vocabulary and records which documents contain each term.
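One common way to record which documents contain each term is an inverted index. The sketch below assumes the documents are already tokenised; the function name and the toy data are illustrative, not taken from the demo.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each term to the set of document names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, tokens in docs.items():
        for token in tokens:
            index[token].add(name)
    return dict(index)

# Toy, pre-tokenised documents (illustrative, not the demo's data).
docs = {
    "A": ["retrieval", "index", "query"],
    "B": ["retrieval", "context", "answer"],
}
index = build_inverted_index(docs)
vocabulary = sorted(index)  # the vocabulary is the set of indexed terms
print(vocabulary)                  # ['answer', 'context', 'index', 'query', 'retrieval']
print(sorted(index["retrieval"]))  # ['A', 'B']
```

The size of each term's document set is its document frequency, which the weighting step uses.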
TF-IDF weighting
A term receives a higher weight when it is frequent in one document but rare across the corpus.
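A sketch of the weighting, assuming TF is a term's count divided by the document length and IDF is the smoothed form ln((1 + N) / (1 + df)) + 1. Both choices are assumptions; other TF-IDF variants exist, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return one sparse {term: weight} vector per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            # TF (relative frequency) times the assumed smoothed IDF.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors

# Toy, pre-tokenised documents (illustrative, not the demo's data).
docs = {
    "A": ["retrieval", "index"],
    "B": ["retrieval", "context"],
}
weights = tfidf_vectors(docs)
print(weights["A"]["index"] > weights["A"]["retrieval"])  # True
```

In document A both terms occur once, but "index" outweighs "retrieval" because "retrieval" also appears in document B.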
Similarity scoring
The query vector is compared with each document vector using cosine similarity.
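Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch over sparse {term: weight} dictionaries follows; the representation and the function name are assumptions, not the demo's code.

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

query = {"retrieval": 1.0, "context": 1.0}
doc = {"retrieval": 2.0, "context": 2.0}
print(round(cosine_similarity(query, doc), 3))    # 1.0: same direction, different length
print(cosine_similarity(query, {"labels": 1.0}))  # 0.0: no shared terms
```

Because the score depends on direction rather than length, a long document is not favoured just for containing more words.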
Ranking
Documents are sorted so the most relevant result appears first.
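Ranking can be sketched as a descending sort on the similarity scores. The `rank` helper, the document names, and the scores below are illustrative values, not output from the demo.

```python
from typing import Optional

def rank(scores: dict[str, float], top_k: Optional[int] = None) -> list[tuple[str, float]]:
    """Sort (document, score) pairs by score, highest first; ties keep input order."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Illustrative scores only.
scores = {"doc_a": 0.12, "doc_b": 0.27, "doc_c": 0.0}
print(rank(scores, top_k=2))  # [('doc_b', 0.27), ('doc_a', 0.12)]
```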
Inspection
Students inspect matching terms, scores, and calculations to understand why a result ranked highly.
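One way to see why a result ranked highly is to break the dot product into per-term contributions. The sketch below assumes sparse {term: weight} vectors; `term_contributions` and the weights are hypothetical.

```python
def term_contributions(query: dict[str, float], doc: dict[str, float]) -> list[tuple[str, float]]:
    """List each shared term with its share of the dot product, largest first."""
    shared = {term: query[term] * doc[term] for term in query.keys() & doc.keys()}
    return sorted(shared.items(), key=lambda item: item[1], reverse=True)

# Illustrative weights only.
query = {"retrieval": 0.5, "context": 0.7}
doc = {"retrieval": 0.2, "context": 0.4, "answer": 0.3}
for term, contribution in term_contributions(query, doc):
    print(f"{term}: {contribution:.2f}")
```

Terms that appear in only one of the two vectors, such as "answer" here, contribute nothing to the score.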
Retrieval for AI
Modern AI systems often retrieve context before answering questions or generating responses.
What document retrieval means
Document retrieval is the process of finding the most relevant documents for a query. Search engines, library systems, research databases, and retrieval augmented generation systems all depend on retrieval.
Why TF-IDF works
TF-IDF rewards terms that appear often in one document but not everywhere. This helps the system find distinctive words such as transformer, Android, genomics, or validation.
Learning outcomes
- Understand query-document matching.
- Read TF, IDF, TF-IDF, and cosine scores.
- Explain why a document ranked first.
- Connect retrieval to search and RAG systems.
Limitations