MuhammadLab
AI search and NLP

Document Retrieval Interactive Demo

Build a miniature search engine in the browser. Edit the document collection, ask a query, and see how retrieval ranks the most relevant documents using TF-IDF and cosine similarity.

Browser-based · TF-IDF · Cosine similarity · Search ranking · Student demo

Top retrieved document: Retrieval Augmented Generation (27% similarity for the current query)

Documents: 4
Terms: 65
Query tokens: 6

1. Enter the query

The query is converted into a vector and compared with each document.

Query words: 6
Corpus words: 75
Vocabulary: 65

2. Edit the document collection

Each document becomes a searchable vector.


Ranked retrieval results

Higher cosine similarity means the query vector points in a similar direction to the document vector.
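
A minimal sketch of that comparison, assuming vectors are plain lists of term weights. The function name and the example vectors are illustrative and are not taken from the demo's own code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical three-term vocabulary; each position is one term's weight.
query = [1.0, 1.0, 0.0]
doc_same_direction = [3.0, 3.0, 0.0]  # same direction, three times longer
doc_partial = [1.0, 0.0, 1.0]         # shares one term with the query
doc_unrelated = [0.0, 0.0, 2.0]       # shares no terms with the query

for name, doc in [("same direction", doc_same_direction),
                  ("partial overlap", doc_partial),
                  ("no overlap", doc_unrelated)]:
    print(name, round(cosine_similarity(query, doc), 3))
```

Because the score depends only on direction, a long document is not favoured over a short one merely for being longer.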

1. Retrieval Augmented Generation (27%)

Retrieval augmented generation first retrieves relevant context from a document collection. A language model can then use that context to answer questions with better grounding.

Matching terms: model, retrieves, relevant

2. Document Retrieval (21%)

Document retrieval is the process of finding the most relevant documents for a user query. A retrieval system preprocesses text, builds an index, scores query-document similarity, and ranks the results.

Matching terms: relevant, documents, query

3. Transformer Search (15%)

Modern AI search systems may use transformer embeddings to represent queries and documents. Similar meanings can be retrieved even when the exact words are different.

Matching terms: transformer, documents

4. Text Classification (4%)

Text classification assigns labels to text, such as topic, sentiment, intent, or toxicity. It predicts a category rather than returning a ranked list of documents.

Matching terms: documents

Document term table

The highest weighted terms explain what each document is mostly about.

Document Retrieval

Term            TF      IDF     TF-IDF
retrieval       0.1     1.511   0.151
builds          0.05    1.916   0.096
finding         0.05    1.916   0.096
index           0.05    1.916   0.096
most            0.05    1.916   0.096
preprocesses    0.05    1.916   0.096

Transformer Search

Term            TF      IDF     TF-IDF
ai              0.053   1.916   0.101
different       0.053   1.916   0.101
embeddings      0.053   1.916   0.101
even            0.053   1.916   0.101
exact           0.053   1.916   0.101
may             0.053   1.916   0.101

Text Classification

Term            TF      IDF     TF-IDF
text            0.111   1.511   0.168
assigns         0.056   1.916   0.106
category        0.056   1.916   0.106
classification  0.056   1.916   0.106
intent          0.056   1.916   0.106
labels          0.056   1.916   0.106

Retrieval Augmented Generation

Term            TF      IDF     TF-IDF
context         0.111   1.916   0.213
answer          0.056   1.916   0.106
augmented       0.056   1.916   0.106
better          0.056   1.916   0.106
collection      0.056   1.916   0.106
first           0.056   1.916   0.106
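
The weights in the document term table can be reproduced with a few lines of arithmetic. The sketch below assumes TF is the term count divided by the document length, and that IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1. The page does not state either formula; both are inferred from the displayed values, which they match for N = 4 documents. The counts, document lengths, and document frequencies in `rows` are likewise inferred, not read from the demo's code.

```python
import math

N_DOCS = 4  # the demo reports 4 documents

def smoothed_idf(doc_freq, n_docs=N_DOCS):
    # Assumed formula, inferred from the displayed IDF values (1.916 and 1.511).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tf(count, doc_length):
    # Assumed definition: share of the document's tokens taken up by this term.
    return count / doc_length

# (term, count in document, document length, documents containing the term)
# All four numbers per row are assumptions chosen to match the table.
rows = [
    ("retrieval", 2, 20, 2),  # from "Document Retrieval"
    ("builds", 1, 20, 1),     # from "Document Retrieval"
    ("context", 2, 18, 1),    # from "Retrieval Augmented Generation"
]

for term, count, length, df in rows:
    weight = tf(count, length) * smoothed_idf(df)
    print(f"{term}: TF {tf(count, length):.3f} x IDF {smoothed_idf(df):.3f} = {weight:.3f}")
```

Under these assumptions a term found in only one of the four documents gets IDF 1.916, while a term found in two gets 1.511, which is why "context" outweighs "retrieval" despite a similar TF.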

1. Text input

A small document collection and a user query are entered.

2. Tokenisation

Text is split into searchable terms, and common stop words can be removed.

3. Indexing

The system builds a vocabulary and records which documents contain each term.

4. TF-IDF weighting

Terms become stronger when they are frequent in one document but rare across the corpus.

5. Similarity scoring

The query vector is compared with each document vector using cosine similarity.

6. Ranking

Documents are sorted so the most relevant result appears first.

7. Inspection

Students inspect matching terms, scores, and calculations to understand why a result ranked highly.

8. Retrieval for AI

Modern AI systems often retrieve context before answering questions or generating responses.
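
The first six steps can be sketched end to end in a short script. This is an illustration under stated assumptions, not the demo's actual implementation: the stop-word list, the tokeniser regex, the smoothed IDF formula, and the three sample documents are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; the demo's own list is not shown on the page.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "for", "to", "in"}

def tokenise(text):
    """Step 2: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs):
    """Step 3: tokenise each document and count how many documents contain each term."""
    tokenised = [tokenise(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    return tokenised, doc_freq

def tfidf_vector(tokens, doc_freq, n_docs):
    """Step 4: sparse TF-IDF vector as {term: weight}, using an assumed smoothed IDF."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {
        term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
        if term in doc_freq  # query terms absent from the corpus are ignored
    }

def cosine(a, b):
    """Step 5: cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Step 6: return (score, document index) pairs, best match first."""
    tokenised, doc_freq = build_index(docs)
    n_docs = len(docs)
    query_vec = tfidf_vector(tokenise(query), doc_freq, n_docs)
    scores = [(cosine(query_vec, tfidf_vector(tokens, doc_freq, n_docs)), i)
              for i, tokens in enumerate(tokenised)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Step 1: a hypothetical collection and query.
docs = [
    "Document retrieval finds the most relevant documents for a query.",
    "Text classification assigns labels such as topic or sentiment.",
    "Transformer embeddings represent queries and documents.",
]
for score, i in rank("relevant documents for a query", docs):
    print(f"{score:.2f}  doc {i}")
```

With this sample data the first document ranks highest because it shares three terms with the query, the third ranks second on the single shared term "documents", and the second scores zero. Note that "queries" does not match "query" here, since the sketch does no stemming.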

What document retrieval means

Document retrieval is the process of finding the most relevant documents for a query. Search engines, library systems, research databases, and retrieval augmented generation systems all depend on retrieval.

Why TF-IDF works

TF-IDF rewards terms that appear often in one document but not everywhere. This helps the system find distinctive words such as transformer, Android, genomics, or validation.

Learning outcomes

  • Understand query-document matching.
  • Read TF, IDF, TF-IDF, and cosine scores.
  • Explain why a document ranked first.
  • Connect retrieval to search and RAG systems.

Limitations

What students should remember

  • Classic keyword retrieval can miss documents that use different words with similar meaning.
  • A high similarity score means strong term overlap, not guaranteed truth or quality.
  • Stop-word removal and tokenisation choices can change rankings.
  • Modern semantic retrieval often uses embeddings, but the same ranking idea still matters.
  • Retrieval systems can reflect bias in the document collection and ranking method.
  • For academic, legal, medical, or forensic work, retrieved documents should be verified manually.
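
The first limitation is easy to see in a small experiment. The sketch below uses cosine similarity over raw term counts, a simplification of the TF-IDF weighting described on this page; the function name, query, and example texts are hypothetical.

```python
import math
from collections import Counter

def term_cosine(text_a, text_b):
    """Cosine similarity over raw term counts (no weighting, stemming, or synonyms)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = "car repair"
same_words = "car repair guide"
same_meaning = "automobile maintenance manual"

print(round(term_cosine(query, same_words), 3))
print(round(term_cosine(query, same_meaning), 3))
```

The second text is about the same topic but shares no terms with the query, so it scores zero. An embedding-based retriever is one way to close that gap.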