AI search and NLP

Document Retrieval Interactive Demo

Build a miniature search engine in the browser. Edit the document collection, enter a query, and see how TF-IDF and cosine similarity rank the most relevant documents.

Browser-based, TF-IDF, Cosine similarity, Search ranking, Student demo

Top result: Retrieval Augmented Generation, with 27% similarity for the current query.

1. Enter the query

The query is converted into a vector and compared with each document.

Remove common stop words: removes words such as "the", "and", and "is" so that the important terms carry more weight.

2. Edit the document collection

Each document becomes a searchable vector.

Ranked retrieval results

Higher cosine similarity means the query vector points in a similar direction to the document vector.

Retrieval Augmented Generation

Retrieval augmented generation first retrieves relevant context from a document collection. A language model can then use that context to answer questions with better grounding.

Score: 27%

Matching terms: model, retrieves, relevant

Document Retrieval

Document retrieval is the process of finding the most relevant documents for a user query. A retrieval system preprocesses text, builds an index, scores query-document similarity, and ranks the results.

Score: 21%

Matching terms: relevant, documents, query

Transformer Search

Modern AI search systems may use transformer embeddings to represent queries and documents. Similar meanings can be retrieved even when the exact words are different.

Score: 15%

Matching terms: transformer, documents

Text Classification

Text classification assigns labels to text, such as topic, sentiment, intent, or toxicity. It predicts a category rather than returning a ranked list of documents.

Matching terms: documents

Calculation example

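For the top document (Retrieval Augmented Generation), the score is the cosine similarity between the query vector q and the document vector d: their dot product divided by the product of their lengths.

cosine(q, d) = dot(q, d) / (||q|| × ||d||)

0.0891 / (0.7162 × 0.4577) = 0.2719 (about 27%)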

A score near 1 means the query and document use strongly similar weighted terms. A score near 0 means weak similarity.
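The arithmetic above can be checked with a few lines of Python. This is only a sketch: the three numbers are the rounded values shown in the example, and the function name `cosine_from_parts` is an illustrative assumption, not part of the demo.

```python
def cosine_from_parts(dot: float, norm_q: float, norm_d: float) -> float:
    """Cosine similarity from a precomputed dot product and two vector lengths."""
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # a zero-length vector has no direction, so report no similarity
    return dot / (norm_q * norm_d)


# Rounded values taken from the calculation example.
score = cosine_from_parts(0.0891, 0.7162, 0.4577)
print(round(score, 3))  # 0.272
```

With the rounded inputs the result is about 0.272, which agrees with the displayed 0.2719 to within rounding; the demo presumably divides unrounded values.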

Query vector

Term	Count	IDF	Weight
transformer	1	1.916	0.319
model	1	1.916	0.319
retrieves	1	1.916	0.319
relevant	1	1.511	0.252
documents	1	1.223	0.204
query	1	1.916	0.319
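The weights in this table are consistent with TF × IDF, where TF is the term count divided by the number of query tokens (6 here). The sketch below recomputes them from the counts and IDF values in the table; that the demo computes TF exactly this way is an inference from the numbers, and the variable names are illustrative.

```python
import math

# Counts and IDF values copied from the query vector table.
query_counts = {"transformer": 1, "model": 1, "retrieves": 1,
                "relevant": 1, "documents": 1, "query": 1}
idf = {"transformer": 1.916, "model": 1.916, "retrieves": 1.916,
       "relevant": 1.511, "documents": 1.223, "query": 1.916}

total_tokens = sum(query_counts.values())  # 6 query tokens

# Weight = (count / total tokens) * IDF
weights = {term: (count / total_tokens) * idf[term]
           for term, count in query_counts.items()}

# Length of the query vector, used as ||q|| in the cosine formula.
norm_q = math.sqrt(sum(w * w for w in weights.values()))

for term, weight in weights.items():
    print(f"{term}\t{weight:.3f}")
print(f"||q|| = {norm_q:.3f}")  # about 0.716
```

The computed length is about 0.716, close to the 0.7162 used in the calculation example; the small gap comes from starting with IDF values rounded to three decimals.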

Document term table

The highest weighted terms explain what each document is mostly about.

Document Retrieval

Term	TF	IDF	Weight
retrieval	0.1	1.511	0.151
builds	0.05	1.916	0.096
finding	0.05	1.916	0.096
index	0.05	1.916	0.096
most	0.05	1.916	0.096
preprocesses	0.05	1.916	0.096

Transformer Search

Term	TF	IDF	Weight
(not shown)	0.053	1.916	0.101
different	0.053	1.916	0.101
embeddings	0.053	1.916	0.101
even	0.053	1.916	0.101
exact	0.053	1.916	0.101
may	0.053	1.916	0.101

Text Classification

Term	TF	IDF	Weight
text	0.111	1.511	0.168
assigns	0.056	1.916	0.106

Retrieval Augmented Generation

Term	TF	IDF	Weight
context	0.111	1.916	0.213
answer	0.056	1.916	0.106
augmented	0.056	1.916	0.106
better	0.056	1.916	0.106
collection	0.056	1.916	0.106
first	0.056	1.916	0.106

Text input

A small document collection and a user query are entered.

Tokenisation

Text is split into searchable terms, and common stop words can be removed.
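A minimal sketch of this step, assuming lowercase alphanumeric tokens and a tiny stop-word list. The demo's actual tokeniser and stop-word list are not shown on the page, so `STOP_WORDS` and `tokenise` are illustrative only.

```python
import re

# Illustrative stop-word list; the demo's real list is not shown.
STOP_WORDS = {"the", "and", "is", "a", "of", "to"}


def tokenise(text: str, remove_stop_words: bool = True) -> list[str]:
    """Lowercase the text, split it into alphanumeric terms, optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(tokenise("The query is converted into a vector."))
# ['query', 'converted', 'into', 'vector']
```

Changing the regular expression or the stop-word list changes which terms exist at all, which is one reason tokenisation choices can shift rankings.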

Indexing

The system builds a vocabulary and records which documents contain each term.
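One common way to record which documents contain each term is an inverted index. The sketch below assumes documents are already tokenised; `build_index` and the two toy documents are hypothetical and not taken from the demo.

```python
from collections import defaultdict


def build_index(docs: dict[str, list[str]]) -> tuple[list[str], dict[str, set[str]]]:
    """Return a sorted vocabulary and an inverted index (term -> names of documents containing it)."""
    inverted: dict[str, set[str]] = defaultdict(set)
    for name, tokens in docs.items():
        for term in tokens:
            inverted[term].add(name)
    vocabulary = sorted(inverted)
    return vocabulary, dict(inverted)


# Toy, already-tokenised documents for illustration.
docs = {
    "doc_a": ["retrieval", "ranks", "documents"],
    "doc_b": ["classification", "labels", "documents"],
}
vocabulary, inverted = build_index(docs)
print(vocabulary)                     # ['classification', 'documents', 'labels', 'ranks', 'retrieval']
print(sorted(inverted["documents"]))  # ['doc_a', 'doc_b']
```

The number of documents listed for a term is its document frequency, which is the input the IDF calculation needs.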

TF-IDF weighting

Terms become stronger when they are frequent in one document but rare across the corpus.
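The IDF values displayed in this demo (1.916, 1.511, 1.223) are consistent with the smoothed formula ln((1 + N) / (1 + df)) + 1 for N = 4 documents and document frequencies df = 1, 2, 3. That the demo uses exactly this formula is an inference from the numbers, not something the page states; the function names below are illustrative.

```python
import math
from collections import Counter


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tf_idf(tokens: list[str], doc_freq: dict[str, int], n_docs: int) -> dict[str, float]:
    """Weight each term by (count / document length) * smoothed IDF."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: (count / total) * smoothed_idf(n_docs, doc_freq.get(term, 0))
            for term, count in counts.items()}


# Reproduce the three IDF values shown in the demo (4 documents assumed).
for df in (1, 2, 3):
    print(df, round(smoothed_idf(4, df), 3))  # 1.916, 1.511, 1.223

# A term with TF 0.1 that appears in 2 of 4 documents.
print(round(0.1 * smoothed_idf(4, 2), 3))  # 0.151
```

The last line matches the weight shown for "retrieval" in the Document Retrieval term table (TF 0.1, IDF 1.511, weight 0.151).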

Similarity scoring

The query vector is compared with each document vector using cosine similarity.
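Because most terms are absent from any one document, the vectors can be stored as sparse dictionaries of term weights. The following is a sketch of cosine similarity over that representation; the function name and the toy vectors are assumptions for illustration.

```python
import math


def cosine(q: dict[str, float], d: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * d[term] for term, weight in q.items() if term in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)


print(cosine({"a": 1.0}, {"a": 2.0}))  # 1.0 (same direction, different length)
print(cosine({"a": 1.0}, {"b": 1.0}))  # 0.0 (no shared terms)
```

Dividing by the vector lengths means a long document does not win simply by containing more words.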

Ranking

Documents are sorted so the most relevant result appears first.

Inspection

Students inspect matching terms, scores, and calculations to understand why a result ranked highly.

Retrieval for AI

Modern AI systems often retrieve context before answering questions or generating responses.
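As a rough illustration of how retrieved context might be passed on, the sketch below assembles a prompt from the top-ranked documents. It does not call any language model; `build_prompt`, the prompt wording, and the sample documents are all hypothetical.

```python
def build_prompt(question: str, ranked_docs: list[tuple[str, str]], top_k: int = 2) -> str:
    """Join the top-k retrieved (title, text) pairs into a context block followed by the question."""
    context = "\n\n".join(f"[{title}]\n{text}" for title, text in ranked_docs[:top_k])
    return (
        "Use the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved documents, already sorted by similarity.
retrieved = [
    ("Doc A", "alpha text"),
    ("Doc B", "beta text"),
    ("Doc C", "gamma text"),
]
prompt = build_prompt("What is retrieval?", retrieved, top_k=2)
print(prompt)
```

Only the two highest-ranked documents reach the prompt, so the quality of the ranking step directly limits what the model can ground its answer in.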

What document retrieval means

Document retrieval is the process of finding the most relevant documents for a query. Search engines, library systems, research databases, and retrieval augmented generation systems all depend on retrieval.

Why TF-IDF works

TF-IDF rewards terms that appear often in one document but not everywhere. This helps the system find distinctive words such as transformer, Android, genomics, or validation.

Learning outcomes

Understand query-document matching.
Read TF, IDF, TF-IDF, and cosine scores.
Explain why a document ranked first.
Connect retrieval to search and RAG systems.

Limitations

What students should remember

Classic keyword retrieval can miss documents that use different words with similar meaning.

A high similarity score means strong term overlap, not guaranteed truth or quality.

Stop-word removal and tokenisation choices can change rankings.

Modern semantic retrieval often uses embeddings instead of term weights, but the same idea of ranking documents by vector similarity still applies.

Retrieval systems can reflect bias in the document collection and ranking method.

For academic, legal, medical, or forensic work, retrieved documents should be verified manually.
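The first limitation above, missing documents that use different words for the same idea, can be seen in a toy example. The helper `shared_terms` and the sample strings are illustrative assumptions; with no shared terms the dot product is zero, so keyword-based cosine similarity is zero as well.

```python
def shared_terms(query: str, document: str) -> set[str]:
    """Return the lowercase whitespace-separated terms that appear in both texts."""
    return set(query.lower().split()) & set(document.lower().split())


# Same meaning, different words: nothing overlaps, so the keyword score would be 0.
print(shared_terms("car repair", "automobile maintenance manual"))  # set()

# One shared word is enough to give a non-zero keyword score.
print(shared_terms("car repair", "car maintenance manual"))  # {'car'}
```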