MuhammadLab

NLP Text Preprocessing Tool

Paste research text, lecture notes, logs, or short documents and watch the full preprocessing pipeline unfold step by step: normalization, tokenization, stop words, stemming, lemmatization, n-grams, entities, and frequency analysis.

Uses compromise NLP. Local browser processing. Examples included.

Current text: NLP lecture example

Sentences: 2
Raw tokens: 26
Useful tokens: 23

Everything runs in your browser. Text is not uploaded to MuhammadLab servers.

Input

Load or paste text

Best for short teaching examples, abstracts, case notes, CSV snippets, log excerpts, and classroom demos.

204 / 12,000 characters.

Characters: 204
Sentences: 2
Raw tokens: 26
Useful tokens: 23
Unique raw terms: 25
Unique useful terms: 22
Entities: 0
Bigrams: 22

1. Original text

The raw text before any transformation. In real NLP work, this might come from notes, reports, logs, abstracts, or exported records.

After

Natural Language Processing helps computers understand human language. Students often clean text by lowercasing, removing stop words, generating n-grams, and building features for machine learning models.

2. Sentence segmentation

Split text into sentence-level units so downstream models can analyze smaller chunks.

Natural Language Processing helps computers understand human language.

Students often clean text by lowercasing, removing stop words, generating n-grams, and building features for machine learning models.

2 sentences detected with compromise.
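For readers who want to try this step outside the browser, here is a minimal Python sketch of sentence segmentation. It is not the tool's implementation (the tool uses compromise); the function name split_sentences and the regular-expression rule are assumptions chosen for illustration.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule: split after ., !, or ? when the next non-space
    # character is an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

text = (
    "Natural Language Processing helps computers understand human language. "
    "Students often clean text by lowercasing, removing stop words, "
    "generating n-grams, and building features for machine learning models."
)

sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

A rule this simple would wrongly split after abbreviations such as "Dr. Smith", which is one reason real tools rely on a dedicated NLP library.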

3. Lowercasing and whitespace normalization

Lowercasing reduces duplicate vocabulary such as Model/model/MODEL. Whitespace cleanup prevents accidental empty tokens.

Before

Natural Language Processing helps computers understand human language. Students often clean text by lowercasing, removing stop words, generating n-grams, and building features for machine learning models.

After

natural language processing helps computers understand human language. students often clean text by lowercasing, removing stop words, generating n-grams, and building features for machine learning models.
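The same normalization can be sketched in a few lines of Python. This is an illustrative assumption about how such a step might be written, not the tool's own code; normalize is a hypothetical helper name.

```python
import re

def normalize(text: str) -> str:
    # Lowercase first, then collapse every run of whitespace
    # (spaces, tabs, newlines) into a single space and trim the ends.
    return re.sub(r"\s+", " ", text.lower()).strip()

print(normalize("  Model   model\tMODEL \n"))
```

After this step the three spellings of "model" collapse into one vocabulary entry, and a later split on spaces cannot produce empty tokens.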

4. Punctuation removal

Remove punctuation for bag-of-words features while keeping word characters and simple apostrophes/hyphens.

Before

natural language processing helps computers understand human language. students often clean text by lowercasing, removing stop words, generating n-grams, and building features for machine learning models.

After

natural language processing helps computers understand human language students often clean text by lowercasing removing stop words generating n-grams and building features for machine learning models
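A rough Python equivalent of this step is shown below. The character class is an assumption about what "word characters and simple apostrophes/hyphens" means in practice; strip_punctuation is a hypothetical name, and the tool's exact rule may differ.

```python
import re

def strip_punctuation(text: str) -> str:
    # Replace every character that is not a word character, whitespace,
    # apostrophe, or hyphen with a space.
    cleaned = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse the spaces left behind by removed punctuation.
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_punctuation("removing stop words, generating n-grams, and building models."))
```

Keeping the hyphen is what lets "n-grams" survive as a single term instead of breaking into "n" and "grams".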

5. Tokenization

Break the cleaned text into tokens. Tokens are the building blocks for counts, TF-IDF, classifiers, and many classic NLP models.

natural | language | processing | helps | computers | understand | human | language | students | often | clean | text | by | lowercasing | removing | stop | words | generating | n-grams | and | building | features | for | machine | learning | models

26 raw tokens.
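Once the text has been lowercased and stripped of punctuation, the simplest tokenizer is a whitespace split. The sketch below assumes that approach; it is an illustration in Python rather than the tool's own tokenizer.

```python
cleaned = (
    "natural language processing helps computers understand human language "
    "students often clean text by lowercasing removing stop words generating "
    "n-grams and building features for machine learning models"
)

# str.split() with no arguments splits on any run of whitespace
# and never yields empty strings.
tokens = cleaned.split()

print(len(tokens))       # number of raw tokens
print(len(set(tokens)))  # number of unique raw terms
```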

6. Stop word removal

Remove very common words so content-bearing terms become easier to see.

natural | language | processing | helps | computers | understand | human | language | students | often | clean | text | lowercasing | removing | stop | words | generating | n-grams | building | features | machine | learning | models

3 stop words removed.
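Stop word removal is a filter against a fixed word list. The list below is a tiny hypothetical one made up for this sketch; the tool's actual list is not shown on the page and is certainly longer.

```python
# Hypothetical mini stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "the", "to"}

tokens = (
    "natural language processing helps computers understand human language "
    "students often clean text by lowercasing removing stop words generating "
    "n-grams and building features for machine learning models"
).split()

# Keep only the tokens that are not in the stop list.
useful = [token for token in tokens if token not in STOP_WORDS]
removed = len(tokens) - len(useful)

print(removed, "stop words removed;", len(useful), "useful tokens remain")
```

With this list, the tokens dropped from the example are "by", "and", and "for".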

7. Stemming

Reduce words to approximate roots using suffix rules. This is fast and explainable, but it can produce non-dictionary roots such as "remov" and "generat".

Before

natural language processing helps computers understand human language students often clean text lowercasing removing stop words generating n-grams building features machine learning models

After

natural language process help computer understand human language student often clean text lowercas remov stop word generat n-gram build feature machine learn model
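To make the idea of suffix rules concrete, here is a deliberately tiny Python stemmer with two rules. It is a toy written for this example, not the tool's stemmer and not the Porter algorithm; the function name stem and the length thresholds are assumptions.

```python
def stem(token: str) -> str:
    # Rule 1: strip a trailing "ing" from longer words.
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    # Rule 2: strip a plural or verb "s", but leave "ss" endings alone.
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

useful = (
    "natural language processing helps computers understand human language "
    "students often clean text lowercasing removing stop words generating "
    "n-grams building features machine learning models"
).split()

stems = [stem(token) for token in useful]
print(" ".join(stems))
```

The rules never consult a dictionary, which is why "removing" ends up as "remov" rather than "remove".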

8. Lemmatization

Convert words to their dictionary forms (lemmas), so "removing" becomes "remove" rather than the stem "remov". This tool uses a small teaching dictionary plus compromise noun/verb normalization.

Before

natural language processing helps computers understand human language students often clean text lowercasing removing stop words generating n-grams building features machine learning models

After

natural language process help computer understand human language student often clean text lowercase remove stop word generate n-gram build feature machine learn model
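The dictionary half of this approach can be sketched as a plain lookup table. The entries below are hypothetical and cover only a few words from the example; the tool's teaching dictionary and its compromise-based normalization are not reproduced here.

```python
# Hypothetical lookup table mapping inflected forms to dictionary forms.
LEMMAS = {
    "helps": "help",
    "computers": "computer",
    "students": "student",
    "lowercasing": "lowercase",
    "removing": "remove",
    "generating": "generate",
    "words": "word",
    "models": "model",
}

def lemmatize(token: str) -> str:
    # Fall back to the token itself when it is not in the table.
    return LEMMAS.get(token, token)

for word in ["removing", "generating", "lowercasing", "language"]:
    print(word, "->", lemmatize(word))
```

Because every output comes from the table, the results are real words, at the cost of needing an entry (or a library rule) for each form.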

9. N-grams

Create adjacent word sequences. Bigrams and trigrams capture short context that single tokens miss.

Bigrams shown: natural language | language processing | processing helps | helps computers | computers understand | understand human | human language | language students | students often | often clean | clean text | text lowercasing | lowercasing removing | removing stop | stop words | words generating | generating n-grams | n-grams building

Trigrams shown: natural language processing | language processing helps | processing helps computers | helps computers understand | computers understand human | understand human language | human language students | language students often | students often clean | often clean text | clean text lowercasing | text lowercasing removing

22 bigrams and 21 trigrams generated.
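N-gram generation is a sliding window over the token list: a list of k tokens yields k - n + 1 n-grams. The Python sketch below assumes the n-grams are built from the stop-word-filtered tokens, which is consistent with the counts reported above; ngrams is a hypothetical helper name.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide a window of width n across the tokens; an input shorter
    # than n produces an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

useful = (
    "natural language processing helps computers understand human language "
    "students often clean text lowercasing removing stop words generating "
    "n-grams building features machine learning models"
).split()

bigrams = ngrams(useful, 2)
trigrams = ngrams(useful, 3)

print(len(bigrams), "bigrams, first:", bigrams[0])
print(len(trigrams), "trigrams, first:", trigrams[0])
```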

Teaching flow

Use preprocessing before TF-IDF, classifiers, embeddings, and search demos.

This page is a good first stop before the TF-IDF tool because students can see exactly how text becomes tokens and features before scoring, ranking, or model training.
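As a bridge from tokens to features, the simplest feature is a term count. The Python sketch below counts the useful tokens from the example with the standard library's Counter; it illustrates frequency analysis in general and is not taken from the tool's source.

```python
from collections import Counter

useful = (
    "natural language processing helps computers understand human language "
    "students often clean text lowercasing removing stop words generating "
    "n-grams building features machine learning models"
).split()

# Map each term to the number of times it occurs.
counts = Counter(useful)

print(len(counts), "unique useful terms")
print("language occurs", counts["language"], "times")
```

Counts like these are the raw term frequencies that a weighting scheme such as TF-IDF then rescales.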

How to use in class

Suggested lecture sequence

1. Start with the messy social text to show why preprocessing matters.

2. Switch to a research abstract and compare raw tokens with useful tokens.

3. Explain stemming versus lemmatization using the before/after panels.

4. Move into TF-IDF and show how cleaned tokens become ranking features.