Toxicity / Hate Speech Detection
Type a sentence and see how a pre-trained AI model flags potentially harmful language categories.
Interactive moderation lab
Input your text
This demo runs a pre-trained moderation model in your browser. It predicts harmful-language categories and applies a threshold to decide what gets flagged.
Ready to analyse
Note: the model does not “understand” meaning like a human. It estimates probabilities based on patterns it learned during training.
Safe examples
One click to try
These are safe, non-abusive examples so students can learn how thresholding affects decisions.
Model status
Idle
Loading the model...
Overall result
Run analysis to see results
Highest-risk
—
Category with the highest probability
Highest score
--
Threshold: 85%
Threshold
Confidence cut-off: 85%
At the current threshold, categories above this value are flagged.
Category results
What the model flagged
The table shows category probabilities and how they compare to your current threshold.
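As a rough illustration of how such a table could be put together, the sketch below sorts per-category scores, picks the highest one, and marks each row as flagged or not. The function name `summarise`, the category names, and the scores are hypothetical; only the 85% threshold and the "above this value" rule come from this page.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, bool]  # (category, probability, flagged?)


def summarise(scores: Dict[str, float], threshold: float = 0.85) -> Tuple[str, float, List[Row]]:
    """Return the top category, its score, and table rows sorted by score."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # A category is flagged only when its score is strictly above the threshold.
    rows = [(name, p, p > threshold) for name, p in ordered]
    top_name, top_score, _ = rows[0]
    return top_name, top_score, rows


# Made-up scores for illustration; a real model would produce these.
example_scores = {"toxicity": 0.91, "insult": 0.88, "threat": 0.03, "identity_attack": 0.12}

top_name, top_score, rows = summarise(example_scores)
print(f"Highest-risk: {top_name} ({top_score:.0%})")
for name, p, flagged in rows:
    print(f"{name:<16} {p:6.1%}  {'FLAGGED' if flagged else 'ok'}")
```

A score exactly equal to the threshold is not flagged here, which matches the wording "above this value"; a real tool might choose "at or above" instead.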
What each category means
Category meanings and why flags can happen
Different labels represent different kinds of potentially harmful language. The same sentence can trigger multiple categories.
False positives
Harmless text gets flagged
False negatives
Harmful text gets missed
Content moderation concepts
How decisions are made
Toxicity detection is a text classification task where an AI model predicts whether language may be harmful, abusive, insulting, threatening, or otherwise unsafe.
A threshold converts probability scores into decisions. If a category score is above the threshold, the tool flags that category. Changing the threshold changes how sensitive the moderation system is.
Lower thresholds can increase false positives, while higher thresholds can increase false negatives.
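The trade-off can be seen by counting both kinds of error at several thresholds. This is a minimal sketch on made-up data: the `EXAMPLES` list and the `count_errors` helper are hypothetical, and the scores are not outputs of the demo's model.

```python
from typing import List, Tuple

# (model score, is the text actually harmful?) pairs.
# Invented values for illustration only.
EXAMPLES: List[Tuple[float, bool]] = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.90, False), (0.70, False), (0.30, False), (0.10, False),
]


def count_errors(examples: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false positives, false negatives) at the given threshold."""
    # False positive: harmless text whose score is above the threshold.
    fp = sum(1 for score, harmful in examples if score > threshold and not harmful)
    # False negative: harmful text whose score is not above the threshold.
    fn = sum(1 for score, harmful in examples if score <= threshold and harmful)
    return fp, fn


for t in (0.25, 0.50, 0.85):
    fp, fn = count_errors(EXAMPLES, t)
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, raising the threshold reduces false positives and increases false negatives. Real data will not always move this cleanly, but the direction of the trade-off is the same.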
Step-by-step pipeline
How toxicity detection turns text into decisions
This is a learning view: it makes the flow of inference and thresholding explicit.
Text input
Type or paste a sentence. The demo uses your text only for local inference.
Tokenisation
The model converts text into tokens (small pieces) so it can work with language patterns.
Numerical representation
Each token is mapped to a numeric ID, which the model then turns into a vector in its learned representation space.
Model analyses patterns
The model estimates how likely the text matches harmful-language patterns it learned during training.
Category probabilities
For each harmful category, the model outputs a probability score.
Threshold is applied
Your chosen threshold converts probability scores into flagged/not-flagged decisions.
Flagged categories
Categories above the threshold are shown as flagged in the dashboard.
Human review
In real moderation, humans check context before acting on AI flags.
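The steps above can be sketched end to end with a toy stand-in for the model. Everything here is an assumption made for illustration: the tiny `VOCAB`, the hand-picked `WEIGHTS` and `BIAS`, and the use of one sigmoid per category. A real pre-trained model uses a subword tokeniser and a neural network, and this page does not say which architecture the demo uses.

```python
import math
import re
from typing import Dict, List, Tuple

# Hypothetical vocabulary: token -> numeric ID (0 is reserved for unknown words).
VOCAB: Dict[str, int] = {"<unk>": 0, "you": 1, "are": 2, "awful": 3, "kind": 4, "thanks": 5}

# Hand-picked toy weights per category, indexed by token ID. Not learned from data.
WEIGHTS: Dict[str, List[float]] = {
    "insult": [0.0, 0.5, 0.0, 4.0, -2.0, -2.0],
    "threat": [0.0, 0.0, 0.0, 0.5, -1.0, -1.0],
}
BIAS: Dict[str, float] = {"insult": -2.0, "threat": -4.0}


def tokenise(text: str) -> List[str]:
    """Step 2: split lowercased text into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def encode(tokens: List[str]) -> List[int]:
    """Step 3: map tokens to numeric IDs, using 0 for unknown tokens."""
    return [VOCAB.get(tok, 0) for tok in tokens]


def score(ids: List[int]) -> Dict[str, float]:
    """Steps 4-5: turn IDs into one independent probability per category."""
    probs = {}
    for category, weights in WEIGHTS.items():
        logit = BIAS[category] + sum(weights[i] for i in ids)
        probs[category] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return probs


def moderate(text: str, threshold: float = 0.85) -> Tuple[Dict[str, float], List[str]]:
    """Steps 1-7: text in, probabilities and flagged categories out."""
    probs = score(encode(tokenise(text)))
    flagged = [c for c, p in probs.items() if p > threshold]
    # Step 8 (human review) is deliberately not automated here.
    return probs, flagged


for sentence in ("You are awful", "Thanks, you are kind"):
    probs, flagged = moderate(sentence)
    print(sentence, "->", {c: round(p, 3) for c, p in probs.items()}, "flagged:", flagged)
```

Because each category has its own sigmoid in this sketch, several categories can be flagged for the same sentence, or none at all.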
How toxicity detection works
From labels to flags (and why context matters)
The model has been trained on labelled text examples. It predicts probabilities for harmful-language categories based on language patterns it learned during training.
The model does not truly understand intent, humour, sarcasm, or social context like a human. That’s why human review is important in real moderation systems.
Bias & fairness
Responsible AI, not “automatic truth”
Models can reflect bias in their training data. That means moderation outputs can vary across dialects, identities, or discussion contexts.
Why bias can appear
- Training examples may be unevenly labelled across communities.
- Reclaimed language and cultural context can be misunderstood.
- Dialect, slang, and writing style can change probabilities.
- Humour, sarcasm, and quotations can look like toxicity.
What students should do
- Treat scores as signals, not verdicts.
- Test across varied examples.
- Prefer human review for important decisions.
- Look for false positives and false negatives.
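One way to act on "test across varied examples" is to compare false positive rates between groups of test sentences. The sketch below assumes you have hand-labelled records of the form (group, was it flagged, was it actually harmful). The group names and records are invented for illustration and say nothing about how the demo's model behaves.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (group, flagged by model, actually harmful)


def false_positive_rate_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """For each group, the share of harmless texts that were flagged anyway."""
    harmless = defaultdict(int)
    flagged_harmless = defaultdict(int)
    for group, flagged, harmful in records:
        if harmful:
            continue  # false positives are only defined on harmless text
        harmless[group] += 1
        if flagged:
            flagged_harmless[group] += 1
    return {group: flagged_harmless[group] / n for group, n in harmless.items()}


# Invented records: two hypothetical writing styles, four harmless texts each.
RECORDS = [
    ("dialect_a", False, False), ("dialect_a", False, False),
    ("dialect_a", False, False), ("dialect_a", True, False),
    ("dialect_a", True, True),
    ("dialect_b", True, False), ("dialect_b", True, False),
    ("dialect_b", False, False), ("dialect_b", False, False),
    ("dialect_b", True, True),
]

for group, rate in false_positive_rate_by_group(RECORDS).items():
    print(f"{group}: false positive rate = {rate:.0%}")
```

A large gap between groups is a signal worth investigating, not proof of bias on its own; small samples like this one can produce gaps by chance.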
Human review
Context still matters
Moderation AI can help prioritise review, but it should not automatically punish users without context.
Humans can consider
- Intent and purpose (education, reporting, debate).
- Quote context (whether language is being discussed or directed at someone).
- Satire and cultural meaning.
- Whether the message is asking for help or encouraging harm.
Treat AI flags as decision support
In this demo, the model output is a learning tool. In real deployments, human oversight is required for safety, fairness, and accountability.
Limitations & ethics
Use this demo for learning only
- The model can make mistakes. Confidence is not the same as truth.
- Sarcasm, humour, slang, and context are difficult for models.
- Models can be biased based on training data and labelling practices.
- Harmless educational or quoted content can still be flagged.
- Some harmful text can be missed.
- Do not use this demo for real disciplinary, legal, workplace, academic, or safety-critical decisions.
- This demo is for teaching and understanding thresholds, not final judgment.
Student learning outcomes
What you will learn
- Understand toxicity detection as text classification.
- Understand moderation categories (toxicity, insults, threats, identity attacks, etc.).
- Interpret probability scores and threshold decisions.
- Recognise false positives and false negatives.
- Understand bias and fairness limitations.
- Learn why human review still matters.
- See how pre-trained models can run in the browser.
Quick reminder
The model performs inference only. It does not learn from your input, and your text is not uploaded to a server.