MuhammadLab
AI Tools · Browser-based · Text classification · Content moderation · Thresholds · Responsible AI

Toxicity / Hate Speech Detection

Type a sentence and see how a pre-trained AI model scores it against categories of potentially harmful language.

Interactive moderation lab

Input your text

This demo runs a pre-trained moderation model in your browser. It predicts harmful-language categories and applies a threshold to decide what gets flagged.

Note: the model does not “understand” meaning the way a human does. It estimates probabilities from patterns it learned during training.

Safe examples

One click to try

These are safe, non-abusive examples so students can learn how thresholding affects decisions.

Overall result

Highest-risk

Category with the highest probability

Highest score

Threshold: 85%

The overall decision is based on whether any category score is above the threshold. This is decision support, not a replacement for human context.
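As a minimal sketch of the rule described above, the snippet below picks the highest-risk category and sets the overall result to "flagged" when any category score is above the threshold. The function name `summarise` and the example scores are hypothetical; they are not taken from the demo's code or from real model output.

```python
from typing import Dict, Tuple


def summarise(scores: Dict[str, float], threshold: float) -> Tuple[str, float, bool]:
    """Return (highest-risk category, its score, overall flagged?)."""
    # Highest-risk: the category with the highest probability.
    top_category = max(scores, key=lambda name: scores[name])
    top_score = scores[top_category]
    # Overall result: flagged if ANY category score is above the threshold.
    flagged = any(score > threshold for score in scores.values())
    return top_category, top_score, flagged


# Hypothetical scores for one sentence (invented for illustration).
example_scores = {
    "toxicity": 0.91,
    "insult": 0.62,
    "threat": 0.03,
    "identity_attack": 0.01,
}

print(summarise(example_scores, threshold=0.85))
```

Because the overall result only needs one category above the cut-off, it is equivalent to comparing the highest score with the threshold.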

Threshold

Confidence cut-off: 85%

At the current threshold, categories above this value are flagged.

Slider range: more sensitive (lower threshold) to more strict (higher threshold).
Lower threshold = more flags (more false positives). Higher threshold = fewer flags (more false negatives).
Interpretation: Confidence scores are not truth. They reflect how strongly the model predicts a category. Move the threshold to see how sensitivity changes.
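To make the slider's effect concrete, here is a small sketch that applies several thresholds to the same set of scores. The scores are invented for illustration, and `flagged_categories` is a hypothetical helper, not part of the demo.

```python
from typing import Dict, List

# Hypothetical category scores for a single sentence (not real model output).
scores = {"toxicity": 0.88, "insult": 0.71, "threat": 0.40, "identity_attack": 0.05}


def flagged_categories(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the categories whose score is above the threshold."""
    return [name for name, p in scores.items() if p > threshold]


# The scores never change; only the cut-off moves.
for threshold in (0.30, 0.50, 0.85, 0.95):
    print(f"threshold {threshold:.2f}: {flagged_categories(scores, threshold)}")
```

The same sentence goes from three flagged categories at 0.30 to none at 0.95, which is why a threshold is a policy choice rather than a property of the text.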

Category results

What the model flagged

The table shows category probabilities and how they compare to your current threshold.

What each category means

Category meanings and why flags can happen

Different labels represent different kinds of potentially harmful language. The same sentence can trigger multiple categories.

False positives

Harmless text gets flagged

Harmless text can be flagged because the model only sees language patterns, not intent or context. For example, discussing harmful topics for research or education may still look “toxic” to a pattern-based classifier.

False negatives

Harmful text gets missed

Harmful language may not be flagged if it uses slang, new wording, sarcasm, or quoted text that doesn’t match patterns seen during training. A low-probability output does not guarantee safety.
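One way to see both error types at once is to count them on a small labelled set. The sketch below assumes a hypothetical list of (score, actually harmful?) pairs; the numbers are made up for illustration and do not come from the demo's model.

```python
from typing import Dict, List, Tuple

# Hypothetical (model score, is the text actually harmful?) pairs.
labelled: List[Tuple[float, bool]] = [
    (0.92, True),
    (0.88, False),  # e.g. quoted or educational text that looks "toxic"
    (0.40, True),   # e.g. slang or sarcasm the model has not seen
    (0.10, False),
    (0.70, True),
    (0.35, False),
]


def confusion_counts(examples: List[Tuple[float, bool]], threshold: float) -> Dict[str, int]:
    """Count true/false positives and negatives at a given threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, harmful in examples:
        flagged = score > threshold
        if flagged and harmful:
            counts["tp"] += 1
        elif flagged and not harmful:
            counts["fp"] += 1  # harmless text gets flagged
        elif not flagged and harmful:
            counts["fn"] += 1  # harmful text gets missed
        else:
            counts["tn"] += 1
    return counts


print(confusion_counts(labelled, threshold=0.85))
print(confusion_counts(labelled, threshold=0.30))
```

On this toy data, lowering the threshold from 0.85 to 0.30 removes the false negatives but adds a false positive, which is the trade-off described above.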

Content moderation concepts

How decisions are made

Toxicity detection is a text classification task where an AI model predicts whether language may be harmful, abusive, insulting, threatening, or otherwise unsafe.

A threshold converts probability scores into decisions. If a category score is above the threshold, the tool flags that category. Changing the threshold changes how sensitive the moderation system is.

Lower thresholds can increase false positives (harmless text flagged), while higher thresholds can increase false negatives (harmful text missed).

Step-by-step pipeline

How toxicity detection turns text into decisions

This is a learning view: it makes the flow of inference and thresholding explicit.

1. Text input
You type or paste a sentence. The demo uses your text only for local inference.

2. Tokenisation
The model converts text into tokens (small pieces) so it can work with language patterns.

3. Numerical representation
Tokens become numbers using the model’s learned representation space.

4. Model analyses patterns
The model estimates how likely the text matches harmful-language patterns it learned during training.

5. Category probabilities
For each harmful category, the model outputs a probability score.

6. Threshold is applied
Your chosen threshold converts probability scores into flagged/not-flagged decisions.

7. Flagged categories
Categories above the threshold are shown as flagged in the dashboard.

8. Human review
In real moderation, humans check context before acting on AI flags.
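The pipeline steps above can be sketched end to end with a deliberately tiny stand-in model. Everything here is an assumption made for illustration: the vocabulary, the weights, the placeholder token `badword`, and the bag-of-words scoring are invented, whereas a real pre-trained model would use a subword tokeniser and a neural network. Only the order of the steps mirrors the pipeline.

```python
import math
import re
from typing import Dict, List, Tuple

# Toy vocabulary and weights, invented for illustration only.
VOCAB = {"<unk>": 0, "you": 1, "are": 2, "badword": 3, "great": 4, "thanks": 5}
WEIGHTS = {
    "toxicity": [0.0, 0.2, 0.0, 3.0, -2.0, -2.0],
    "insult": [0.0, 0.5, 0.0, 3.5, -2.5, -2.0],
}
BIAS = {"toxicity": -1.5, "insult": -2.0}


def tokenise(text: str) -> List[str]:
    """Step 2: split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_ids(tokens: List[str]) -> List[int]:
    """Step 3: map tokens to integer ids; unknown words share one id."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokens]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(ids: List[int]) -> Dict[str, float]:
    """Steps 4-5: score each category independently, then squash to 0..1."""
    probs = {}
    for category, weights in WEIGHTS.items():
        logit = BIAS[category] + sum(weights[i] for i in ids)
        probs[category] = sigmoid(logit)
    return probs


def analyse(text: str, threshold: float = 0.85) -> Tuple[List[str], List[int], Dict[str, float], List[str]]:
    tokens = tokenise(text)
    ids = to_ids(tokens)
    probs = category_probabilities(ids)
    # Steps 6-7: the threshold turns probabilities into flags.
    flagged = [c for c, p in probs.items() if p > threshold]
    return tokens, ids, probs, flagged


print(analyse("You are badword"))
print(analyse("Thanks, you are great"))
```

Each category gets its own probability, so one sentence can cross the threshold for several categories, for one, or for none. With these toy weights, the first sentence lands just below 0.85 for toxicity but above it for insult. Step 8, human review, has no code equivalent here on purpose.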

How toxicity detection works

From labels to flags (and why context matters)

The model has been trained on labelled text examples. It predicts probabilities for harmful-language categories based on language patterns it learned during training.

The model does not understand intent, humour, sarcasm, or social context the way a human does. That’s why human review is important in real moderation systems.

Bias & fairness

Responsible AI, not “automatic truth”

Models can reflect bias in their training data. That means moderation outputs can vary across dialects, identities, or discussion contexts.

Why bias can appear

  • Training examples may be unevenly labelled across communities.
  • Reclaimed language and cultural context can be misunderstood.
  • Dialect, slang, and writing style can change probabilities.
  • Humour, sarcasm, and quotations can look like toxicity.

What students should do

  • Treat scores as signals, not verdicts.
  • Test across varied examples.
  • Prefer human review for important decisions.
  • Look for false positives and false negatives.
Criticism is not always toxicity. The goal of moderation systems should be harm reduction with careful oversight.
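A simple way to "test across varied examples" is to compare how often harmless text is flagged for different writing styles. The sketch below assumes a hypothetical set of harmless sentences that have already been scored; the group names `style_a` and `style_b` and all scores are invented, and say nothing about how the demo's model actually behaves.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical harmless sentences, grouped by writing style, with invented scores.
harmless_examples = [
    {"group": "style_a", "score": 0.12},
    {"group": "style_a", "score": 0.30},
    {"group": "style_a", "score": 0.90},
    {"group": "style_a", "score": 0.20},
    {"group": "style_b", "score": 0.88},
    {"group": "style_b", "score": 0.91},
    {"group": "style_b", "score": 0.40},
    {"group": "style_b", "score": 0.86},
]


def false_positive_rate_by_group(examples: List[dict], threshold: float) -> Dict[str, float]:
    """Share of harmless examples flagged, computed separately per group."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        total[example["group"]] += 1
        if example["score"] > threshold:
            flagged[example["group"]] += 1
    return {group: flagged[group] / total[group] for group in total}


print(false_positive_rate_by_group(harmless_examples, threshold=0.85))
```

A large gap between groups at the same threshold is a signal worth investigating, not proof of bias by itself; the examples and labels need checking too.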

Human review

Context still matters

Moderation AI can help prioritise review, but it should not automatically punish users without context.
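As a sketch of "prioritise review" rather than "punish automatically", the snippet below sorts flagged messages into a queue for a human moderator and takes no action itself. The `Message` class, the `review_queue` function, and the example scores are hypothetical and are not part of the demo.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    message_id: str
    scores: Dict[str, float]  # category name -> probability


def top_score(message: Message) -> float:
    return max(message.scores.values())


def review_queue(messages: List[Message], threshold: float) -> List[Message]:
    """Return flagged messages, highest score first, for human review."""
    flagged = [m for m in messages if top_score(m) > threshold]
    # Sorting only decides what a person looks at first; nothing is removed.
    return sorted(flagged, key=top_score, reverse=True)


# Hypothetical scored messages (invented for illustration).
inbox = [
    Message("m1", {"toxicity": 0.87, "threat": 0.10}),
    Message("m2", {"toxicity": 0.30, "threat": 0.02}),
    Message("m3", {"toxicity": 0.60, "threat": 0.97}),
]

for message in review_queue(inbox, threshold=0.85):
    print(message.message_id, top_score(message))
```

The output of the function is a list for a person to read, which keeps the final decision with someone who can weigh context.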

Humans can consider

  • Intent and purpose (education, reporting, debate).
  • Quote context (what is being discussed vs directed at someone).
  • Satire and cultural meaning.
  • Whether the message is asking for help or encouraging harm.

Treat this demo as decision support

The model output is a learning tool. In real deployments, human oversight is required for safety, fairness, and accountability.

Limitations & ethics

Use this demo for learning only

  • The model can make mistakes. Confidence is not the same as truth.
  • Sarcasm, humour, slang, and context are difficult for models.
  • Models can be biased based on training data and labelling practices.
  • Harmless educational or quoted content can still be flagged.
  • Some harmful text can be missed.
  • Do not use this demo for real disciplinary, legal, workplace, academic, or safety-critical decisions.
  • This demo is for teaching and understanding thresholds, not final judgment.

Student learning outcomes

What you will learn

  • Understand toxicity detection as text classification.
  • Understand moderation categories (toxicity, insults, threats, identity attacks, etc.).
  • Interpret probability scores and threshold decisions.
  • Recognise false positives and false negatives.
  • Understand bias and fairness limitations.
  • Learn why human review still matters.
  • See how pre-trained models can run in the browser.

Quick reminder

The model performs inference only. It does not learn from your input, and your text is not uploaded to a server.