Confusion Matrix and Model Evaluation Visualizer
Paste true labels and predicted labels, then compute confusion matrices and core evaluation metrics for computer vision classifiers.
Why evaluation matters
A computer vision model cannot be judged from a single prediction. Students need to compare many predictions against ground-truth labels to understand overall accuracy, the kinds of mistakes the model makes, and class-level weaknesses.
What this page measures
This visualizer calculates accuracy, precision, recall, F1-score, false positives, false negatives, and a confusion matrix so students can inspect model quality after prediction.
What to discuss
High accuracy can still hide poor performance on minority classes. The confusion matrix and per-class metrics reveal whether the model is mixing up specific categories.
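A small sketch can make this point concrete. The dataset below is hypothetical (95 "cat" images and 5 "bird" images, not the data shown on this page), and the "model" is a deliberately degenerate one that always predicts the majority class.

```python
# Hypothetical imbalanced dataset: 95 "cat" images and 5 "bird" images.
y_true = ["cat"] * 95 + ["bird"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["cat"] * 100

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for "bird": of the truly-bird samples, how many were predicted as bird?
bird_total = sum(t == "bird" for t in y_true)
bird_found = sum(t == "bird" and p == "bird" for t, p in zip(y_true, y_pred))
bird_recall = bird_found / bird_total

print(f"accuracy    = {accuracy:.3f}")
print(f"bird recall = {bird_recall:.3f}")
```

Accuracy looks strong here even though the model never finds a single bird, which is exactly the situation per-class metrics are meant to expose.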
Current counts: true = 10, predicted = 10
Confusion Matrix
| True \ Predicted | cat | dog | bird |
|---|---|---|---|
| cat | 2 | 1 | 0 |
| dog | 1 | 2 | 0 |
| bird | 1 | 0 | 3 |
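As a rough illustration of how such a matrix can be built, the sketch below counts (true, predicted) pairs. The page does not show the underlying label lists, so `y_true` and `y_pred` here are hypothetical lists chosen only to reproduce the matrix above, and `confusion_matrix` is an illustrative helper, not the visualizer's own code.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a nested list: rows are true classes, columns are predicted classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Count how often each (true, predicted) pair occurs.
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

labels = ["cat", "dog", "bird"]
# Hypothetical label lists (assumption): picked to match the matrix shown above.
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "cat", "dog", "dog", "cat", "bird", "bird", "bird"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Diagonal cells are correct predictions; every off-diagonal cell is a specific confusion, such as one true bird predicted as cat.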
Overall Metrics
| Metric | Value |
|---|---|
| Accuracy | 0.700 |
| Macro Precision | 0.722 |
| Macro Recall | 0.694 |
| Macro F1 | 0.698 |
| Weighted F1 | 0.714 |
| Samples | 10 |
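The overall figures can be recomputed from the confusion matrix with a short sketch like the one below. It assumes that macro averages are unweighted means of the per-class values, that weighted F1 weights each class by its support, and that an undefined ratio is reported as 0.0; `safe_div` is a hypothetical helper for that last convention.

```python
# Confusion matrix from this page: rows = true class, columns = predicted class.
matrix = [[2, 1, 0],
          [1, 2, 0],
          [1, 0, 3]]
n = len(matrix)
total = sum(sum(row) for row in matrix)

def safe_div(a, b):
    """Hypothetical helper: return 0.0 instead of failing on a zero denominator."""
    return a / b if b else 0.0

precisions, recalls, f1s, supports = [], [], [], []
for k in range(n):
    tp = matrix[k][k]
    predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
    support = sum(matrix[k])                           # row sum
    p = safe_div(tp, predicted_k)
    r = safe_div(tp, support)
    precisions.append(p)
    recalls.append(r)
    f1s.append(safe_div(2 * p * r, p + r))
    supports.append(support)

accuracy = sum(matrix[k][k] for k in range(n)) / total
macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n  # mean of per-class F1 scores
weighted_f1 = sum(f * s for f, s in zip(f1s, supports)) / total

print(f"{accuracy:.3f} {macro_precision:.3f} {macro_recall:.3f} "
      f"{macro_f1:.3f} {weighted_f1:.3f}")
```

Under these assumptions the sketch reproduces the values listed above. Weighted F1 comes out higher than macro F1 because the largest class (bird) is also the best-performing one.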
Error Totals
| Error type | Count |
|---|---|
| False positives | 3 |
| False negatives | 3 |
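These are one-vs-rest counts summed over all classes. A false positive is a sample predicted as a class it does not belong to; a false negative is a sample of a class that the model assigned to some other class. In single-label classification, each misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the two totals are always equal.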
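The one-vs-rest counting can be sketched as follows, working directly from the confusion matrix on this page. The variable names are illustrative only.

```python
labels = ["cat", "dog", "bird"]
# Rows = true class, columns = predicted class.
matrix = [[2, 1, 0],
          [1, 2, 0],
          [1, 0, 3]]
n = len(labels)

# False positives for class k: predicted as k (column k) but truly another class.
fp = {labels[k]: sum(matrix[i][k] for i in range(n) if i != k) for k in range(n)}
# False negatives for class k: truly k (row k) but predicted as another class.
fn = {labels[k]: sum(matrix[k][j] for j in range(n) if j != k) for k in range(n)}

total_fp = sum(fp.values())
total_fn = sum(fn.values())
# Misclassified samples are exactly the off-diagonal cells.
misclassified = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)

print("FP per class:", fp, "total:", total_fp)
print("FN per class:", fn, "total:", total_fn)
```

Both totals match the number of off-diagonal (misclassified) samples, since every error is a false positive for one class and a false negative for another.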
Per-Class Metrics
| Class | Precision | Recall | F1 | TP | FP | FN | Support |
|---|---|---|---|---|---|---|---|
| cat | 0.500 | 0.667 | 0.571 | 2 | 2 | 1 | 3 |
| dog | 0.667 | 0.667 | 0.667 | 2 | 1 | 1 | 3 |
| bird | 1.000 | 0.750 | 0.857 | 3 | 0 | 1 | 4 |
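Each row of this table follows from its TP, FP, and FN counts. The sketch below recomputes the three scores per class; `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumed convention rather than something this page states.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute (precision, recall, F1) for one class from one-vs-rest counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# (TP, FP, FN) per class, copied from the table above.
counts = {"cat": (2, 2, 1), "dog": (2, 1, 1), "bird": (3, 0, 1)}

rows = {
    label: tuple(round(value, 3) for value in precision_recall_f1(*c))
    for label, c in counts.items()
}
for label, (p, r, f1) in rows.items():
    print(f"{label:5s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Support does not enter these formulas directly; it equals TP + FN, the number of true samples of the class.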
How To Read It
- Accuracy is the fraction of all predictions that were correct, but it can hide weak performance on small classes.
- Precision answers: when the model predicts a class, how often is it right?
- Recall answers: when a class is truly present, how often does the model find it?
- F1-score is the harmonic mean of precision and recall, so it is high only when both are high. It is useful when students want one summary score per class.
- The confusion matrix shows exactly which labels are being mixed up, which is often the most useful diagnostic after model prediction.
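For reference, the definitions in the list above can be written as the standard one-vs-rest formulas for a class k, followed by a worked example using the "cat" counts from the per-class table (TP = 2, FP = 2, FN = 1). The symbols TP_k, FP_k, and FN_k are notation introduced here for illustration.

```latex
\begin{align*}
\text{Precision}_k &= \frac{TP_k}{TP_k + FP_k}, &
\text{Recall}_k &= \frac{TP_k}{TP_k + FN_k}, \\
F1_k &= \frac{2\,\text{Precision}_k\,\text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
      = \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}.
\end{align*}
% Worked example for "cat" (TP = 2, FP = 2, FN = 1):
\begin{align*}
\text{Precision}_{\text{cat}} &= \frac{2}{2 + 2} = 0.500, &
\text{Recall}_{\text{cat}} &= \frac{2}{2 + 1} \approx 0.667, &
F1_{\text{cat}} &= \frac{2 \cdot 2}{2 \cdot 2 + 2 + 1} = \frac{4}{7} \approx 0.571.
\end{align*}
```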