
PCA and Clustering Analysis Tool

Upload a CSV or use a teaching dataset, reduce many numeric features into PC1 and PC2, visualize the clusters, and inspect the actual PCA calculations students need to understand.

CSV upload, PCA calculations, K-means clusters

Current dataset: Teaching clusters

Rows: 12

Features: 4

Clusters: 3

PCA and clustering run locally in your browser. The CSV is not uploaded.

PCA plot

Reduced from 4 features to 2 principal components

Each point is one row from your dataset. The coordinates are PCA scores, and colors are k-means clusters computed on PC1 and PC2.

Legend: Cluster 1, Cluster 2, Cluster 3

PC1 variance: 61.76%

PC2 variance: 37.95%

PC1 + PC2: 99.71%

Numeric features

Step 1

Normalize every numeric feature

PCA is variance-based, so large-scale columns can dominate unless we standardize features first.

z = (x - mean) / standard deviation
feature_1: (2.1 - 5.95) / 2.8745 = -1.3394

Feature	Mean	Std dev
feature_1	5.95	2.8745
feature_2	3.5333	1.7333
feature_3	4.8417	2.9243
feature_4	2.65	1.6517
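
The standardization above can be reproduced with a few lines of Python. This is a minimal sketch, not the page's own code. It assumes the sample standard deviation (dividing by n - 1), which matches the 2.8745 shown for feature_1. The variable names are illustrative.

```python
import statistics

# feature_1 values for all 12 rows of the teaching dataset (A1..C4).
feature_1 = [2.1, 2.5, 2.2, 2.8, 6.2, 6.6, 5.9, 6.9, 9.1, 8.7, 9.5, 8.9]

mean = statistics.mean(feature_1)
# Assumption: sample standard deviation (n - 1 in the denominator).
std = statistics.stdev(feature_1)

# z = (x - mean) / standard deviation, applied to every value.
z_scores = [(x - mean) / std for x in feature_1]

print("mean:", round(mean, 4))
print("std dev:", round(std, 4))
print("z for first row:", round(z_scores[0], 4))
```

After standardization the column has mean 0 and sample standard deviation 1, so no feature dominates only because of its units.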

Step 2

Compute the covariance matrix

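The covariance matrix tells PCA which standardized features move together. Because every feature has been standardized, this matrix is also the correlation matrix, which is why each diagonal entry is 1.

covariance(feature i, feature j) = sum(z_i * z_j) / (n - 1)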

Feature	feature_1	feature_2	feature_3	feature_4
feature_1	1	0.057	0.997	0.48
feature_2	0.057	1	-0.004	0.892
feature_3	0.997	-0.004	1	0.423
feature_4	0.48	0.892	0.423	1
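
As a sketch of how this table can be recomputed, the snippet below standardizes the four teaching features and applies the formula above to every pair. It assumes the sample standard deviation (n - 1), and the helper name `standardize` is illustrative rather than taken from the page.

```python
import statistics

# The four numeric columns of the teaching dataset, rows A1..C4.
columns = {
    "feature_1": [2.1, 2.5, 2.2, 2.8, 6.2, 6.6, 5.9, 6.9, 9.1, 8.7, 9.5, 8.9],
    "feature_2": [2.4, 2.2, 2.8, 2.5, 5.8, 5.5, 6.1, 6.0, 2.1, 2.4, 1.9, 2.7],
    "feature_3": [1.2, 1.4, 1.1, 1.6, 4.9, 5.2, 4.7, 5.4, 8.2, 7.9, 8.5, 8.0],
    "feature_4": [0.8, 1.0, 0.7, 1.1, 4.6, 4.8, 4.4, 5.0, 2.2, 2.5, 2.0, 2.7],
}

def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]

names = list(columns)
z = {name: standardize(values) for name, values in columns.items()}
n = len(columns["feature_1"])

# covariance(i, j) = sum(z_i * z_j) / (n - 1) for every pair of features.
cov = [
    [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in names]
    for i in names
]

for name, row in zip(names, cov):
    print(name, [round(value, 3) for value in row])
```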

Step 3

Find eigenvalues and eigenvectors

Eigenvectors become principal component directions. Eigenvalues tell us how much variance each direction explains.

Component	Eigenvalue	Variance	Cumulative
PC1	2.4702	61.76%	61.76%
PC2	1.5181	37.95%	99.71%
PC3	0.011	0.27%	99.98%
PC4	0.0007	0.02%	100%
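
The Variance and Cumulative columns follow directly from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues. A short sketch using the rounded eigenvalues from the table (variable names are illustrative):

```python
# Eigenvalues as displayed in the table (rounded to four decimals).
eigenvalues = [2.4702, 1.5181, 0.011, 0.0007]

# For standardized data the total equals the number of features (4).
total = sum(eigenvalues)

# Share of variance explained by each component, in percent.
ratios = [100 * value / total for value in eigenvalues]

# Running total of explained variance.
cumulative = []
running = 0.0
for ratio in ratios:
    running += ratio
    cumulative.append(running)

for index, (ratio, cum) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{index}: {ratio:.2f}% (cumulative {cum:.2f}%)")
```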

PC1 loadings

feature_1: 0.5496
feature_2: 0.3501
feature_3: 0.5285
feature_4: 0.544
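
One way to check the PC1 eigenvalue and loadings by computer is power iteration: repeatedly multiply a vector by the covariance matrix and renormalize it. The page does not state which eigen-solver it uses, so this is a sketch of one possible technique, applied to the rounded matrix from Step 2. Small differences in the third or fourth decimal are expected.

```python
import math

# Covariance matrix as displayed in Step 2 (rounded to three decimals).
cov = [
    [1.0, 0.057, 0.997, 0.48],
    [0.057, 1.0, -0.004, 0.892],
    [0.997, -0.004, 1.0, 0.423],
    [0.48, 0.892, 0.423, 1.0],
]

def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]

# Start from an arbitrary non-zero vector and iterate toward the
# direction with the largest eigenvalue (PC1).
vector = normalize([1.0, 1.0, 1.0, 1.0])
for _ in range(200):
    vector = normalize(mat_vec(cov, vector))

# Rayleigh quotient v . (C v) gives the eigenvalue for a unit vector v.
eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(cov, vector)))

print("PC1 eigenvalue:", round(eigenvalue, 3))
print("PC1 loadings:", [round(v, 3) for v in vector])
```

Power iteration only finds the dominant direction. Recovering PC2 would need an extra step such as deflation, which is omitted here.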

Step 4

Project rows and cluster the PCA scores

Each standardized row is projected onto the principal component vectors by taking its dot product with each one. K-means then groups nearby rows in the PC1/PC2 space.

PC score = standardized row dot principal component vector
Sample 1 PC1 = -2.2326, PC2 = -0.201
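
The PC1 score for Sample 1 can be checked with the rounded means, standard deviations, and PC1 loadings shown in Steps 1 and 3. This is a sketch using the displayed values, so the result agrees with the page only to about three decimals. PC2 is not computed because the page does not list the PC2 loadings.

```python
# Row A1 of the teaching dataset (Sample 1).
row = [2.1, 2.4, 1.2, 0.8]

# Rounded values copied from the Step 1 and Step 3 tables.
means = [5.95, 3.5333, 4.8417, 2.65]
stds = [2.8745, 1.7333, 2.9243, 1.6517]
pc1_loadings = [0.5496, 0.3501, 0.5285, 0.544]

# Standardize the row, then take the dot product with the PC1 vector.
z_row = [(x - m) / s for x, m, s in zip(row, means, stds)]
pc1_score = sum(z * w for z, w in zip(z_row, pc1_loadings))

print("standardized row:", [round(z, 4) for z in z_row])
print("PC1 score:", round(pc1_score, 3))
```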

Sample	Label	Cluster	PC1	PC2
Sample 1	Cluster A	1	-2.2326	-0.201
Sample 2	Cluster A	1	-2.0945	-0.0858
Sample 3	Cluster A	1	-2.1837	-0.3329
Sample 4	Cluster A	1	-1.9075	-0.1544
Sample 5	Cluster B	2	1.1585	-1.3303
Sample 6	Cluster B	2	1.2945	-1.1607
Sample 7	Cluster B	2	1.0597	-1.4703
Sample 8	Cluster B	2	1.5549	-1.3324
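
The clustering step can be sketched with a small k-means (Lloyd's algorithm) loop over the PC scores. The example uses only the eight rows displayed in the table above, so it assumes k = 2 rather than the three clusters the page reports for the full dataset. The function name `kmeans` and the choice of starting centroids are illustrative, not taken from the page.

```python
import math

# (PC1, PC2) scores for the eight samples shown in the table.
points = [
    (-2.2326, -0.201), (-2.0945, -0.0858), (-2.1837, -0.3329), (-1.9075, -0.1544),
    (1.1585, -1.3303), (1.2945, -1.1607), (1.0597, -1.4703), (1.5549, -1.3324),
]

def kmeans(points, centroids, max_iterations=50):
    """Lloyd's algorithm: assign points to the nearest centroid, then
    move each centroid to the mean of its points, until nothing changes."""
    labels = []
    for _ in range(max_iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Assumption: start from the first and last points as initial centroids.
labels, centroids = kmeans(points, [points[0], points[-1]])

# Report clusters starting at 1, as the page does.
for index, label in enumerate(labels, start=1):
    print(f"Sample {index}: cluster {label + 1}")
```

K-means results depend on the starting centroids. On well-separated groups like these, different starts usually reach the same grouping, but that is not guaranteed in general.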

Original data

Detected numeric features

Numeric columns used for PCA: feature_1, feature_2, feature_3, feature_4. Label column: group.

sample	group	feature_1	feature_2	feature_3	feature_4
A1	Cluster A	2.1	2.4	1.2	0.8
A2	Cluster A	2.5	2.2	1.4	1.0
A3	Cluster A	2.2	2.8	1.1	0.7
A4	Cluster A	2.8	2.5	1.6	1.1
B1	Cluster B	6.2	5.8	4.9	4.6
B2	Cluster B	6.6	5.5	5.2	4.8
B3	Cluster B	5.9	6.1	4.7	4.4
B4	Cluster B	6.9	6.0	5.4	5.0
C1	Cluster C	9.1	2.1	8.2	2.2
C2	Cluster C	8.7	2.4	7.9	2.5
C3	Cluster C	9.5	1.9	8.5	2.0
C4	Cluster C	8.9	2.7	8.0	2.7
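
How the tool detects numeric columns is not described on the page. One plausible approach, sketched here as an assumption, is to treat a column as numeric when every value parses as a number. The example uses three rows of the teaching data written as comma-separated text, and `is_number` is a hypothetical helper.

```python
import csv
import io

# Three rows of the teaching dataset, written as comma-separated text.
csv_text = """sample,group,feature_1,feature_2,feature_3,feature_4
A1,Cluster A,2.1,2.4,1.2,0.8
B1,Cluster B,6.2,5.8,4.9,4.6
C1,Cluster C,9.1,2.1,8.2,2.2
"""

def is_number(text):
    """Return True when the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

rows = list(csv.DictReader(io.StringIO(csv_text)))

# A column counts as numeric only if every row's value parses as a float.
numeric_columns = [
    name for name in rows[0]
    if all(is_number(row[name]) for row in rows)
]

print("Numeric columns:", numeric_columns)
```

Note that `float` also accepts strings such as "nan" and "inf", so a real tool would likely add stricter checks and handle missing values.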

What students should learn

PCA reduces many correlated features to a smaller number of directions that preserve as much of the variance as possible.

The plot is not magic: it comes from normalization, covariance, eigenvectors, and projection. Students can inspect each step, then connect the result to clustering and exploratory data analysis.


About t-SNE and UMAP

This page implements PCA now because PCA has transparent mathematics students can calculate by hand. The page is structured so t-SNE and UMAP can be added later as additional browser-side projection methods.