MuhammadLab
Machine learning clustering

PCA and Clustering Analysis Tool

Upload a CSV or use a teaching dataset, reduce many numeric features into PC1 and PC2, visualize the clusters, and inspect the actual PCA calculations students need to understand.

CSV upload, PCA calculations, K-means clusters

Current dataset: Teaching clusters

Rows: 12
Features: 4
Clusters: 3

PCA and clustering run locally in your browser. The CSV is read on your device and is not sent to a server.

PCA plot

Reduced from 4 features to 2 principal components

Each point is one row from your dataset. The coordinates are PCA scores, and colors are k-means clusters computed on PC1 and PC2.

[Scatter plot: x-axis PC1 (61.8% variance), y-axis PC2 (38% variance). Points carry the dataset labels Cluster A, Cluster B, and Cluster C (four each); the legend shows k-means clusters Cluster 1, Cluster 2, and Cluster 3.]

PC1 variance: 61.76%
PC2 variance: 37.95%
PC1 + PC2: 99.71%
Numeric features: 4

Step 1

Normalize every numeric feature

PCA is variance-based, so large-scale columns can dominate unless we standardize features first.

z = (x - mean) / standard deviation
feature_1: (2.1 - 5.95) / 2.8745 = -1.3394

Feature     Mean     Std dev
feature_1   5.95     2.8745
feature_2   3.5333   1.7333
feature_3   4.8417   2.9243
feature_4   2.65     1.6517
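
The standardization step can be sketched in a few lines of plain Python. This is a minimal illustration, not the tool's own code: the function name `standardize` is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, chosen because it reproduces the 2.8745 shown for feature_1.

```python
import math

# feature_1 column from the teaching dataset shown on this page
feature_1 = [2.1, 2.5, 2.2, 2.8, 6.2, 6.6, 5.9, 6.9, 9.1, 8.7, 9.5, 8.9]

def standardize(values):
    """Return (mean, sample_std, z_scores) for one numeric column."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample standard deviation, dividing by n - 1 rather than n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    std = math.sqrt(variance)
    z_scores = [(x - mean) / std for x in values]
    return mean, std, z_scores

mean, std, z = standardize(feature_1)
print(round(mean, 4), round(std, 4), round(z[0], 4))
```

The three printed numbers should agree with the mean, standard deviation, and first z-score reported for feature_1 above.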

Step 2

Compute the covariance matrix

The covariance matrix tells PCA which standardized features move together.

covariance(feature i, feature j) = sum(z_i * z_j) / (n - 1)

Feature     feature_1   feature_2   feature_3   feature_4
feature_1   1           0.057       0.997       0.48
feature_2   0.057       1           -0.004      0.892
feature_3   0.997       -0.004      1           0.423
feature_4   0.48        0.892       0.423       1
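
As a rough sketch of how this matrix could be computed, the following Python standardizes each column of the teaching dataset and applies the covariance formula above. The helper names `standardize_columns` and `covariance_matrix` are hypothetical, and the sample standard deviation (n - 1) is assumed. Because the features are standardized, the diagonal comes out as 1 and the off-diagonal entries are correlations.

```python
import math

# Teaching dataset rows: feature_1, feature_2, feature_3, feature_4
rows = [
    [2.1, 2.4, 1.2, 0.8], [2.5, 2.2, 1.4, 1.0], [2.2, 2.8, 1.1, 0.7], [2.8, 2.5, 1.6, 1.1],
    [6.2, 5.8, 4.9, 4.6], [6.6, 5.5, 5.2, 4.8], [5.9, 6.1, 4.7, 4.4], [6.9, 6.0, 5.4, 5.0],
    [9.1, 2.1, 8.2, 2.2], [8.7, 2.4, 7.9, 2.5], [9.5, 1.9, 8.5, 2.0], [8.9, 2.7, 8.0, 2.7],
]

def standardize_columns(data):
    """Return one list of z-scores per column (sample std, n - 1; an assumption)."""
    n = len(data)
    z_cols = []
    for col in zip(*data):
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        z_cols.append([(x - mean) / std for x in col])
    return z_cols

def covariance_matrix(z_cols):
    """covariance(i, j) = sum(z_i * z_j) / (n - 1) for every pair of columns."""
    n = len(z_cols[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z_cols]
            for zi in z_cols]

cov = covariance_matrix(standardize_columns(rows))
for row in cov:
    print([round(v, 3) for v in row])
```

The strong feature_1/feature_3 entry (about 0.997) is what lets PCA fold those two columns into nearly one direction.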

Step 3

Find eigenvalues and eigenvectors

Eigenvectors become principal component directions. Eigenvalues tell us how much variance each direction explains.

Component   Eigenvalue   Variance   Cumulative
PC1         2.4702       61.76%     61.76%
PC2         1.5181       37.95%     99.71%
PC3         0.011        0.27%      99.98%
PC4         0.0007       0.02%      100%
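
The Variance and Cumulative columns follow directly from the eigenvalues: each share is the eigenvalue divided by the sum of all eigenvalues. The sketch below uses the rounded eigenvalues from the table, so its percentages may differ from the page's in the last digit.

```python
# Eigenvalues as reported in the table above (already rounded by the page)
eigenvalues = [2.4702, 1.5181, 0.011, 0.0007]

# For standardized data the eigenvalues sum to the number of features (about 4 here)
total = sum(eigenvalues)
explained = [value / total for value in eigenvalues]

# Running total of the explained shares
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

for i, (share, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {share * 100:.2f}% of variance, {cum * 100:.2f}% cumulative")
```

Since PC1 and PC2 together cover roughly 99.7% of the variance, the two-dimensional plot loses very little information for this dataset.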

PC1 loadings

feature_1: 0.5496, feature_2: 0.3501, feature_3: 0.5285, feature_4: 0.544
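
One simple way to approximate the first eigenvector by hand or in code is power iteration: multiply a starting vector by the covariance matrix repeatedly and renormalize. The page does not say which eigen-solver it uses, so this is a swapped-in technique for illustration only. The function name `power_iteration` is hypothetical, and the matrix is the rounded one from Step 2, so results match the PC1 row only to about two or three decimals.

```python
import math

# Covariance matrix of the standardized features, copied from Step 2 (rounded)
cov = [
    [1.0, 0.057, 0.997, 0.48],
    [0.057, 1.0, -0.004, 0.892],
    [0.997, -0.004, 1.0, 0.423],
    [0.48, 0.892, 0.423, 1.0],
]

def multiply(matrix, vector):
    """Matrix-vector product for a square matrix stored as nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def power_iteration(matrix, iterations=500):
    """Approximate the dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    size = len(matrix)
    vector = [1.0 / math.sqrt(size)] * size  # arbitrary non-zero starting direction
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v . (A v) gives the eigenvalue for a unit vector v
    eigenvalue = sum(v * a for v, a in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector

eigenvalue, pc1 = power_iteration(cov)
print(round(eigenvalue, 3), [round(x, 3) for x in pc1])
```

Power iteration only finds the largest component. Later components need deflation or a full symmetric eigen-solver.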

Step 4

Project rows and cluster the PCA scores

Each standardized row is projected onto the principal component vectors with a dot product, which gives that row's PC scores. K-means then groups nearby rows in the PC1/PC2 space.

PC score = standardized row dot principal component vector
Sample 1 PC1 = -2.2326, PC2 = -0.201
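
The PC1 score for Sample 1 can be re-derived from numbers already on this page. The sketch below assumes Sample 1 is row A1 of the original data and uses the rounded means, standard deviations, and PC1 loadings shown in the tables, so the result should land within rounding error of -2.2326. The function name `pc_score` is hypothetical. PC2 loadings are not listed on the page, so only PC1 is reproduced.

```python
# Values copied from the tables on this page (rounded)
means = [5.95, 3.5333, 4.8417, 2.65]
stds = [2.8745, 1.7333, 2.9243, 1.6517]
pc1_loadings = [0.5496, 0.3501, 0.5285, 0.544]

row_a1 = [2.1, 2.4, 1.2, 0.8]  # assumed to be Sample 1 (row A1 in the original data)

def pc_score(row, means, stds, loadings):
    """Standardize one row, then take its dot product with a component vector."""
    z = [(x - m) / s for x, m, s in zip(row, means, stds)]
    return sum(zi * w for zi, w in zip(z, loadings))

score = pc_score(row_a1, means, stds, pc1_loadings)
print(round(score, 3))
```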

Sample     Label       Cluster   PC1       PC2
Sample 1   Cluster A   1         -2.2326   -0.201
Sample 2   Cluster A   1         -2.0945   -0.0858
Sample 3   Cluster A   1         -2.1837   -0.3329
Sample 4   Cluster A   1         -1.9075   -0.1544
Sample 5   Cluster B   2         1.1585    -1.3303
Sample 6   Cluster B   2         1.2945    -1.1607
Sample 7   Cluster B   2         1.0597    -1.4703
Sample 8   Cluster B   2         1.5549    -1.3324
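
To show what the clustering step does, here is a minimal k-means sketch run on the eight PC scores listed above. Several things are assumptions: it uses k = 2 because only two groups appear among the listed rows (the tool itself reports three clusters on all 12 rows), it seeds centroids with a farthest-point rule to stay deterministic, and the names `kmeans` and `sq_dist` are hypothetical. The page does not describe its own initialization.

```python
# PC1/PC2 scores for the eight samples listed in the table above
scores = [
    (-2.2326, -0.201), (-2.0945, -0.0858), (-2.1837, -0.3329), (-1.9075, -0.1544),
    (1.1585, -1.3303), (1.2945, -1.1607), (1.0597, -1.4703), (1.5549, -1.3324),
]

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iterations=100):
    # Assumption: farthest-point seeding, which keeps the example deterministic
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids

labels, centroids = kmeans(scores, k=2)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

The first four samples share one label and the last four share another, mirroring the Cluster column in the table (cluster numbering itself is arbitrary).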

Original data

Detected numeric features

Numeric columns used for PCA: feature_1, feature_2, feature_3, feature_4. Label column: group.

sample   group       feature_1   feature_2   feature_3   feature_4
A1       Cluster A   2.1         2.4         1.2         0.8
A2       Cluster A   2.5         2.2         1.4         1.0
A3       Cluster A   2.2         2.8         1.1         0.7
A4       Cluster A   2.8         2.5         1.6         1.1
B1       Cluster B   6.2         5.8         4.9         4.6
B2       Cluster B   6.6         5.5         5.2         4.8
B3       Cluster B   5.9         6.1         4.7         4.4
B4       Cluster B   6.9         6.0         5.4         5.0
C1       Cluster C   9.1         2.1         8.2         2.2
C2       Cluster C   8.7         2.4         7.9         2.5
C3       Cluster C   9.5         1.9         8.5         2.0
C4       Cluster C   8.9         2.7         8.0         2.7
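
Detecting which columns are numeric can be approximated by checking whether every cell in a column parses as a number. The sketch below is one plausible approach, not the tool's actual logic: the function names `is_number` and `split_columns` are hypothetical, and how the page chooses `group` rather than `sample` as the label column is not stated, so that choice is left out.

```python
import csv
import io

# Three rows of the teaching dataset, written out as CSV text
csv_text = """sample,group,feature_1,feature_2,feature_3,feature_4
A1,Cluster A,2.1,2.4,1.2,0.8
B1,Cluster B,6.2,5.8,4.9,4.6
C1,Cluster C,9.1,2.1,8.2,2.2
"""

def is_number(text):
    """True if the cell text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def split_columns(text):
    """Return (numeric_columns, other_columns); a column is numeric if every cell parses."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    header = list(reader.fieldnames)
    numeric = [name for name in header if all(is_number(row[name]) for row in rows)]
    other = [name for name in header if name not in numeric]
    return numeric, other

numeric, other = split_columns(csv_text)
print(numeric)
print(other)
```

Only the columns in `numeric` would be passed on to standardization. The remaining columns stay available as identifiers or labels.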

What students should learn

PCA reduces many correlated features into fewer directions that preserve variance.

The plot is not magic: it comes from normalization, covariance, eigenvectors, and projection. Students can inspect each step, then connect the result to clustering and exploratory data analysis.


About t-SNE and UMAP

This page implements PCA first because its mathematics is transparent enough for students to calculate by hand. The page is structured so that t-SNE and UMAP can be added later as additional browser-side projection methods.