Machine learning studio · Frontend-only · Browser-local

Classification Studio — Random Forest vs XGBoost

Interactive classification playground: upload a CSV, choose features and a target, and compare two tree-based classifiers (Random Forest mode and XGBoost gradient boosting). The studio reports accuracy, a confusion matrix, and precision/recall/F1, and exports equivalent Python code.

Data

One dataset, many classifiers

Load a sample dataset or upload your own CSV. Choose the target (y) and the feature columns (X), then compare models.

Sample dataset | Upload CSV (optional)

The file stays in your browser. Pick columns below after upload.

Column mapping

Target (y)

Encoding

One-hot is usually safest when you have categorical features.
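
To make the encoding choice concrete, here is a minimal pure-Python sketch. The helper names `label_encode` and `one_hot_encode` and the toy `colors` column are hypothetical and not part of the studio. Label encoding maps categories to integers, which implies an order a tree can split on; one-hot encoding gives each category its own 0/1 column.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot_encode(values):
    """Turn each category into its own 0/1 indicator column."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]  # toy data, for illustration only

codes, cats = label_encode(colors)
onehot, _ = one_hot_encode(colors)

print(cats)    # ['blue', 'green', 'red']
print(codes)   # [2, 1, 0, 1]
print(onehot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

With label codes, a split such as "code < 2" groups blue and green together purely because of alphabetical order; the one-hot columns carry no such ordering.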

Feature columns (X)

Teaching note: this studio focuses on tabular classification. It can handle multiple feature columns, unlike the Regression Studio, which currently fits a single x.

Transform

Standardize numeric features

Often helpful for linear models; usually optional for trees (kept for teaching).

Train/test split

Test size: 25% (range 10% to 50%)

Seed
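
As a rough illustration of what the test-size and seed controls do, the sketch below implements a simplified stratified split in pure Python. The function `stratified_split` and the toy `labels` list are hypothetical. The exported code uses scikit-learn's `train_test_split(..., stratify=y)`, whose internals differ, so this only shows the idea: each class contributes about the same fraction to the test set, and a fixed seed reproduces the same split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class in test."""
    rng = random.Random(seed)  # same seed -> same shuffle -> same split
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 8 + ["b"] * 4  # toy target column: 8 of class a, 4 of class b
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)

print(len(train_idx), len(test_idx))        # 9 3
print(sorted(labels[i] for i in test_idx))  # ['a', 'a', 'b']
```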

Data preview

Showing up to 8 columns and the first 8 rows.

Models

Random Forest vs XGBoost

Both models are trained using XGBoost (WASM). Random Forest mode uses num_parallel_tree with subsampling.
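
The studio's exact settings are not shown here, so the following is only a sketch of how the two modes typically differ in XGBoost's native parameters. It follows the random forest recipe in the XGBoost documentation: one boosting round, many parallel trees, no shrinkage, and row/column subsampling. The function names, the default values, and the `binary:logistic` objective are assumptions for illustration.

```python
def boosting_params(max_depth=4, eta=0.2, subsample=0.9, colsample_bytree=0.9,
                    num_round=120):
    """Gradient boosting: many rounds, one tree per round, shrunk by eta."""
    params = {
        "objective": "binary:logistic",  # assumption: binary target
        "max_depth": max_depth,
        "eta": eta,
        "subsample": subsample,
        "colsample_bytree": colsample_bytree,
    }
    return params, num_round


def random_forest_params(n_trees=120, max_depth=4, subsample=0.9,
                         colsample_bynode=0.9):
    """Random Forest mode: one round that grows n_trees trees in parallel."""
    params = {
        "objective": "binary:logistic",  # assumption: binary target
        "max_depth": max_depth,
        "eta": 1.0,                      # no shrinkage; trees are averaged
        "subsample": subsample,          # row subsampling per tree
        "colsample_bynode": colsample_bynode,  # column subsampling per split
        "num_parallel_tree": n_trees,
    }
    return params, 1  # a single boosting round


gb_params, gb_rounds = boosting_params()
rf_params, rf_rounds = random_forest_params()
print(gb_rounds, rf_rounds, rf_params["num_parallel_tree"])  # 120 1 120

# With xgboost installed, either pair can be passed to the native API:
#   import xgboost as xgb
#   booster = xgb.train(rf_params, xgb.DMatrix(X, label=y), num_boost_round=rf_rounds)
```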

Model

Max depth

max_depth = 4

Boosting rounds (numRound)

numRound = 120

Learning rate (eta)

eta = 0.20

Subsample

subsample = 0.90

Column sample (by tree)

colsample_bytree = 0.90

L2 (lambda)

L1 (alpha)

Train a model to see the confusion matrix and metrics.
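
For readers new to these metrics, the sketch below derives accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are made up for illustration and are not results from the studio.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, how many are found
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy counts: 40 true positives, 10 false positives, 20 false negatives, 30 true negatives
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.700
# precision: 0.800
# recall: 0.667
# f1: 0.727
```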

Generated code (Python)

scikit-learn / XGBoost equivalent

# Classification Studio — Python export
# pip install pandas scikit-learn xgboost

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier

CSV_PATH = "your_dataset.csv"
FEATURE_COLS = [
    "FEATURE_1",
    "FEATURE_2",
]
TARGET_COL = "TARGET"
ENCODING = "onehot"
STANDARDIZE_NUMERIC = False
TEST_SIZE = 0.25
RANDOM_STATE = 42

df = pd.read_csv(CSV_PATH)
X = df[FEATURE_COLS]
# Recent XGBoost releases expect integer class labels 0..n_classes-1, so encode the target.
# label_encoder.classes_ maps the integer codes back to the original labels.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df[TARGET_COL])

numeric_cols = X.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

if ENCODING == "onehot":
    cat_transformer = OneHotEncoder(handle_unknown="ignore")
elif ENCODING == "label":
    # Teaching note: label (ordinal) encoding is usually NOT recommended for trees with
    # categoricals because it imposes an arbitrary order. Prefer one-hot encoding.
    # Assumption: OrdinalEncoder approximates the studio's "label" option; unseen
    # categories in the test set are mapped to -1.
    cat_transformer = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
else:
    cat_transformer = "passthrough"

num_steps = []
if STANDARDIZE_NUMERIC:
    num_steps.append(("scaler", StandardScaler()))
num_transformer = Pipeline(steps=num_steps) if num_steps else "passthrough"

preprocess = ColumnTransformer(
    transformers=[
        ("num", num_transformer, numeric_cols),
        ("cat", cat_transformer, categorical_cols),
    ],
    remainder="drop",
)

clf = XGBClassifier(
    max_depth=4,
    learning_rate=0.2,
    n_estimators=120,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_lambda=1,
    reg_alpha=0,
    random_state=RANDOM_STATE,
    tree_method="hist",
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y)

pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clf)])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))
print("\nClassification report:\n", classification_report(y_test, pred))

Teaching note: preprocessing choices (encoding/standardization) must match between the studio and Python for comparable results.