Decision guide

Choosing a classification algorithm

Clarex ships three ways to predict a category — LightGBM classification, Logistic regression, and Random forest classification. All three take feature columns, train on a random hold-out split, handle binary and multi-class targets, and report the same metric set (accuracy, precision / recall / F1, ROC AUC, confusion matrix). What differs is how each one draws the boundary between classes — and how much it can tell you afterwards.

As with regression, the short version: start linear, go to trees when linear isn't enough.

At a glance

Algorithm	Decision boundary	Interpretability	Categorical features	Tuning
Logistic regression	Linear	High — readable coefficients	One-hot, automatic	Minimal
Random forest classification	Non-linear	Low — importance ranking only	One-hot, automatic	Minimal
LightGBM classification	Non-linear	Low — importance ranking only	Native categorical splits	Moderate

Start with logistic regression

Logistic regression is the right first classifier almost every time. It fits a linear boundary between classes, trains instantly, and — because the features are standardized before the fit — its coefficients are roughly comparable, so you can read which features push a row toward which class. Even when you expect to need a tree model, logistic regression is the baseline that tells you whether the tree model is earning its complexity.
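To see why standardizing first makes coefficients roughly comparable, here is a minimal sketch of z-score scaling in plain Python. It is an illustration only, not Clarex's own preprocessing code; the `standardize` helper and the sample columns are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Two hypothetical features on very different scales.
income = [32_000, 48_000, 51_000, 75_000, 120_000]
age = [23, 35, 41, 52, 64]

income_z = standardize(income)
age_z = standardize(age)

# After scaling, "one unit" means "one standard deviation" for both columns,
# so a coefficient of 0.8 on income_z and 0.4 on age_z can be compared directly.
```

Without this step, a coefficient on raw income (per dollar) would look tiny next to one on age (per year), regardless of which feature matters more.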

Its limit is in the name: the boundary is linear. If classes are separated by a curved or interaction-driven boundary, logistic regression can't bend to fit it. It has a second cost too: a categorical feature with hundreds of values becomes hundreds of one-hot columns. When logistic regression trails a tree model by a wide margin, one of these two limits is usually the reason.
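The one-hot cost is easy to see in a small sketch. The `one_hot` helper and the `merchant_id` column below are hypothetical and written in plain Python for illustration; they do not show how Clarex encodes features internally.

```python
def one_hot(values):
    """Expand one categorical column into one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical column: 500 rows drawn from 250 distinct merchant IDs.
merchant_id = [f"m{i % 250}" for i in range(500)]

levels, rows = one_hot(merchant_id)
n_columns = len(levels)  # one new feature column per distinct value
```

Here a single input column turns into 250 model features, each of which is 1 in only two of the 500 rows, so every coefficient is estimated from very little data.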

When to go non-linear

Reach for a tree-based classifier when the linear boundary leaves accuracy on the table.

Random forest classification averages many independent decision trees (bagging). It is the robust, low-effort option — solid defaults, no learning rate, hard to overfit badly. Pick it when you want non-linear accuracy with almost no tuning. One caveat: its predicted probabilities rank well but are pulled toward the middle, so trust the ranking (and ROC AUC) over the raw numbers.
LightGBM classification builds trees sequentially, each correcting the last (boosting). On well-behaved tabular data it usually reaches the highest accuracy of the three, and it splits on categorical columns natively — a real edge when you have high-cardinality categories. The trade-off is more tuning: a learning rate and tree-complexity settings that interact.

Between the two tree-based models: choose random forest for set-and-forget simplicity, and LightGBM for peak accuracy or messy high-cardinality categoricals.
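The random forest caveat about probabilities can be illustrated with a toy example. The sketch below computes ROC AUC from pairwise rankings and shows that it does not change when every probability is pulled toward 0.5. The squashing formula is a stand-in chosen for illustration, not how a random forest actually produces its probabilities, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of rows, so only for small demos.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 1, 0, 1]
confident = [0.02, 0.10, 0.35, 0.40, 0.70, 0.90, 0.15, 0.97]

# Pull every probability halfway toward 0.5: same ordering, less extreme values.
squashed = [0.5 + 0.5 * (p - 0.5) for p in confident]

auc_confident = roc_auc(labels, confident)
auc_squashed = roc_auc(labels, squashed)
```

Both score lists give the same AUC because AUC depends only on the ordering of rows. The raw values differ, though: a row scored 0.97 in the first list reads as roughly 0.74 in the second, which is why the ranking is more trustworthy than the number itself.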

A note on imbalanced classes

If one class is rare (fraud, churn, defects), accuracy alone is misleading: when 95% of rows belong to the majority class, a model that always predicts that class scores 95% accuracy while never catching a single rare case. Watch precision, recall, and ROC AUC, and read the confusion matrix. Logistic regression and random forest both accept class_weight="balanced"; LightGBM accepts is_unbalance: true. Also use a stratified split so the rare class is present in both the train and test sets.
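The accuracy trap can be worked through by hand. The following sketch builds the confusion matrix for a "model" that always predicts the majority class; the data (1,000 rows, 50 of them fraud) and the helper names are hypothetical, and the code is plain Python rather than anything Clarex-specific.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical data: 1000 rows, 50 of them fraud (positive class = 1).
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # always predict the majority class

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(tp, fp, fn)
```

Accuracy comes out at 0.95, yet recall is 0.0 and all 50 fraud rows land in the false-negative cell of the confusion matrix.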

Rules of thumb

Always run logistic regression first — it's free and it sets the bar.
Need to explain the prediction? Logistic regression's coefficients explain why; tree importances only rank what mattered.
Lots of high-cardinality categorical features? LightGBM handles them natively; the others one-hot every level.
Want non-linear accuracy with no tuning? Random forest.
Chasing the last few points of accuracy? LightGBM, with some tuning.
Rare, important class? Watch recall and ROC AUC, set the class weighting, and use a stratified split.
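The stratified split in the last rule can be sketched as follows. This is a minimal plain-Python illustration under assumed names (`stratified_split`, `test_fraction`, `seed`), not the splitting routine Clarex uses.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one row of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical target: 20 rare rows out of 1000.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

rare_in_test = sum(labels[i] for i in test_idx)
rare_in_train = sum(labels[i] for i in train_idx)
```

Splitting each class separately guarantees the rare class shows up on both sides in the same 80/20 ratio as the whole dataset. A purely random split only achieves that on average, and with very few rare rows it can leave the test set with almost none.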