Choosing a classification algorithm

Clarex ships three ways to predict a category — LightGBM classification, Logistic regression, and Random forest classification. All three take feature columns, train on a random hold-out split, handle binary and multi-class targets, and report the same metric set (accuracy, precision / recall / F1, ROC AUC, confusion matrix). What differs is how each one draws the boundary between classes — and how much it can tell you afterwards.

As with regression, the short version: start linear, go to trees when linear isn't enough.

At a glance

Algorithm                    | Decision boundary | Interpretability             | Categorical features      | Tuning
-----------------------------|-------------------|------------------------------|---------------------------|---------
Logistic regression          | Linear            | High: readable coefficients  | One-hot, automatic        | Minimal
Random forest classification | Non-linear        | Low: importance ranking only | One-hot, automatic        | Minimal
LightGBM classification      | Non-linear        | Low: importance ranking only | Native categorical splits | Moderate

Start with logistic regression

Logistic regression is the right first classifier in almost every case. It fits a linear boundary between classes and trains quickly. Because the features are standardized before the fit, its coefficients are roughly comparable in magnitude, so you can read which features push a row toward which class and how strongly. Even when you expect to need a tree model, logistic regression is the baseline that tells you whether the tree model is earning its complexity.
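
To see why standardizing makes coefficients comparable, here is a minimal sketch using only the Python standard library. It assumes plain z-scoring (subtract the mean, divide by the standard deviation); the exact scaler Clarex applies is not stated here, and the income figures are made up for illustration. The point is that the unit a feature is recorded in drops out after standardization, so coefficient size reflects the feature's effect rather than its unit.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature, recorded once in dollars and once in thousands of dollars.
income_dollars = [32000, 45000, 51000, 78000, 120000]
income_thousands = [v / 1000 for v in income_dollars]

z_dollars = standardize(income_dollars)
z_thousands = standardize(income_thousands)

# Both versions standardize to the same column, so a model fitted on either
# one would end up with the same coefficient for this feature.
same_after_scaling = all(abs(a - b) < 1e-9 for a, b in zip(z_dollars, z_thousands))
print(same_after_scaling)
```

Without this step, the same feature would get a coefficient 1,000 times larger or smaller depending on its unit, and comparing coefficients across features would be meaningless.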

Its limit is the linear boundary. If the classes are separated by a curved or interaction-driven boundary, logistic regression can't bend to fit it. A categorical feature with hundreds of values also becomes hundreds of one-hot columns. When logistic regression's accuracy lags far behind a tree model's, the linear boundary is usually why.
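
A small illustration of an interaction-driven boundary, using only the Python standard library. The four XOR-style points and the weight grid are made up for the sketch, and this is not how Clarex fits either model. It only shows that no straight line classifies all four points, while a two-level tree does.

```python
from itertools import product

# XOR-style data: the class depends on the interaction of the two features.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_accuracy(w1, w2, b):
    """Accuracy of the linear rule: predict 1 when w1*x1 + w2*x2 + b > 0."""
    hits = sum(int(w1 * x1 + w2 * x2 + b > 0) == y for (x1, x2), y in points)
    return hits / len(points)

# Brute-force a coarse grid of weights and intercepts (-2.0 to 2.0, step 0.25).
grid = [i / 4 for i in range(-8, 9)]
best_linear = max(linear_accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

def tree_rule(x1, x2):
    """Depth-2 tree: split on x1, then on x2 within each branch."""
    if x1 == 0:
        return 1 if x2 == 1 else 0
    return 0 if x2 == 1 else 1

tree_accuracy = sum(tree_rule(*x) == y for x, y in points) / len(points)

print(best_linear, tree_accuracy)
```

The best linear rule gets three of the four points right; the tree gets all four, because it can apply a different rule for x2 depending on the value of x1.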

When to go non-linear

Reach for a tree-based classifier when the linear boundary leaves accuracy on the table.

  • Random forest classification averages many independent decision trees (bagging). It is the robust, low-effort option — solid defaults, no learning rate, hard to overfit badly. Pick it when you want non-linear accuracy with almost no tuning. One caveat: its predicted probabilities rank well but are pulled toward the middle, so trust the ranking (and ROC AUC) over the raw numbers.
  • LightGBM classification builds trees sequentially, each correcting the last (boosting). On well-behaved tabular data it usually reaches the highest accuracy of the three, and it splits on categorical columns natively — a real edge when you have high-cardinality categories. The trade-off is more tuning: a learning rate and tree-complexity settings that interact.
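
To make the high-cardinality point concrete, the sketch below one-hot encodes a single categorical column using only the Python standard library. The merchant column and its 300 levels are hypothetical, and the encoder is a hand-rolled illustration rather than the one Clarex uses.

```python
def one_hot(values):
    """Expand one categorical column into one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[int(v == level) for level in levels] for v in values]
    return levels, rows

# Hypothetical column: 1,000 rows drawn from 300 distinct merchant IDs.
merchant = [f"m{i % 300:03d}" for i in range(1000)]

levels, encoded = one_hot(merchant)
n_columns = len(levels)

# One input column has become n_columns model inputs,
# and each row carries a single 1 among them.
print(n_columns)
```

A native categorical split works on the original column instead: each split divides the set of levels into two groups, so the model's input width does not grow with the number of levels.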

Between the two tree models: choose random forest when you want to set and forget, and LightGBM when you need peak accuracy or have messy high-cardinality categoricals.

A note on imbalanced classes

If one class is rare (fraud, churn, defects), accuracy alone is misleading: when 95% of rows belong to the majority class, a model that always predicts that class scores 95% accuracy while catching none of the rare cases. Watch precision, recall, and ROC AUC, and read the confusion matrix. Logistic regression and random forest both accept class_weight="balanced"; LightGBM accepts is_unbalance: true. Also use a stratified split so the rare class is present in both the train and test sets.
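
The sketch below works through that arithmetic with the Python standard library. The 950/50 label counts are made up. The "balanced" weight formula shown is the commonly used n_samples / (n_classes * class_count); that Clarex computes it exactly this way is an assumption. stratified_split is a hypothetical helper written for this example, not a Clarex function.

```python
import random
from collections import Counter

# Hypothetical labels: 950 majority rows (0) and 50 rare rows (1).
y_true = [0] * 950 + [1] * 50

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)  # high, despite a useless model
recall = tp / (tp + fn)           # share of rare rows actually caught
# Precision is undefined here: the model never predicts the rare class.

# Assumed "balanced" weighting: n_samples / (n_classes * class_count).
counts = Counter(y_true)
weights = {c: len(y_true) / (len(counts) * n) for c, n in counts.items()}

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

train_idx, test_idx = stratified_split(y_true)
rare_in_test = sum(y_true[i] == 1 for i in test_idx)

print(accuracy, recall, weights, rare_in_test)
```

The always-majority model reaches 0.95 accuracy with a recall of 0.0. The weighting gives each rare row about 19 times the weight of a majority row, and the stratified split guarantees the test set receives its proportional share of rare rows rather than leaving that to chance.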

Rules of thumb

  • Always run logistic regression first — it's free and it sets the bar.
  • Need to explain the prediction? Logistic regression's coefficients explain why; tree importances only rank what mattered.
  • Lots of high-cardinality categorical features? LightGBM handles them natively; the others one-hot every level.
  • Want non-linear accuracy with no tuning? Random forest.
  • Chasing the last few points of accuracy? LightGBM, with some tuning.
  • Rare, important class? Watch recall and ROC AUC, set the class weighting, and use a stratified split.

See also