Classification algorithm

LightGBM classification

Gradient-boosted classification for predicting a categorical label from feature columns. Trains one LightGBM booster on a random hold-out split, scores on the held-out test set, and surfaces the standard classification metric set (accuracy, precision / recall / F1, ROC AUC, confusion matrix) plus split-gain feature importance.

It is the accuracy-first option in the classification family — alongside the interpretable linear logreg-classifier-v1 and the bagging-based random-forest-classifier-v1. Pick it when you want the strongest out-of-the-box accuracy on tabular data, when features interact in non-linear ways, or when you have high-cardinality categorical features it can split on natively.

What it does

You point it at a DataSource and pick:

  • a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. None of them needs manual one-hot encoding or normalization at this stage (see Feature handling below for what happens to each type); if you want preprocessing (imputation, scaling, encoding) you can wire it explicitly upstream of the trainer node.

Binary and multi-class targets are both supported — the trainer picks the objective from the number of distinct labels (see Objective below).

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the other classifiers — the same classifier_train / classifier_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► classifier_train      classifier_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

classifier_train is fit-only — it produces a fitted booster but no metrics. classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — LightGBM, logistic regression, and random forest share the exact same scoring + metric code.
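
The fit/eval separation can be sketched with the standard library alone. The function names below mirror the node names, but the bodies, the MajorityModel stand-in, and the test_fraction parameter are illustrative assumptions, not the platform's implementation.

```python
import random
from collections import Counter

def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut off a hold-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # train, test

class MajorityModel:
    """Hypothetical stand-in for any classifier: predicts the most common training label."""
    def fit(self, rows):
        self.label_ = Counter(label for _, label in rows).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.label_ for _ in features]

def classifier_train(model, train_rows):
    """Fit-only step: returns the fitted model and computes no metrics."""
    return model.fit(train_rows)

def classifier_eval(trained_model, test_rows):
    """Algorithm-agnostic scoring: only needs an object with .predict()."""
    features = [f for f, _ in test_rows]
    labels = [y for _, y in test_rows]
    predictions = trained_model.predict(features)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels), "predictions": predictions}

rows = [({"x": i}, "yes" if i % 4 else "no") for i in range(40)]
train_rows, test_rows = random_split(rows, test_fraction=0.25, seed=42)
trained = classifier_train(MajorityModel(), train_rows)
eval_result = classifier_eval(trained, test_rows)
```

Because classifier_eval only calls .predict(), swapping the stand-in for a boosted model would leave the scoring code unchanged, which is the point of sharing the eval step.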

What boosting is

LightGBM builds an ensemble of decision trees sequentially: each new tree is fit to the errors the trees so far have left behind (formally, the gradient of the loss with respect to the current predictions), and its contribution is scaled down by a learning rate before being added in. Over many rounds these small corrections accumulate into a strong model. This is gradient boosting.

It is a different strategy from the random forest classifier's bagging, which fits many independent trees on bootstrap samples and averages them. Boosting usually reaches higher accuracy on well-behaved tabular data, because each tree is actively correcting the ensemble rather than just voting. The trade-off is more knobs — a learning rate and tree-complexity settings that interact — where a random forest is near-zero-tuning.
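
To make the sequential-correction idea concrete, here is a toy sketch of gradient boosting for a binary target in pure Python. It is not how LightGBM is implemented: it assumes a single numeric feature, uses one-split "stumps" instead of LightGBM's histogram-based leaf-wise trees, and the function names, data, and learning rate are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def log_loss(ys, scores):
    eps = 1e-12
    return -sum(y * math.log(sigmoid(s) + eps)
                + (1 - y) * math.log(1 - sigmoid(s) + eps)
                for y, s in zip(ys, scores)) / len(ys)

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    scores = [0.0] * len(xs)                  # raw (log-odds) score per row
    trees, losses = [], [log_loss(ys, scores)]
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss w.r.t. the score is y - p.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Shrink the new tree's contribution by the learning rate before adding it.
        scores = [s + learning_rate * tree(x) for s, x in zip(scores, xs)]
        losses.append(log_loss(ys, scores))
    return trees, losses

def predict_proba(trees, x, learning_rate=0.3):
    return sigmoid(sum(learning_rate * tree(x) for tree in trees))

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
trees, losses = boost(xs, ys)
```

The losses list records the training loss after each round; it never rises in this setup, which is the "each tree corrects the ensemble" behavior in miniature. A bagged forest would instead fit each tree to the labels directly and average the votes.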

Objective

The trainer picks the LightGBM objective from the number of distinct labels in the target column — you don't set it:

  • Two classes: objective="binary" (logistic loss).
  • Three or more: objective="multiclass", with num_class set to the label count (softmax).

Because the objective is derived from the data, don't override objective / num_class / metric through hyperparams — the Hyperparameters table below lists what is safe to tune.
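
The selection rule is small enough to sketch directly. The function name pick_objective and the error raised for fewer than two labels are assumptions made for this example; the parameter names and values are LightGBM's own.

```python
def pick_objective(labels):
    """Derive LightGBM objective parameters from the distinct labels in the target."""
    n_classes = len(set(labels))
    if n_classes < 2:
        # Assumption: a single-label target is treated as an error.
        raise ValueError("need at least two distinct labels to train a classifier")
    if n_classes == 2:
        return {"objective": "binary"}
    return {"objective": "multiclass", "num_class": n_classes}

binary_params = pick_objective(["churned", "retained", "churned"])
multi_params = pick_objective(["bronze", "silver", "gold", "silver"])
```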

Feature handling

  • Numeric columns pass through as-is — LightGBM tolerates NaN natively, so missing numbers need no imputation.
  • Boolean columns are cast to int.
  • Categorical / string columns are handed to LightGBM with the pandas Categorical dtype, so the booster uses Fisher categorical splits — no one-hot blow-up, even on high-cardinality columns. Each column's vocabulary is captured at fit time, persisted into the registered model's model_feature_schema.categorical_features, and replayed on inference.

This native handling is the LightGBM classifier's main edge over the linear and forest classifiers, both of which one-hot every categorical column.

Missing values & unseen categories

  • Missing values need no imputation — LightGBM treats NaN as a value its splits can route, unlike the scikit-learn classifiers, which require an upstream imputer.
  • Unseen categories at inference — a category value that never appeared in training is treated as missing rather than raising an error, so a new level in production data degrades gracefully.
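
The capture-and-replay of a categorical vocabulary, including the unseen-category fallback, can be sketched as follows. This is a pure-Python illustration with made-up function names and data, with None standing in for a missing value; with pandas, the comparable step would be building pd.Categorical(values, categories=vocabulary), where values outside the given categories become NaN.

```python
def fit_vocabulary(values):
    """Capture the distinct non-missing category values seen at fit time, sorted."""
    return sorted({v for v in values if v is not None})

def encode(values, vocabulary):
    """Map each value to its integer code; unseen or missing values become None."""
    codes = {category: i for i, category in enumerate(vocabulary)}
    return [codes.get(v) for v in values]

train_plans = ["basic", "pro", "basic", None, "enterprise"]
vocabulary = fit_vocabulary(train_plans)      # would be persisted with the model
train_codes = encode(train_plans, vocabulary)

# "student" never appeared in training, so it is encoded as missing, not an error.
inference_codes = encode(["pro", "student", "basic"], vocabulary)
```

Replaying the stored vocabulary, rather than re-deriving it from inference data, is what keeps the codes stable between training and production.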

Metric set

Same as the other classifiers — the eval step is shared:

Metric                  | Meaning
Accuracy                | Fraction of test rows classified correctly
Precision / Recall / F1 | Weighted averages across classes (binary: of the positive class)
ROC AUC                 | Ranking quality: a single score for binary targets, macro one-vs-rest for multi-class
Confusion matrix        | Per-class true-vs-predicted counts

The eval result carries a confusion matrix chart for every run, and an ROC curve chart for binary targets (it is omitted for multi-class).
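
As a worked illustration of the multi-class numbers, the sketch below computes a confusion matrix, accuracy, and support-weighted precision / recall / F1 by hand. It is a standard-library approximation of what such an eval step reports, with made-up labels, not the shared eval code itself.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def weighted_prf(y_true, y_pred, labels):
    """Per-class precision/recall/F1, averaged with class support as weights."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for i in range(len(labels)):
        tp = matrix[i][i]
        support = sum(matrix[i])                   # true rows of this class
        predicted = sum(row[i] for row in matrix)  # rows predicted as this class
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
labels = ["a", "b", "c"]

matrix = confusion_matrix(y_true, y_pred, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
precision, recall, f1 = weighted_prf(y_true, y_pred, labels)
```

Here one "a" row is misclassified as "b", so accuracy is 5/6. Support-weighted recall always equals accuracy, which is a quick sanity check when reading the metric set.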

Feature importance

The chart shows split gain — for each feature, the total reduction in loss contributed by every split that used it, summed across all trees (LightGBM's importance_type="gain"). Higher means the feature did more to separate the classes.

Gain-based importances are not numerically comparable to logistic regression's |coefficient| bars, nor to the random forest classifier's impurity-decrease (MDI) values; each model measures importance in its own units. Read the ranking, not the absolute numbers.
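
One way to read the ranking rather than the raw values is to sort by gain and convert each value to a share of the total. The helper name and the gain numbers below are hypothetical; in LightGBM's Python API the raw values would come from Booster.feature_importance(importance_type="gain").

```python
def rank_by_gain(gains):
    """Sort features by total split gain and express each as a share of the total."""
    total = sum(gains.values())
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [(name, gain / total if total else 0.0) for name, gain in ranked]

# Hypothetical per-feature gain totals for illustration only.
gains = {"tenure_months": 412.5, "plan": 137.5, "support_tickets": 200.0}
ranking = rank_by_gain(gains)
```

The shares are only meaningful within one model; they still cannot be compared with coefficient magnitudes or MDI values from another algorithm.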

Hyperparameters

Pass these on the model node's hyperparams (all optional — the defaults are a solid baseline):

Key              | Default | Meaning
learning_rate    | 0.05    | Step size for each boosting round; lower learns more slowly but usually generalizes better
num_leaves       | 31      | Maximum leaves per tree and the main complexity dial; higher fits more intricate patterns at the risk of overfitting
feature_fraction | 0.9     | Fraction of features sampled per tree; below 1.0 decorrelates the trees and curbs overfitting
bagging_fraction | 0.8     | Fraction of rows sampled for each bagging subset; adds row-level randomness and only takes effect when bagging_freq > 0
bagging_freq     | 5       | How often (in boosting rounds) to resample the bagging row subset; 0 disables row bagging

Any other LightGBM training parameter can also be passed through hyperparams — e.g. is_unbalance: true to up-weight the minority class on imbalanced data. The run seed drives the row and feature subsampling, so runs are reproducible.
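
How overrides, defaults, and the data-derived objective might combine is sketched below. The defaults are the ones in the table above; the build_params helper, its signature, and the choice to raise on a derived key are assumptions for illustration, not the trainer's actual behavior.

```python
DEFAULTS = {
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
}
# Keys the trainer derives from the data and that should not be overridden.
DERIVED_KEYS = {"objective", "num_class", "metric"}

def build_params(hyperparams, derived, seed):
    """Merge user overrides onto the defaults, refusing keys the trainer derives."""
    blocked = DERIVED_KEYS.intersection(hyperparams)
    if blocked:
        # Assumption: this sketch fails loudly; the real trainer may behave differently.
        raise ValueError(f"do not override derived parameters: {sorted(blocked)}")
    return {**DEFAULTS, **hyperparams, **derived, "seed": seed}

params = build_params(
    {"learning_rate": 0.1, "is_unbalance": True},
    derived={"objective": "binary"},
    seed=42,
)
```

Keys that are not in the defaults, such as is_unbalance here, pass straight through, while untouched defaults such as num_leaves stay at their table values.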

Limitations

  • More tuning than a random forest. The learning rate and num_leaves interact, and a too-large num_leaves overfits. The defaults are sound, but squeezing out the last few points of accuracy takes more care than the forest's near-zero tuning.
  • Probabilities are only roughly calibrated. Boosted-tree class probabilities rank well but are not guaranteed to be true probabilities. Trust the ranking (and ROC AUC) more than the raw numbers; calibrate downstream if you need honest probabilities.
  • Small data. With only a few hundred rows, a deep boosted ensemble can overfit; the interpretable logreg-classifier-v1 is often the safer baseline there.
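  • Imbalanced classes. On a strong class imbalance the booster can drift toward the majority class. Pass is_unbalance: true, or use a stratified split. Note that LightGBM applies is_unbalance only to the binary (and multiclassova) objectives, so it has no effect under the multiclass objective the trainer picks for three or more labels.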
  • No automatic feature engineering. The classifier uses the columns you give it verbatim. Wire preprocessing nodes upstream if you want explicit imputation, scaling, or encoding.
  • Random split assumes IID rows. If your rows have temporal structure, so that the model should be trained on rows before some date and tested on rows after it, the classification templates are not the right tool; use the forecast template.

See also

  • logreg-classifier-v1.md — the interpretable linear baseline; read its coefficients when you need to explain why.
  • random-forest-classifier-v1.md — the other tree-based classifier; bagging instead of boosting, more robust with less tuning.
  • lightgbm-regressor-v1.md — numeric-target sister with the same booster and fit/eval shape.

Not sure which to pick?

Choosing a classification algorithm

LightGBM vs Logistic regression vs Random forest for predicting a category — start linear, go to trees when the boundary curves, and handle imbalanced classes.