Classification algorithm
LightGBM classification
Gradient-boosted classification for predicting a categorical label from feature columns. Trains one LightGBM booster on a random hold-out split, scores on the held-out test set, and surfaces the standard classification metric set (accuracy, precision / recall / F1, ROC AUC, confusion matrix) plus split-gain feature importance.
It is the accuracy-first option in the classification family — alongside the interpretable linear logreg-classifier-v1 and the bagging-based random-forest-classifier-v1. Pick it when you want the strongest out-of-the-box accuracy on tabular data, when features interact in non-linear ways, or when you have high-cardinality categorical features it can split on natively.
What it does
You point it at a DataSource and pick:
- a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
- one or more feature columns the model gets to look at.
Feature columns may be numeric, boolean, or string/categorical — LightGBM handles all three natively. No manual one-hot encoding or normalization is required at this stage; if you want preprocessing (imputation, scaling, encoding) you can wire it explicitly upstream of the trainer node.
Binary and multi-class targets are both supported — the trainer picks the objective from the number of distinct labels (see Objective below).
The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.
How it works
The pipeline shape is identical to the other classifiers — the same classifier_train / classifier_eval nodes:
data_source → random_split → train_data        +        test_data
                                 │                          │
                                 ▼                          ▼
         model ─────────► classifier_train           classifier_eval
                                 │                          ▲
                                 ▼                          │
                           trained_model ───────────────────┘
                                                            │
                                                            ▼
                                                       eval_result
classifier_train is fit-only — it produces a fitted booster but no metrics. classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — LightGBM, logistic regression, and random forest share the exact same scoring + metric code.
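The fit-only / eval split described above can be sketched in plain Python. This is an illustration only: a trivial majority-label "model" stands in for the booster, and the function names `random_split`, `fit_majority`, and `evaluate`, as well as the 0.25 test fraction, are assumptions made for the example rather than the platform's actual nodes or defaults.

```python
import random
from collections import Counter

def random_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a seeded RNG and cut off a held-out test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

def fit_majority(train_rows):
    """Fit-only step: returns a 'model' (here just the majority label), no metrics."""
    return Counter(label for _, label in train_rows).most_common(1)[0][0]

def evaluate(model, test_rows):
    """Eval step: scores held-out rows without knowing how the model was fit."""
    predictions = [model for _ in test_rows]
    correct = sum(pred == label for pred, (_, label) in zip(predictions, test_rows))
    return {"accuracy": correct / len(test_rows), "predictions": predictions}

# Toy rows: (features, label), with one "yes" for every three "no".
rows = [({"x": i}, "yes" if i % 4 == 0 else "no") for i in range(40)]
train_data, test_data = random_split(rows, test_fraction=0.25, seed=42)
trained_model = fit_majority(train_data)
eval_result = evaluate(trained_model, test_data)
```

Because `evaluate` only needs something that produces predictions, the same scoring code can serve any classifier, which is the point of keeping the eval step algorithm-agnostic.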
What boosting is
LightGBM builds an ensemble of decision trees sequentially: each new tree is fit to the errors the trees so far have left behind, and its contribution is scaled down by a learning rate before being added in. Over many rounds the ensemble converges on the patterns that matter. This is gradient boosting.
It is a different strategy from the random forest classifier's bagging, which fits many independent trees on bootstrap samples and averages them. Boosting usually reaches higher accuracy on well-behaved tabular data, because each tree is actively correcting the ensemble rather than just voting. The trade-off is more knobs — a learning rate and tree-complexity settings that interact — where a random forest is near-zero-tuning.
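To make the "fit the next tree to the remaining errors, then add a scaled-down correction" loop concrete, here is a minimal sketch of gradient boosting for a binary target, using one-split trees (stumps) on a single numeric feature. It is a teaching sketch under simplifying assumptions, not LightGBM's algorithm: LightGBM additionally uses histogram binning, leaf-wise tree growth, and second-order gradient information. The toy data and the 0.5 learning rate are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, scores):
    """Mean logistic loss of raw scores against 0/1 labels."""
    eps = 1e-12
    return -sum(
        t * math.log(sigmoid(s) + eps) + (1 - t) * math.log(1 - sigmoid(s) + eps)
        for t, s in zip(y, scores)
    ) / len(y)

def fit_stump(x, residuals):
    """Find the one-split regression tree that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for v, r in zip(x, residuals) if v <= threshold]
        right = [r for v, r in zip(x, residuals) if v > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean

def boost(x, y, rounds=20, learning_rate=0.5):
    scores = [0.0] * len(x)                 # raw (log-odds) score per row
    losses = [log_loss(y, scores)]
    for _ in range(rounds):
        # Residual = label minus current predicted probability (negative gradient).
        residuals = [t - sigmoid(s) for t, s in zip(y, scores)]
        stump = fit_stump(x, residuals)
        # Add the new tree's correction, scaled down by the learning rate.
        scores = [s + learning_rate * stump(v) for s, v in zip(scores, x)]
        losses.append(log_loss(y, scores))
    return scores, losses

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 1, 1]
scores, losses = boost(x, y)
```

The recorded `losses` shrink round by round, which is the sequential error-correcting behaviour that distinguishes boosting from the independent trees of a bagged forest.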
Objective
The trainer picks the LightGBM objective from the number of distinct labels in the target column — you don't set it:
- Two classes → objective="binary" (logistic loss).
- Three or more → objective="multiclass", with num_class set to the label count (softmax).
Because the objective is derived from the data, don't override objective / num_class / metric through hyperparams — the Hyperparameters table below lists what is safe to tune.
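The selection rule can be sketched as a small function. `pick_objective` is a hypothetical name used for illustration, and the error raised for fewer than two labels is an assumption; the trainer's real implementation is not shown in this document.

```python
def pick_objective(labels):
    """Derive LightGBM objective parameters from the distinct labels in the target."""
    n_classes = len(set(labels))
    if n_classes < 2:
        # Assumed behaviour: a single-label target cannot train a classifier.
        raise ValueError("need at least two distinct labels to train a classifier")
    if n_classes == 2:
        return {"objective": "binary"}
    return {"objective": "multiclass", "num_class": n_classes}

binary_params = pick_objective(["churned", "retained", "churned"])
multi_params = pick_objective(["bronze", "silver", "gold", "silver"])
```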
Feature handling
- Numeric columns pass through as-is; LightGBM tolerates NaN natively, so missing numbers need no imputation.
- Boolean columns are cast to int.
- Categorical / string columns are handed to LightGBM with the pandas Categorical dtype, so the booster uses Fisher categorical splits, with no one-hot blow-up even on high-cardinality columns. Each column's vocabulary is captured at fit time, persisted into the registered model's model_feature_schema.categorical_features, and replayed on inference.
This native handling is the LightGBM classifier's main edge over the linear and forest classifiers, both of which one-hot every categorical column.
Missing values & unseen categories
- Missing values need no imputation: LightGBM treats NaN as a value its splits can route, unlike the scikit-learn classifiers, which require an upstream imputer.
- Unseen categories at inference: a category value that never appeared in training is treated as missing rather than raising an error, so a new level in production data degrades gracefully.
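The capture-and-replay idea, including how an unseen category ends up as a missing value, can be illustrated without any libraries. The helper names and the `plan` example column are hypothetical. In pandas, building `pd.Categorical(values, categories=vocabulary)` behaves similarly, turning values outside the given categories into NaN.

```python
import math

def capture_vocabulary(values):
    """Record the distinct non-missing category values seen at fit time."""
    return sorted({v for v in values if v is not None})

def encode(values, vocabulary):
    """Map values to category codes; missing or unseen values become NaN."""
    index = {category: code for code, category in enumerate(vocabulary)}
    return [float(index[v]) if v in index else math.nan for v in values]

train_plan = ["basic", "pro", "basic", None, "enterprise"]
vocabulary = capture_vocabulary(train_plan)   # would be persisted alongside the model

# "student" never appeared in training, so it is encoded as missing.
inference_plan = ["pro", "student", None]
codes = encode(inference_plan, vocabulary)
```

Replaying the stored vocabulary, rather than re-deriving categories from the inference data, keeps each code pointing at the same category the booster saw during training.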
Metric set
Same as the other classifiers — the eval step is shared:
| Metric | Meaning |
|---|---|
| Accuracy | Fraction of test rows classified correctly |
| Precision / Recall / F1 | Weighted averages across classes (binary: of the positive class) |
| ROC AUC | Ranking quality — binary single score, multi-class macro one-vs-rest |
| Confusion matrix | Per-class true-vs-predicted counts |
The eval result carries a confusion matrix chart for every run, and an ROC curve chart for binary targets (it is omitted for multi-class).
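For readers who want to see how the label-based metrics relate to the confusion matrix, here is a small standard-library sketch of accuracy and support-weighted precision / recall / F1 for a multi-class target. It illustrates the definitions only and is not the platform's eval code; ROC AUC is left out because it needs predicted scores rather than predicted labels.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_prf(matrix, k):
    """Precision, recall, and F1 for class index k."""
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)   # column total
    actual = sum(matrix[k])                     # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def weighted_metrics(y_true, y_pred, labels):
    matrix = confusion_matrix(y_true, y_pred, labels)
    total = len(y_true)
    accuracy = sum(matrix[k][k] for k in range(len(labels))) / total
    weighted = [0.0, 0.0, 0.0]
    for k in range(len(labels)):
        support = sum(matrix[k])                # number of true rows in class k
        for i, value in enumerate(per_class_prf(matrix, k)):
            weighted[i] += value * support / total
    return {
        "accuracy": accuracy,
        "precision": weighted[0],
        "recall": weighted[1],
        "f1": weighted[2],
        "confusion_matrix": matrix,
    }

labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "a"]
metrics = weighted_metrics(y_true, y_pred, labels)
```

One consequence worth knowing: support-weighted recall always equals accuracy, so in the multi-class case those two rows of the metric table carry the same number.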
Feature importance
The chart shows split gain — for each feature, the total reduction in loss contributed by every split that used it, summed across all trees (LightGBM's importance_type="gain"). Higher means the feature did more to separate the classes.
These are gain-based importances. They are not numerically comparable to logistic regression's |coefficient| bars, nor to the random forest classifier's impurity-decrease (MDI) values — each model measures importance in its own units. Read the ranking, not the absolute numbers.
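The aggregation behind the chart is a per-feature sum. The sketch below uses made-up split records (feature names and gain values are hypothetical) to show it. With a real LightGBM booster, the corresponding totals come from `Booster.feature_importance(importance_type="gain")`.

```python
from collections import defaultdict

# Hypothetical split records: (tree index, feature used, loss reduction from the split).
splits = [
    (0, "tenure_months", 120.0),
    (0, "plan", 45.5),
    (1, "tenure_months", 80.0),
    (1, "monthly_spend", 30.0),
    (2, "plan", 20.5),
]

def gain_importance(split_records):
    """Sum the gain of every split per feature, across all trees, highest first."""
    totals = defaultdict(float)
    for _tree, feature, gain in split_records:
        totals[feature] += gain
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))

importance = gain_importance(splits)
ranking = list(importance)   # feature names ordered by total gain
```

The absolute totals depend on the loss scale and the number of trees, which is why the ranking is the part to read.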
Hyperparameters
Pass these on the model node's hyperparams (all optional — the defaults are a solid baseline):
| Key | Default | Meaning |
|---|---|---|
| learning_rate | 0.05 | Step size for each boosting round; lower learns more slowly but usually generalizes better |
| num_leaves | 31 | Maximum leaves per tree; the main complexity dial, where higher fits more intricate patterns at the risk of overfitting |
| feature_fraction | 0.9 | Fraction of features sampled per tree; below 1.0 decorrelates the trees and curbs overfitting |
| bagging_fraction | 0.8 | Fraction of rows sampled per tree; adds row-level randomness |
| bagging_freq | 5 | How often (in boosting rounds) to resample the bagging row subset; 0 disables row bagging |
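Any other LightGBM training parameter can also be passed through hyperparams, e.g. is_unbalance: true to up-weight the minority class on imbalanced data (LightGBM documents this flag for the binary and multiclassova objectives only, so expect it to matter for two-class targets). The run seed drives the row and feature subsampling, so runs are reproducible.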
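One way to picture how overrides combine with the defaults is a simple dictionary merge. `build_params` is a hypothetical helper, and rejecting the data-derived keys with an error is an assumption made for the sketch; the document only says not to override them.

```python
DEFAULTS = {
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
}
# Keys derived from the data, per the Objective section.
DERIVED_KEYS = {"objective", "num_class", "metric"}

def build_params(hyperparams=None, seed=0):
    """Overlay user hyperparams on the defaults and attach the run seed."""
    overrides = dict(hyperparams or {})
    blocked = DERIVED_KEYS & overrides.keys()
    if blocked:
        # Assumed behaviour for this sketch: refuse rather than silently ignore.
        raise ValueError(f"derived from the data, do not override: {sorted(blocked)}")
    return {**DEFAULTS, **overrides, "seed": seed}

params = build_params(
    {"learning_rate": 0.02, "num_leaves": 15, "is_unbalance": True},
    seed=7,
)
```

Lowering learning_rate and num_leaves together, as in the usage above, is a typical conservative adjustment when a model overfits.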
Limitations
- More tuning than a random forest. The learning rate and num_leaves interact, and a too-large num_leaves overfits. The defaults are sound, but squeezing out the last few points of accuracy takes more care than the forest's near-zero tuning.
- Probabilities are only roughly calibrated. Boosted-tree class probabilities rank well but are not guaranteed to be true probabilities. Trust the ranking (and ROC AUC) more than the raw numbers; calibrate downstream if you need honest probabilities.
- Small data. With only a few hundred rows, a deep boosted ensemble can overfit; the interpretable logreg-classifier-v1 is often the safer baseline there.
- Imbalanced classes. On a strong class imbalance the booster can drift toward the majority class; pass is_unbalance: true, or use a stratified split.
- No automatic feature engineering. The classifier uses the columns you give it verbatim. Wire preprocessing nodes upstream if you want explicit imputation, scaling, or encoding.
- Random split assumes IID rows. If your rows have temporal structure, so that the model should train on rows before some date and be tested on rows after it, the classification templates are not the right tool; use the forecast template.
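On the calibration point above, a quick downstream check is a reliability table: bin the test rows by predicted probability and compare each bin's mean prediction with the observed positive rate. The sketch below is a standard-library illustration for a binary target; the labels and probabilities are made-up values, and the five-bin choice is arbitrary.

```python
def reliability_table(y_true, probabilities, n_bins=5):
    """Return (bin index, row count, mean predicted probability, observed positive rate)."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probabilities):
        k = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[k].append((label, p))
    table = []
    for k, members in enumerate(bins):
        if not members:
            continue                            # skip empty bins
        mean_predicted = sum(p for _, p in members) / len(members)
        observed_rate = sum(label for label, _ in members) / len(members)
        table.append((k, len(members), mean_predicted, observed_rate))
    return table

# Made-up held-out labels and predicted positive-class probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
probabilities = [0.05, 0.10, 0.15, 0.30, 0.35, 0.55, 0.70, 0.85, 0.90, 0.95]
table = reliability_table(y_true, probabilities)
```

If the mean prediction and the observed rate diverge consistently across bins, the scores still rank rows but should be recalibrated before being read as probabilities.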
See also
- logreg-classifier-v1.md: the interpretable linear baseline; read its coefficients when you need to explain why.
- random-forest-classifier-v1.md: the other tree-based classifier; bagging instead of boosting, more robust with less tuning.
- lightgbm-regressor-v1.md: numeric-target sister with the same booster and fit/eval shape.
Not sure which to pick?
Choosing a classification algorithm: LightGBM vs Logistic regression vs Random forest for predicting a category. Start linear, go to trees when the boundary curves, and handle imbalanced classes.