Classification algorithm
Logistic regression
A linear classifier for predicting a categorical label from feature columns. Trains a scikit-learn LogisticRegression behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard classification metric set plus coefficient-based feature importance.
It is the interpretable linear baseline alongside lightgbm-classifier-v1 — pick it when you want a fast, transparent model whose coefficients you can read, or as a yardstick for judging whether a more complex model is earning its keep.
What it does
You point it at a DataSource and pick:
- a categorical target column — the label to predict (yes/no, churned/retained, a segment, a status), and
- one or more feature columns the model gets to look at.
Feature columns may be numeric, boolean, or string/categorical. Unlike the LightGBM classifier — which consumes categoricals natively — logistic regression needs every feature numeric, so the model does that conversion for you, internally: there are no encoder or scaler nodes to wire. (You still can wire preprocessing upstream if you want explicit control.)
The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.
How it works
The pipeline shape is identical to the LightGBM classifier's, built from the same classifier_train / classifier_eval nodes:
    data_source → random_split → train_data       +       test_data
                                     │                        │
                                     ▼                        ▼
            model ─────────► classifier_train          classifier_eval
                                     │                        ▲
                                     ▼                        │
                               trained_model ─────────────────┘
                                                              │
                                                              ▼
                                                         eval_result
classifier_train is fit-only; classifier_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — logistic regression and LightGBM share the exact same scoring + metric code.
The preprocessing pipeline
The model is a scikit-learn Pipeline. Inside it:
| Feature kind | Steps |
|---|---|
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |
The standardization matters twice over: it helps the optimizer converge, and it puts the fitted coefficients on a comparable scale for the importance chart.
The whole fitted pipeline — imputers, scaler, encoder, and coefficients — is serialized as one unit, so inference replays exactly what was fit.
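The table above can be sketched in scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the template's actual source: the `build_pipeline` helper, the column names (`age`, `is_active`, `plan`), and the toy data are hypothetical, and `handle_unknown="ignore"` is assumed from the unseen-category behavior described in this document. The `C` and `max_iter` values are the documented defaults.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Hypothetical helper: impute + scale numerics, impute + one-hot categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Assumption: unseen levels are ignored (encoded as all zeros).
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(C=1.0, max_iter=1000)),
    ])


# Toy data with hypothetical column names; both feature kinds contain a gap.
df = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 31.0, 58.0, 45.0],
    "is_active": [1, 0, 1, 1, 0, 0],  # boolean encoded as 0/1
    "plan": ["basic", "pro", "basic", np.nan, "pro", "pro"],
})
y = ["retained", "churned", "retained", "retained", "churned", "churned"]

pipeline = build_pipeline(["age", "is_active"], ["plan"])
pipeline.fit(df, y)
predictions = pipeline.predict(df)
```

Because the imputers, scaler, and encoder live inside the same `Pipeline` object as the classifier, serializing `pipeline` captures everything needed to replay the fit at inference time.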
Missing values & unseen categories
- Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; logistic regression does not, so this step is required, and the model handles it so you don't have to.
- Unseen categories at inference: a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM classifier treats an unseen level as missing.
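The all-zero behavior for unseen categories can be illustrated with scikit-learn's `OneHotEncoder`. This sketch assumes the encoder is configured with `handle_unknown="ignore"`, which is the scikit-learn option that produces this behavior; the document does not state the exact setting, and the category values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known levels of a single categorical column.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["basic"], ["pro"]], dtype=object))

# "enterprise" never appeared in training, so its row has no indicator set.
new_rows = np.array([["pro"], ["enterprise"]], dtype=object)
encoded = encoder.transform(new_rows).toarray()
```

The first row sets the indicator for `pro`; the second row is all zeros, so the model scores it using the intercept and the remaining features only.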
Metric set
Same as the LightGBM classifier — the eval step is shared:
| Metric | Meaning |
|---|---|
| Accuracy | Fraction of test rows classified correctly |
| Precision / Recall / F1 | Weighted averages across classes (for binary targets, computed on the positive class) |
| ROC AUC | Ranking quality — binary single score, multi-class macro one-vs-rest |
| Confusion matrix | Per-class true-vs-predicted counts |
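For the multi-class case, the metrics in the table correspond to standard `sklearn.metrics` calls along these lines. This is a sketch rather than the shared eval code itself; the labels, predictions, and probabilities are made-up illustration data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hypothetical three-class test labels and predictions.
y_true = ["a", "b", "c", "a", "b", "c"]
y_pred = ["a", "b", "b", "a", "c", "c"]
# One row per test row; columns follow the sorted class order a, b, c.
y_proba = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.5, 0.4],
    [0.7, 0.2, 0.1],
    [0.1, 0.4, 0.5],
    [0.1, 0.2, 0.7],
]

accuracy = accuracy_score(y_true, y_pred)
# Weighted average: each class contributes in proportion to its support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
# Multi-class ROC AUC: macro average of one-vs-rest scores.
roc_auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["a", "b", "c"])
```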
Feature importance
The chart shows standardized-coefficient magnitude — |coefficient| for each (one-hot-expanded) feature. Because features are scaled to unit variance before the fit, these magnitudes are roughly comparable across columns. For multi-class models the per-class coefficient rows are collapsed by mean absolute value.
These are linear coefficients, not split gains — they are not numerically comparable to the LightGBM classifier's importance bars, and they describe linear effects only.
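The multi-class collapse described above amounts to a mean of absolute values down each coefficient column. A minimal NumPy sketch, using invented coefficient values and hypothetical feature names in the `(n_classes, n_features)` layout of `LogisticRegression.coef_`:

```python
import numpy as np

# Hypothetical fitted coefficients for a 3-class model over 4 expanded features.
coef = np.array([
    [0.9, -0.2, 0.1, -0.4],
    [-0.3, 0.5, -0.1, 0.2],
    [-0.6, -0.3, 0.0, 0.2],
])
feature_names = ["age", "is_active", "plan_basic", "plan_pro"]

# Collapse the per-class rows by mean absolute value, one score per feature.
importance = np.abs(coef).mean(axis=0)

# Rank features from most to least important.
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)
```

For a binary model `coef_` has a single row, so the same expression reduces to the plain absolute coefficient.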
Hyperparameters
Pass these on the model node's hyperparams (all optional):
| Key | Default | Meaning |
|---|---|---|
| C | 1.0 | Inverse regularization strength; smaller means stronger regularization |
| max_iter | 1000 | Solver iteration cap (raised above scikit-learn's default of 100 because the standardized + one-hot space often needs more) |
| class_weight | None | Set to "balanced" to up-weight minority classes on imbalanced data |
Regularization is L2 by default (controlled by C); scikit-learn's penalty argument is deprecated and not exposed.
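One way to picture how these keys reach the estimator is a defaults dictionary overlaid with user overrides. This is an illustrative sketch only: `make_estimator` and `DEFAULTS` are hypothetical names, the exact shape of the model node's hyperparams is not shown in this document, and rejecting unknown keys is an assumption rather than documented behavior.

```python
from sklearn.linear_model import LogisticRegression

# Documented defaults for the three exposed keys.
DEFAULTS = {"C": 1.0, "max_iter": 1000, "class_weight": None}


def make_estimator(hyperparams=None):
    """Hypothetical helper: overlay user hyperparams on the documented defaults."""
    overrides = hyperparams or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Assumption: keys outside the documented set are rejected.
        raise ValueError(f"Unsupported hyperparameters: {sorted(unknown)}")
    return LogisticRegression(**{**DEFAULTS, **overrides})


# Stronger regularization plus class re-weighting for imbalanced data.
estimator = make_estimator({"C": 0.1, "class_weight": "balanced"})
```

Keys that are left out keep their defaults, so `estimator` here still uses `max_iter=1000`.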
Limitations
- Linear decision boundary. Logistic regression models a linear relationship between features and the log-odds. On its own it cannot capture feature interactions or non-linear effects — if accuracy lags the LightGBM classifier badly, that is usually why. Engineer interaction features upstream, or use the LightGBM classifier.
- One-hot blow-up on high-cardinality columns. A categorical feature with hundreds of distinct values becomes hundreds of indicator columns. Prefer the LightGBM classifier (native categorical splits) for high-cardinality features, or reduce cardinality upstream.
- Random split assumes IID rows. If your rows have temporal structure (before vs. after some date), the classification templates are not the right tool — use the forecast template.
See also
- lightgbm-classifier-v1.md: gradient-boosted sister; non-linear, native categorical handling, usually higher accuracy.
- lightgbm-regressor-v1.md: numeric-target sister with the same fit/eval shape.
Not sure which to pick?
Choosing a classification algorithm: LightGBM vs Logistic regression vs Random forest for predicting a category. Start linear, go to trees when the boundary curves, and handle imbalanced classes.