Regression algorithm

Ridge regression

A linear model for predicting a numeric value from feature columns. Trains a scikit-learn Ridge regressor (least squares with L2 regularization) behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard regression metric set plus coefficient-based feature importance.

It is the interpretable linear baseline alongside lightgbm-regressor-v1 — pick it when you want a fast, transparent model whose coefficients you can read, or as a yardstick for judging whether a more complex model is earning its keep.

What it does

You point it at a DataSource and pick:

  • a numeric target column you want to predict, and
  • one or more feature columns the model gets to look at.

Feature columns may be numeric, boolean, or string/categorical. Unlike the LightGBM regressor, which consumes categoricals natively, a linear model needs every feature to be numeric, so the model does that conversion for you internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)

The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.

How it works

The pipeline shape is identical to the LightGBM regressor — the same regressor_train / regressor_eval nodes:

data_source → random_split → train_data + test_data
                                  │           │
                                  ▼           ▼
              model ─────────► regressor_train       regressor_eval
                                  │                        ▲
                                  ▼                        │
                           trained_model ──────────────────┘
                                                           │
                                                           ▼
                                                      eval_result

regressor_train is fit-only; regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — ridge regression and LightGBM share the exact same scoring + metric code.
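The split between a fit-only step and a separate scoring step can be sketched as follows. This is a minimal illustration and not the platform's code: the two functions only borrow the node names, the synthetic data and the 20% test fraction are assumptions, and only RMSE is computed.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Random hold-out split; the 20% test fraction is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def regressor_train(model, X_train, y_train):
    """Fit-only step: returns the fitted model and computes no metrics."""
    return model.fit(X_train, y_train)

def regressor_eval(trained_model, X_test, y_test):
    """Scoring step: works for any estimator that exposes .predict()."""
    preds = trained_model.predict(X_test)
    rmse = float(np.sqrt(np.mean((y_test - preds) ** 2)))
    return {"predictions": preds, "rmse": rmse}

trained_model = regressor_train(Ridge(alpha=1.0), X_train, y_train)
eval_result = regressor_eval(trained_model, X_test, y_test)
```

Because the scoring function only calls `.predict()`, swapping `Ridge` for another regressor would leave the evaluation code unchanged, which is the sense in which the step is algorithm-agnostic.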

The preprocessing pipeline

The model is a scikit-learn Pipeline. Inside it:

| Feature kind | Steps |
| --- | --- |
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |

Standardizing the numeric features is not cosmetic for ridge: the L2 penalty is scale-sensitive, so without it a feature measured in large units would be regularized far more weakly than one in small units. Scaling puts every coefficient on a comparable footing — both for the penalty and for the importance chart.

The whole fitted pipeline — imputers, scaler, encoder, and coefficients — is serialized as one unit, so inference replays exactly what was fit.
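A pipeline with this shape could be assembled from standard scikit-learn pieces roughly as below. It is a sketch under assumptions: the column names and toy data are hypothetical, and `pickle` stands in for whatever serialization format the platform actually uses.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, for illustration only.
numeric_cols = ["area", "rooms"]
categorical_cols = ["city"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ])),
    ("ridge", Ridge(alpha=1.0)),
])

# Toy training frame with one missing value in each kind of column.
train = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "rooms": [2, 3, 3, 5],
    "city": ["Oslo", "Bergen", np.nan, "Oslo"],
    "price": [200.0, 260.0, 250.0, 400.0],
})
features = train[numeric_cols + categorical_cols]
model.fit(features, train["price"])
preds = model.predict(features)

# Serializing the whole pipeline keeps imputers, scaler, encoder, and
# coefficients together, so the restored object replays the same transforms.
restored = pickle.loads(pickle.dumps(model))
restored_preds = restored.predict(features)
```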

Why ridge, not plain least squares

Ridge adds an L2 penalty on the coefficient magnitudes, with its strength set by alpha. Plain ordinary least squares has no penalty: it overfits readily once there are many one-hot-expanded columns or correlated features, and because rescaling features does not change the predictions of an unpenalized fit, the scaler would have no effect on it. The penalty keeps the fit stable and the coefficients well-behaved; alpha is the one knob that trades fit against regularization.
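The stabilizing effect on correlated features can be seen with the textbook closed-form solution. The sketch below uses NumPy only, omits the intercept, and runs on synthetic data; it illustrates the penalty and is not how scikit-learn's solvers are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)

def ridge_coefficients(X, y, alpha):
    """Closed form for min ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_coefficients(X, y, alpha=0.0)    # plain least squares
w_ridge = ridge_coefficients(X, y, alpha=1.0)  # penalized fit

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

With two almost identical columns, the unpenalized solution is free to put a large positive weight on one and a compensating weight on the other; the penalized solution always has a smaller coefficient norm.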

Missing values & unseen categories

  • Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; a linear model does not, so this step is required — the model handles it so you don't have to.
  • Unseen categories at inference — a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM regressor treats an unseen level as missing.
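Both behaviors correspond to standard scikit-learn options. The snippet below is a small sketch on toy values; it assumes the encoder is configured with `handle_unknown="ignore"`, which is one way to get the all-zero indicator described above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of 1, 3, and 10.
imputed = SimpleImputer(strategy="median").fit_transform(
    np.array([[1.0], [np.nan], [3.0], [10.0]])
)

# One-hot encoding that tolerates categories never seen during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["red"], ["green"], ["blue"]], dtype=object))

# Learned categories are sorted: blue, green, red.
seen = encoder.transform(np.array([["green"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["purple"]], dtype=object)).toarray()
```

Here `seen` has a single 1 in the `green` position, while `unseen` is all zeros, so the row contributes nothing through that feature's coefficients.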

Metric set

Same as the LightGBM regressor — the eval step is shared:

| Metric | Meaning |
| --- | --- |
| MAE | Mean absolute error: average prediction error in target units |
| RMSE | Root mean squared error: penalizes large errors more heavily |
| R² | Coefficient of determination: fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean) |
| MAPE | Mean absolute percentage error: relative error, None if any test row has target == 0 |
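For reference, the four metrics can be computed from their textbook definitions as in the sketch below. The function name is hypothetical, MAPE is returned as a fraction rather than a percentage (an assumption; the eval step may scale it differently), and a non-constant target is assumed so that R² is defined.

```python
import math

def regression_metrics(y_true, y_pred):
    """Textbook MAE, RMSE, R^2, and MAPE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    # Assumes the targets are not all equal; otherwise ss_tot would be zero.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # MAPE is undefined when any true value is zero, so report None.
    if any(t == 0 for t in y_true):
        mape = None
    else:
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}

metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
with_zero_target = regression_metrics([0.0, 1.0], [0.1, 1.0])
```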

The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.

Feature importance

The chart shows the standardized-coefficient magnitude, |coefficient|, for each (one-hot-expanded) feature. Because features are scaled to unit variance before the fit, these magnitudes are roughly comparable across columns.

These are linear coefficients, not split gains — they are not numerically comparable to the LightGBM regressor's importance bars, and they describe linear effects only.
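A ranking of this kind can be sketched by fitting on standardized features and sorting the absolute coefficients. The feature names and synthetic data below are hypothetical, and the chart itself may be built differently.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features on very different raw scales.
names = ["income_usd", "age_years"]
X = np.column_stack([
    rng.normal(50_000, 10_000, 300),
    rng.normal(40, 10, 300),
])
# Per standard deviation, age moves the target about five times as much.
y = 0.0001 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_scaled = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

# Rank features by |coefficient| on the standardized scale.
importance = sorted(
    zip(names, np.abs(ridge.coef_)),
    key=lambda item: item[1],
    reverse=True,
)
for name, magnitude in importance:
    print(f"{name}: {magnitude:.3f}")
```

On the raw scale the income coefficient would look negligible only because of its units; after standardization the two magnitudes reflect effect per standard deviation.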

Hyperparameters

Pass these on the model node's hyperparams (all optional):

| Key | Default | Meaning |
| --- | --- | --- |
| alpha | 1.0 | L2 regularization strength: larger means stronger regularization (coefficients shrink toward zero); smaller approaches plain least squares |
| solver | "auto" | Linear-system solver: "auto" picks a direct method for dense data; "sag" / "saga" are iterative and honor the run seed |

Ridge's default solver is direct and deterministic, so no iteration cap is needed.
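As a rough sketch, the two keys map directly onto scikit-learn's `Ridge` constructor. The `build_ridge` helper, its `seed` argument, and the way defaults are merged are assumptions made for illustration, not the model node's actual code.

```python
from sklearn.linear_model import Ridge

# Documented defaults for the two optional keys.
DEFAULTS = {"alpha": 1.0, "solver": "auto"}

def build_ridge(hyperparams=None, seed=0):
    """Hypothetical helper: merge user hyperparams over the defaults."""
    params = {**DEFAULTS, **(hyperparams or {})}
    # random_state only matters for the iterative "sag" / "saga" solvers.
    return Ridge(
        alpha=params["alpha"],
        solver=params["solver"],
        random_state=seed,
    )

default_model = build_ridge()
strong_penalty = build_ridge({"alpha": 10.0})
iterative = build_ridge({"solver": "saga"}, seed=42)
```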

Limitations

  • Linear relationship only. Ridge models a linear relationship between features and the target. On its own it cannot capture feature interactions or non-linear effects — if accuracy lags the LightGBM regressor badly, that is usually why. Engineer interaction features upstream, or use the LightGBM regressor.
  • One-hot blow-up on high-cardinality columns. A categorical feature with hundreds of distinct values becomes hundreds of indicator columns. Prefer the LightGBM regressor (native categorical splits) for high-cardinality features, or reduce cardinality upstream.
  • No prediction intervals. This is point regression — the model outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them.
  • Random split assumes IID rows. If your data has temporal structure, so that rows should be split by date (train on earlier rows, test on later ones), use the forecast template instead.

See also

  • lightgbm-regressor-v1.md — gradient-boosted sister; non-linear, native categorical handling, usually higher accuracy.
  • logreg-classifier-v1.md — the categorical-target linear baseline with the same pipeline shape.

Not sure which to pick?

Choosing a regression algorithm

LightGBM vs Ridge vs OLS vs Random forest for predicting a number: when to start with a linear baseline, and when to reach for a tree-based model.