Regression algorithm
Random forest regression
A non-linear model for predicting a numeric value from feature columns. Trains a scikit-learn RandomForestRegressor — an ensemble of decision trees (bagging) — behind a preprocessing pipeline (impute → scale numeric, impute → one-hot categorical), scores on a random hold-out split, and surfaces the standard regression metric set plus impurity-based feature importance.
It is the robust non-linear baseline alongside the linear regressors (ridge-regressor-v1, ols-regressor-v1) and the gradient-boosted lightgbm-regressor-v1. Pick it when the relationship between features and target is not linear or the features interact, and you want a model that works well with little tuning.
What it does
You point it at a DataSource and pick:
- a numeric target column you want to predict, and
- one or more feature columns the model gets to look at.
Feature columns may be numeric, boolean, or string/categorical. Like the linear regressors, and unlike the LightGBM regressor (which consumes categoricals natively), a scikit-learn forest needs every feature to be numeric, so the model does that conversion for you internally: there are no encoder or scaler nodes to wire. (You can still wire preprocessing upstream if you want explicit control.)
The output is a trained model + an eval_result carrying the metrics, predictions on the test rows, and a feature-importance chart.
How it works
The pipeline shape is identical to the other regressors — the same regressor_train / regressor_eval nodes:

    data_source → random_split → train_data        +        test_data
                                      │                         │
                                      ▼                         ▼
              model ─────────► regressor_train           regressor_eval
                                      │                         ▲
                                      ▼                         │
                                trained_model ──────────────────┘
                                                                │
                                                                ▼
                                                           eval_result

regressor_train is fit-only; regressor_eval runs the real prediction pass on the held-out test frame and emits the final scored result. The evaluation step is algorithm-agnostic — random forest, ridge, OLS, and LightGBM share the exact same scoring + metric code.
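The split between a fit-only step and a shared scoring step can be sketched as follows. This is a minimal illustration, not the platform's code: train_step and eval_step are hypothetical stand-ins for the regressor_train and regressor_eval nodes, and the synthetic data, the 75/25 split, and the Ridge settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def train_step(model, X_train, y_train):
    """Hypothetical stand-in for regressor_train: fit only, no scoring."""
    return model.fit(X_train, y_train)


def eval_step(trained_model, X_test, y_test):
    """Hypothetical stand-in for regressor_eval: needs nothing but .predict()."""
    predictions = trained_model.predict(X_test)
    rmse = float(np.sqrt(np.mean((predictions - y_test) ** 2)))
    return {"predictions": predictions, "rmse": rmse}


# Synthetic data with a curved relationship in the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The same eval_step scores both algorithms.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=0),
    "ridge": Ridge(alpha=1.0),
}
results = {
    name: eval_step(train_step(model, X_train, y_train), X_test, y_test)
    for name, model in candidates.items()
}
```

Because eval_step only calls predict, swapping the algorithm changes nothing downstream of the model object.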
What a random forest is
A random forest fits many decision trees and averages their predictions. Each tree is trained on a bootstrap sample of the rows, and at every split only a random subset of the features is considered (the subset size is set by max_features; at the default of 1.0 every feature is a candidate, so the randomness comes from the bootstrap alone). These sources of randomness make the individual trees disagree with each other, and averaging disagreeing trees cancels out their individual quirks (variance) while adding little bias. The result is a model that captures non-linear effects and feature interactions automatically, with no interaction terms to engineer, and is hard to overfit badly.
This is bagging (bootstrap aggregating). It is a different strategy from the LightGBM regressor's boosting, which builds trees sequentially, each correcting the last. Bagging is the more forgiving of the two: there is no learning rate, no early stopping, and the default settings are a solid baseline.
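The averaging step can be checked directly in scikit-learn: the forest's prediction is the mean of its trees' predictions. A small sketch on synthetic data; the dataset, tree count, and max_features value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
# Non-linear target with an interaction between the two features.
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=0.5 turns on per-split feature subsampling (1 of 2 features here).
forest = RandomForestRegressor(n_estimators=25, max_features=0.5, random_state=0)
forest.fit(X, y)

# Each fitted tree is exposed on estimators_; stack their predictions row-wise.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])
```

manual_average and forest_prediction agree up to floating-point rounding, while the rows of per_tree differ from one another: that spread is the variance the average removes.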
The preprocessing pipeline
The model is a scikit-learn Pipeline. Inside it:
| Feature kind | Steps |
|---|---|
| Numeric / boolean | impute missing values with the median → standardize to zero mean, unit variance |
| String / categorical | impute missing values with the most frequent value → one-hot encode |
Scaling is not needed for a tree-based model — decision-tree splits are scale-invariant, so standardizing numeric features changes nothing. It is kept only so every scikit-learn regressor shares one pipeline shape; it is a harmless no-op here.
The whole fitted pipeline — imputers, scaler, encoder, and the forest — is serialized as one unit, so inference replays exactly what was fit.
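A pipeline of this shape could be assembled from standard scikit-learn parts roughly as below. This is a sketch under assumptions, not the shipped implementation: the column names and toy rows are invented, and pickle stands in for whatever serialization the platform actually uses.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["area", "rooms"]   # hypothetical feature columns
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])

train = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "rooms": [2, 3, 3, 5, 2, 4],
    "city": ["A", "B", "A", np.nan, "B", "A"],
    "price": [100.0, 160.0, 150.0, 260.0, 125.0, 190.0],
})
model.fit(train[numeric_cols + categorical_cols], train["price"])

# Round-trip the whole fitted pipeline as one object, then score new rows
# that contain a missing value and a city never seen in training.
restored = pickle.loads(pickle.dumps(model))
new_rows = pd.DataFrame({"area": [70.0, np.nan], "rooms": [3, 4], "city": ["B", "C"]})
original_pred = model.predict(new_rows)
restored_pred = restored.predict(new_rows)
```

Serializing the pipeline rather than the bare forest means the medians, most-frequent values, and category lists learned at fit time travel with the model.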
Missing values & unseen categories
- Missing values are imputed inside the pipeline (median for numeric, most-frequent for categorical). LightGBM tolerates NaN natively; a scikit-learn forest does not, so this step is required. The model handles it so you don't have to.
- Unseen categories at inference: a category value that never appeared in training becomes an all-zero indicator rather than an error, matching how the LightGBM regressor treats an unseen level as missing.
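Both behaviours can be reproduced with the scikit-learn building blocks in isolation. The snippet assumes the encoder is configured with handle_unknown="ignore", which is the setting that produces an all-zero indicator; the colour values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Median imputation: the NaN is replaced by the median of 1, 3 and 10.
numeric = np.array([[1.0], [np.nan], [3.0], [10.0]])
imputed = SimpleImputer(strategy="median").fit_transform(numeric)

# One-hot encoding that ignores categories it did not see during fit.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["red"], ["green"], ["blue"]]))

known = encoder.transform(np.array([["green"]])).toarray()
unseen = encoder.transform(np.array([["purple"]])).toarray()
```

Here imputed[1, 0] is 3.0, known has a single 1 in the column for "green", and unseen is a row of zeros.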
Metric set
Same as the other regressors — the eval step is shared:
| Metric | Meaning |
|---|---|
| MAE | Mean absolute error — average prediction error in target units |
| RMSE | Root mean squared error — penalizes large errors more heavily |
| R² | Coefficient of determination — fraction of variance explained (1.0 = perfect, 0.0 = no better than predicting the mean) |
| MAPE | Mean absolute percentage error — relative error, None if any test row has target == 0 |
The runs panel surfaces RMSE as the headline number; all four are visible on the eval result detail.
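For reference, the four metrics can be computed by hand as in this sketch. The function name, the dictionary keys, the choice to report MAPE as a fraction rather than a percentage, and the NaN returned for a constant target are all assumptions, not the shared eval code.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Illustrative metric set; names and conventions are assumptions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))

    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    # R² is undefined for a constant target; NaN is an assumed convention.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # MAPE is None as soon as any true target is exactly zero.
    mape = None if np.any(y_true == 0) else float(np.mean(np.abs(err / y_true)))
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


example = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
with_zero = regression_metrics([0.0, 10.0], [1.0, 9.0])
```

For example, the errors are 2, -2, 3, -3, so MAE is 2.5, RMSE is the square root of 6.5, and R² is 1 - 26/500 = 0.948; with_zero reports MAPE as None because one target is 0.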
Feature importance
The chart shows impurity decrease (mean decrease in impurity, MDI) — for each (one-hot-expanded) feature, how much that feature reduced prediction error across all the splits that used it, averaged over every tree. The values are non-negative and sum to 1.
These are impurity-based importances — they are not numerically comparable to the linear regressors' |coefficient| bars, nor to the LightGBM regressor's split gains. One known quirk: MDI tends to inflate the importance of high-cardinality features (a column with many distinct values has more opportunities to split). Read the ranking, not the absolute numbers.
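In scikit-learn these values are exposed on the fitted forest as feature_importances_. A short sketch on synthetic data where only the first column carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise.
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

importances = forest.feature_importances_   # MDI, normalized by scikit-learn
ranking = np.argsort(importances)[::-1]     # most important feature first
```

The importances are non-negative and sum to 1, and column 0 comes first in ranking. The noise columns still receive small non-zero scores, which is one more reason to read the order rather than the magnitudes.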
Hyperparameters
Pass these on the model node's hyperparams (all optional):
| Key | Default | Meaning |
|---|---|---|
| n_estimators | 100 | Number of trees in the forest. More trees give a more stable fit at a higher training cost; accuracy plateaus rather than overfitting as this grows |
| max_depth | None | Maximum depth of each tree. None grows trees fully; set a cap to make a smaller, faster, more regularized model |
| min_samples_leaf | 1 | Minimum rows in a leaf. Raising it smooths predictions and curbs overfitting on noisy data |
| max_features | 1.0 | Fraction of features considered at each split. 1.0 considers every feature; lower values decorrelate the trees more |
| n_jobs | -1 | CPU cores used to fit trees in parallel; -1 uses all cores |
The run seed drives the bootstrap row sampling and the per-split feature subsampling, so runs are reproducible.
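As a rough picture of how such a hyperparams mapping might reach scikit-learn, the sketch below merges overrides onto the documented defaults and passes the seed as random_state. build_forest, the rejection of unknown keys, and the seed argument are hypothetical; only the keys and defaults come from the table above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Defaults copied from the hyperparameter table.
DEFAULTS = {
    "n_estimators": 100,
    "max_depth": None,
    "min_samples_leaf": 1,
    "max_features": 1.0,
    "n_jobs": -1,
}


def build_forest(hyperparams=None, seed=0):
    """Hypothetical helper: merge overrides onto the defaults."""
    overrides = dict(hyperparams or {})
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Rejecting unknown keys is an assumption made for this sketch.
        raise ValueError(f"unsupported hyperparameters: {sorted(unknown)}")
    return RandomForestRegressor(random_state=seed, **{**DEFAULTS, **overrides})


rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=120)

# Same seed and same settings: the two fits produce the same predictions.
settings = {"n_estimators": 10, "max_features": 0.5, "n_jobs": 1}
first = build_forest(settings, seed=42).fit(X, y).predict(X[:10])
second = build_forest(settings, seed=42).fit(X, y).predict(X[:10])
```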
Limitations
- Poor extrapolation. A forest predicts by averaging training targets, so it can never predict a value outside the range it saw in training. If your target trends beyond the training range, prefer a linear regressor. LightGBM is also tree-based and, with its standard constant-value leaves, does not extrapolate a trend either.
- Larger model size. A forest of fully grown trees serializes to a much larger artifact than a linear model or a single LightGBM booster. Cap max_depth or lower n_estimators if artifact size matters.
- Importance bias. Impurity-based importance over-credits high-cardinality features; use the ranking as a guide, not a precise measurement.
- Usually edged out by boosting. On well-behaved tabular data the LightGBM regressor often reaches slightly higher accuracy. The random forest's edge is robustness and near-zero tuning, not peak accuracy.
- No prediction intervals. This is point regression — the model outputs a single number per row. If you need uncertainty bands, the time-series forecaster ships them.
- Random split assumes IID rows. If your data has temporal structure (rows from before vs. after some date should be split that way), use the forecast template instead.
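The extrapolation limit is easy to see on a toy trend. In this sketch (synthetic data, with plain LinearRegression standing in for a linear regressor), a forest and a linear model are both fit on a perfectly linear target and then asked about inputs far beyond the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training inputs 0..49 with an exactly linear target, so y spans 1..99.
X_train = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + 1.0

# Inputs well beyond anything seen during training.
X_future = np.array([[100.0], [200.0]])

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

forest_pred = forest.predict(X_future)   # capped by the largest training targets
linear_pred = linear.predict(X_future)   # follows the trend to about 201 and 401
```

Both future rows fall into each tree's right-most leaf, so the forest returns the same value for x = 100 and x = 200, and that value cannot exceed the training maximum of 99.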
See also
- lightgbm-regressor-v1.md: gradient-boosted sister; also tree-based and non-linear, with native categorical handling and usually slightly higher accuracy.
- ridge-regressor-v1.md: the interpretable linear baseline with the same pipeline shape.
- ols-regressor-v1.md: the unregularized linear baseline.
Not sure which to pick?
Choosing a regression algorithm: LightGBM vs Ridge vs OLS vs Random forest for predicting a number. Covers starting with a linear baseline and when to reach for a tree-based model.