Lightsnip is a hard fork of curso-r/treesnip. It adds LightGBM bindings for parsnip and enables more advanced LightGBM features, such as early stopping. It is not intended for general use, only as a dependency for CCAO regression models.
For detailed documentation on included functions, visit the full reference list.
You can install the released version of lightsnip
directly from GitHub with one of the following commands:
# Using remotes
remotes::install_github("ccao-data/lightsnip")
# Using renv
renv::install("ccao-data/lightsnip")
# Using pak
pak::pak("ccao-data/lightsnip")
# Append @ and a version tag to install a specific version
remotes::install_github("ccao-data/lightsnip@0.0.5")
Once it is installed, you can use lightsnip just like any other package: simply call library(lightsnip)
at the beginning of your script.
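As a quick check that the package loaded correctly, you can list the engines registered for parsnip's boost_tree(). This is a minimal sketch; it assumes that attaching lightsnip registers a lightgbm engine for boost_tree(), as treesnip does.

# Attach lightsnip alongside parsnip
library(lightsnip)
library(parsnip)

# List the engines available for boost_tree(); "lightgbm" should appear
# in the output if the engine registered correctly
parsnip::show_engines("boost_tree")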
Compared to treesnip, lightsnip removes support for the tree and catboost engines (LightGBM only) and changes how max_depth and other parameters are handled.

Here is a quick example using lightsnip
with a Tidymodels cross-validation workflow:
library(dplyr)
library(lightgbm)
library(lightsnip)
library(parsnip)
library(recipes)
library(workflows)
# Create a dataset for training
mtcars_train <- mtcars %>%
dplyr::slice(1:28) %>%
sample_n(size = 500, replace = TRUE) %>%
mutate(cyl = as.factor(cyl), vs = as.factor(vs))
# Create a test set
mtcars_test <- mtcars %>%
dplyr::slice(29:32) %>%
mutate(cyl = as.factor(cyl), vs = as.factor(vs))
# Recipe to convert factors to zero-based categorical integers, since
# LightGBM expects categorical features encoded as non-negative integers
rec <- recipe(mpg ~ ., mtcars_train) %>%
step_integer(all_nominal(), zero_based = TRUE)
# Split data into V-folds
resamples <- rsample::vfold_cv(mtcars_train, v = 2)
# Create a model specification. LightGBM-specific parameters are passed to
# set_engine, NOT to boost_tree
model <- parsnip::boost_tree(
trees = tune::tune()
) %>%
parsnip::set_engine(
engine = "lightgbm",
verbose = -1,
learning_rate = tune::tune(),
min_gain_to_split = tune::tune(),
feature_fraction = tune::tune(),
min_data_in_leaf = tune::tune(),
max_depth = tune::tune()
)
# Run grid search
search <- tune::tune_grid(
parsnip::set_mode(model, "regression"),
preprocessor = rec,
resamples = resamples,
param_info = model %>%
hardhat::extract_parameter_set_dials() %>%
stats::update(
learning_rate = learning_rate(),
min_gain_to_split = min_gain_to_split(),
feature_fraction = feature_fraction(),
min_data_in_leaf = min_data_in_leaf(c(1L, 2L)),
max_depth = max_depth(c(3L, 6L))
),
grid = 2,
metrics = yardstick::metric_set(yardstick::rmse)
)
# Finalize the model with the best hyperparameters and fit on the baked training set
final <- model %>%
tune::finalize_model(tune::select_best(search)) %>%
parsnip::set_mode("regression") %>%
parsnip::fit(mpg ~ ., bake(prep(rec), mtcars_train))
# Predict on test set
mtcars_test %>%
mutate(pred_mpg = predict(final, bake(prep(rec), .))$.pred) %>%
select(actual_mpg = mpg, pred_mpg) %>%
knitr::kable(digits = 2)
|                | actual_mpg | pred_mpg |
|:---------------|-----------:|---------:|
| Ford Pantera L |       15.8 |    14.08 |
| Ferrari Dino   |       19.7 |    21.20 |
| Maserati Bora  |       15.0 |    13.72 |
| Volvo 142E     |       21.4 |    22.12 |
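The example above tunes by cross-validation without early stopping. As a rough sketch of the early stopping support mentioned at the top, something like the following should work, reusing the recipe and training data from the example above. Note that the stop_iter and validation argument names are assumptions here, not confirmed lightsnip API; check the reference list for the exact interface.

library(dplyr)
library(lightsnip)
library(parsnip)
library(recipes)

# NOTE: stop_iter and validation are assumed names for illustration only;
# see the lightsnip reference for the actual early stopping arguments
es_model <- parsnip::boost_tree(
  trees = 1000,      # upper bound on boosting rounds
  stop_iter = 50     # assumed: stop after 50 rounds without improvement
) %>%
  parsnip::set_engine(
    engine = "lightgbm",
    verbose = -1,
    validation = 0.1 # assumed: hold out 10% of training data for early stopping
  ) %>%
  parsnip::set_mode("regression")

# Fit using the prepped recipe and training data from the example above
es_fit <- parsnip::fit(es_model, mpg ~ ., bake(prep(rec), mtcars_train))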