12.07.19
I was excited to start using the new set of 'tidymodels' packages from Max Kuhn (creator of caret): rsample, recipes, yardstick, parsnip and dials. These are still under development but seem promising. The one I have found most useful so far is recipes. Here I'll give a quick overview of how to use it for some simple data preparation for machine learning.
R's approach to machine learning has always been a bit haphazard and fragmented. There has never been an equivalent to Python's scikit-learn. I have never really got along with caret (the main contender) or mlr. I found their APIs difficult to learn and I've never liked the amount of control you give up by using them. I like the fact that this new set of packages is modular, so each one can be used without fully giving up on other approaches.
Basically, recipes provides a bunch of tools for preparing data and creating design matrices. This is a form of feature engineering. These matrices can then be used as training data for ML models. This is done in four steps: make a recipe, add steps to it, prep the recipe on the training data, and bake it to make the model matrices.
Here is a quick example that does median imputation, then centres and scales the airquality dataset, to give an idea of how it works.
library(recipes)
# split into training and test sets
aq_train = airquality[1:100, ]
aq_test = airquality[-(1:100), ]
# make recipe
recipe_1 = recipe(formula = Ozone ~ Solar.R + Wind + Temp + Month + Day,
                  data = aq_train) %>%
  # add steps
  # (step_impute_median() was called step_medianimpute() before recipes 0.1.16)
  step_impute_median(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # prep recipe: estimate medians, means and sds from the training data only
  prep(training = aq_train, retain = TRUE, verbose = TRUE)
# make model matrices by applying the prepped recipe to each dataset
mm_train = bake(recipe_1, new_data = aq_train, composition = 'matrix')
mm_test = bake(recipe_1, new_data = aq_test, composition = 'matrix')
After doing this you can go off and do what you want with the model matrix. Changing the composition argument lets you get a "tibble", "matrix", "data.frame" or "dgCMatrix".
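As a rough sketch of what that can look like, the snippet below bakes the same prepped recipe into both a matrix and a tibble, then fits a plain lm() on the tibble. The choice of lm() and the object names (rec, train_mat, train_tbl, fit, preds) are my own for illustration, and it assumes recipes 0.1.16 or later, where the imputation step is called step_impute_median(). Because all_numeric() also selects Ozone, the predictions are on the centred and scaled Ozone scale.

```r
library(recipes)

aq_train <- airquality[1:100, ]
aq_test  <- airquality[-(1:100), ]

# Same recipe as above, rebuilt here so the snippet runs on its own
rec <- recipe(Ozone ~ Solar.R + Wind + Temp + Month + Day, data = aq_train) %>%
  step_impute_median(all_numeric()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(training = aq_train)

# One prepped recipe, two output formats
train_mat <- bake(rec, new_data = aq_train, composition = "matrix")
train_tbl <- bake(rec, new_data = aq_train, composition = "tibble")

# Illustrative model only: ordinary least squares on the baked training data
fit <- lm(Ozone ~ ., data = train_tbl)

# Apply the same preprocessing to the test set before predicting
test_tbl <- bake(rec, new_data = aq_test, composition = "tibble")
preds <- predict(fit, newdata = test_tbl)
```

The matrix form is handy for functions that want a numeric matrix, while the tibble form works directly with formula interfaces such as lm().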
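This approach is flexible and allows a prepped recipe to be applied to a new dataset, which avoids data leakage problems: the medians, means and standard deviations are estimated from the training data only and then reused on the new data. A list of available functions is here. User-defined functions can also be made.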
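To make the data leakage point concrete, here is a small sketch using only Wind and Temp (which have no missing values) so the numbers are easy to check. The object names are hypothetical. It uses tidy() to pull the means that step_center() learned during prep() and compares them to the training column means.

```r
library(recipes)

aq_train <- airquality[1:100, ]
aq_test  <- airquality[-(1:100), ]

rec <- recipe(Ozone ~ Wind + Temp, data = aq_train) %>%
  step_center(Wind, Temp) %>%
  step_scale(Wind, Temp) %>%
  prep(training = aq_train)

# Means stored by the first step (step_center) when the recipe was prepped
learned <- tidy(rec, number = 1)
learned_means <- setNames(learned$value, learned$terms)

# They match the training set means, not the test set means
train_means <- colMeans(aq_train[, c("Wind", "Temp")])
print(learned_means[names(train_means)])
print(train_means)

# The baked training columns are centred on zero; the baked test columns
# are not, because the test set is shifted by the training means
baked_train_means <- colMeans(bake(rec, new_data = aq_train)[, c("Wind", "Temp")])
baked_test_means  <- colMeans(bake(rec, new_data = aq_test)[, c("Wind", "Temp")])
print(baked_train_means)
print(baked_test_means)
```

Nothing about the test set feeds into the preprocessing parameters, which is the behaviour you want when estimating out-of-sample performance.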
The recipes package is really useful and I've been using it a lot lately - it has streamlined a bit of my workflow that I'd been struggling with. It still has a few rough edges but is really worth trying out.