Where should preprocessing live?

20th of January, 2022

…it depends

In my previous job I was an instructor at an immersive data science bootcamp. I earned a reputation for responding to student questions with “it depends”: “How do you measure performance for classification tasks?” “…it depends.” “What pre-processing steps should I use?” “…it depends.”

Of course, I didn’t stop there. The point was to help students learn how to navigate the “it depends,” so this would often be our starting point, from which students would start working through the implications of various decisions, with me listening and providing feedback on the thought process. I found this to be much more helpful to students than if I had simply said “in your case, you should use an f1 score”—even if that had been the best outcome given the student project, those details might change the next time, and the student would just be left with a simple rule instead of understanding.

Sane defaults

Since leaving the bootcamp environment, I’ve run into a few scenarios where I think “it depends” is not the best way to think about something. One such case is how we approach preprocessing: where should preprocessing code live, what do you do with data during intermediate steps, how should we think about preprocessing in the first place, etc. There are many options one could consider, and when I look at example code online I see a diversity of approaches. But there is one approach that is, in my book, clearly better in the majority of cases.

Piecemeal preprocessing

Below is code from the scikit-learn example Faces recognition example using eigenfaces and SVMs. In this example, the data is run through two preprocessing steps before an SVM is trained on it for a classification task (face recognition).

This is a common pattern that I’ve seen many times:

  1. Separate your dataset for validation.
  2. Apply preprocessing steps sequentially, taking care to fit_transform() training data and transform() test data.
  3. Then train your model on the final output of step 2.

Here is how the scikit-learn example implements the dataset splitting and the first preprocessing step:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Here is the second preprocessing step:

from sklearn.decomposition import PCA

n_components = 150

# Fit PCA on the scaled training data only
pca = PCA(n_components=n_components, svd_solver="randomized", whiten=True).fit(X_train)

# Reshape components into images; h and w are the image height and width from the dataset
eigenfaces = pca.components_.reshape((n_components, h, w))

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

And here is the model training:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Randomized search over SVM hyperparameters, fit on the PCA-transformed training data
param_grid = {
    "C": loguniform(1e3, 1e5),
    "gamma": loguniform(1e-4, 1e-1),
}
clf = RandomizedSearchCV(
    SVC(kernel="rbf", class_weight="balanced"), param_grid, n_iter=10
)
clf = clf.fit(X_train_pca, y_train)

I want to show another way of approaching this same problem before comparing the two approaches.

On-model preprocessing

Here is what I think should be the default approach:


from sklearn import pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Every learned step, preprocessing included, is encapsulated in one estimator
model_pipe = pipeline.make_pipeline(
    StandardScaler(),
    PCA(n_components=150, svd_solver="randomized", whiten=True),
    SVC(kernel="rbf", class_weight="balanced"),
)

model_pipe.fit(X_train, y_train)
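
Once fit, the whole pipeline behaves like a single estimator: scaling, PCA, and classification all happen behind one call. As a quick usage sketch, using the X_test and y_test from the split above:

# One call runs every step: scale, project, classify
y_pred = model_pipe.predict(X_test)
print(model_pipe.score(X_test, y_test))  # mean accuracy on the held-out split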

In the piecemeal scikit-learn example, RandomizedSearchCV is used for hyperparameter optimization. Using my approach, this can be done like so:

# Hyperparameters are addressed by step name: "svc__C" targets C on the SVC step
param_grid = {
    "svc__C": loguniform(1e3, 1e5),
    "svc__gamma": loguniform(1e-4, 1e-1),
}

tuner_pipe = RandomizedSearchCV(model_pipe, param_grid, n_iter=10)

tuner_pipe.fit(X_train, y_train)
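
As a usage sketch of what this buys you (assuming the same train/test split as above), the fitted search object refits the best pipeline and can be inspected and evaluated directly:

# Best hyperparameter values found by the search
print(tuner_pipe.best_params_)

# Score the refitted best pipeline on the held-out test set
print(tuner_pipe.score(X_test, y_test))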

Data leakage

Data leakage is important enough that it merits special attention here. Did you catch the data leakage in the piecemeal scikit-learn example?

In that example, the data was split into train and test, and then the two preprocessing steps were fit on the training data. The problem appears when we use RandomizedSearchCV to optimize the hyperparameters of our SVM model. The way RandomizedSearchCV is meant to work is that it takes several sets of possible hyperparameter values and estimates model performance for each set. We want to use the hyperparams that will give us the best performance in the real world, so we use cross-validation to estimate real-world performance on unseen data. But when the SVM is tested via cross-validation, it is tested on data that the preprocessing steps have already seen. I’ve reproduced this in the figure below: when the SVM starts the first round of cross-validation it is trained on folds 2-5 and tested on fold 1, but the preprocessing steps have already seen fold 1.

[Figure: cross-validation folds, adapted from https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance]
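
To make the difference concrete, here is a small illustrative sketch (not part of the original example) using cross_val_score on the variables defined above. The first call cross-validates the SVM on data the scaler and PCA have already seen in full; the second cross-validates the pipeline, which refits those steps inside each fold:

from sklearn.model_selection import cross_val_score

# Leaky: StandardScaler and PCA were fit on all of X_train, so every
# validation fold has already influenced the transformed features
leaky_scores = cross_val_score(
    SVC(kernel="rbf", class_weight="balanced"), X_train_pca, y_train, cv=5
)

# Leak-free: the pipeline refits the scaler and PCA within each fold,
# so each validation fold stays unseen by every step
clean_scores = cross_val_score(model_pipe, X_train, y_train, cv=5)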

This is admittedly an innocuous case of data leakage. I’m intentionally picking on a case where the stakes are small and where there’s a good reason for taking the piecemeal approach (it is often clearer to break out steps for educational purposes). But in the real world this can have dire consequences: the entire point of steps like cross-validation is to estimate performance in production and to make decisions based on that estimate. While the stakes were small in this toy case, they are potentially huge in applied data science. Data leakage can mean the performance you are reporting to stakeholders is not what you will actually see. It can mean that you are making modeling decisions based on one scenario, while the relevant scenario would call for entirely different decisions.

Why on-model

I’m suggesting that keeping preprocessing steps on-model should be the default approach to preprocessing in data science. One way to accomplish this is with scikit-learn’s pipelines (though there are many others). Their docs summarize the benefits of pipelines:

  • Convenience and encapsulation: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
  • Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.
  • Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
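
The second bullet, joint parameter selection, is worth a concrete sketch: because the preprocessing steps live on the same estimator, their parameters can be searched alongside the SVC’s. The n_components candidates below are illustrative, not tuned:

param_grid = {
    "pca__n_components": [100, 150, 200],  # illustrative candidates
    "svc__C": loguniform(1e3, 1e5),
    "svc__gamma": loguniform(1e-4, 1e-1),
}

tuner_pipe = RandomizedSearchCV(model_pipe, param_grid, n_iter=10)
tuner_pipe.fit(X_train, y_train)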

Another way of thinking about this is that your preprocessing steps are modeling steps. A preprocessing step learns from your data, stores parameters based on what it has learned, and then transforms your data. Separating modeling out into several piecemeal steps means you have more chances to mess up and cause data leakage, and you lose the convenience noted above.
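
To make that concrete, here is a minimal sketch (toy data, not from the example above) showing that a transformer is fit just like a model and stores what it learned:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler().fit(X_toy)  # learns from the data...
print(scaler.mean_, scaler.scale_)    # ...stores learned parameters...
print(scaler.transform(X_toy))        # ...and uses them to transform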

I also think the piecemeal approach is a misleading way to think about modeling; it is a “mental model” that will lead you astray. When you separate your model into many disjoint steps, it becomes much more difficult to reason about.

When each of these steps is considered an extension of your model, things become much clearer. Do you want to take your model out of notebook land and productionize it for others at your company to use? That’s going to be a lot easier if all of the steps are reliably encapsulated into one whole. Want to save model state so that you can reproducibly monitor your model? It’s a lot easier to do this with one model than with many separate steps. Do you need to cross-compile your model into JavaScript so that it can run on edge devices? Again, it will be a lot easier when every step is encapsulated in a framework designed to make this easy.
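
As one sketch of what that encapsulation buys you, persisting the fitted pipeline (here with joblib, a common choice for scikit-learn models; the filename is just a placeholder) carries every preprocessing step along with the classifier:

import joblib

# Save the fitted pipeline: scaler, PCA, and SVC travel together
joblib.dump(model_pipe, "face_model.joblib")

# Later, in another process: one object, nothing to reassemble by hand
loaded_model = joblib.load("face_model.joblib")
predictions = loaded_model.predict(X_test)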

Nix (Sophie) Searcy is a Data Scientist at Metis where she’s a bootcamp instructor and leads curriculum development. Nix works in deep learning and AI ethics. Through t4tech Nix helps provide free trans-centered classes in programming and data science. Her writing has appeared in them. and Information Week.