Recently, I came across the following StackOverflow post as well as the corresponding paper, which states that bagging based on resampling with and without replacement is equivalent in the scenarios considered by the authors. In particular, the paper states:
An equivalence between bagging based on resampling with and without replacement: resample sizes M and N (for resampling with and without replacement, respectively) produce very similar bagged statistics if M = N / (1 - N/n), with n the original sample size. This approximate equality holds for each sample, not just on the average over samples. While our derivation is limited to U-statistics, the equivalence seems to hold more widely, as illustrated by our limited experiments with bagging CART trees.
(from “Observations on Bagging” by Buja and Stuetzle, 2006)
While the authors are optimistic that their finding may be general, this needs further investigation since their results are mainly limited to U-statistics. Also note that it is generally NOT possible to simply swap sampling with replacement for sampling without replacement in an arbitrary application scenario, as argued in the note on random forests below. Nevertheless, I think it is an interesting result. The following examples and notes try to illustrate what the paper “Observations on Bagging” by Buja and Stuetzle (2006) actually implies and what it does not. Please feel free to comment and discuss 🙂
Here is some Python code illustrating the formula mentioned in the text:
import numpy as np
d = np.random.randint(1, 10, 100)
# from the paper: a_with = a_without / (1 - a_without)
a_without = 0.8
a_with = a_without / (1 - a_without)
# without replacement
m_without = np.array([ np.random.choice(d, int(len(d) * a_without), replace=False).mean() for _ in range(1000) ])
# with replacement (unadjusted sampling rate)
m_with_unadjusted = np.array([ np.random.choice(d, int(len(d) * a_without), replace=True).mean() for _ in range(1000) ])
# with replacement (adjusted sampling rate)
m_with_adjusted = np.array([ np.random.choice(d, int(len(d) * a_with), replace=True).mean() for _ in range(1000) ])
print(m_without.std(), m_with_unadjusted.std(), m_with_adjusted.std())
It is apparent that sampling with replacement at an unadjusted rate introduces additional variance, and that the formula from the paper correctly adjusts for that variance.
The following graph shows a_with = a_without / (1 - a_without) as a function of a_without. The more of the overall data we sample without replacement, the more samples with replacement are required to achieve the same results.
import matplotlib.pyplot as plt
import pandas as pd
pd.DataFrame([ (x, x / (1 - x)) for x in np.arange(0, 0.9, 0.01) ], columns=["a_without", "a_with"]).plot(x="a_without", y="a_with")
plt.show()
Note on random forests
For random forests, sampling with replacement is generally considered essential. This is because the underlying concept of random forests is bagging to prevent overfitting, i.e., bagging builds an ensemble of estimators trained on data with high variance (with regard to the training data each estimator has seen).
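To see where that variance comes from even though each bootstrap resample has the full size n: a size-n resample drawn with replacement contains only about 1 - 1/e ≈ 63.2% distinct training points on average, so each tree sees a genuinely different subset of the data. A minimal numpy check (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# fraction of distinct points in a size-n bootstrap resample, averaged over draws
fractions = [len(np.unique(rng.choice(n, n, replace=True))) / n for _ in range(200)]
print(np.mean(fractions))  # ≈ 1 - 1/e ≈ 0.632
```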
In particular, random forests sample with replacement as many samples as are present in the training data, that is, a_with = 1. According to the paper, this is equivalent to a_without = 0.5, i.e., sampling half of the data without replacement achieves the same variance as this sampling with replacement. I haven’t tried this, but it sounds plausible. However, reducing the number of samples to train on by half may get us in trouble if we only have a small number of samples to train on to begin with. This is because the individual estimators may not have enough data to learn anything.
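This particular claim is easy to sanity-check with the sample mean as the bagged statistic (a sketch of the equivalence, not the paper's experiment, with variable names of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.integers(1, 10, 100)

# random-forest-style resampling: n draws with replacement (a_with = 1)
s_bootstrap = np.array([rng.choice(d, len(d), replace=True).mean() for _ in range(5000)])
# equivalent subsampling per the paper: n/2 draws without replacement (a_without = 0.5)
s_half = np.array([rng.choice(d, len(d) // 2, replace=False).mean() for _ in range(5000)])

print(s_bootstrap.std(), s_half.std())  # should be close
```

The two standard deviations agree to within sampling noise, as predicted by a_with = a_without / (1 - a_without) with a_without = 0.5.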
Finally, if we trained random forests on all of the data sampled without replacement, i.e., a_without = 1, this would mean that we have no variance in the data used to train the individual estimators of the ensemble. Thus the ensemble would become meaningless. By the way, if we wanted a matching a_with, it would approach infinity as a_without approaches 1, which is hinted at in the plot above and can be seen directly from the formula.
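This degenerate case is also easy to verify: with a_without = 1, every "resample" is merely a permutation of the full data set, so every ensemble member sees exactly the same training data (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.integers(1, 10, 100)

# a_without = 1: drawing all n points without replacement just permutes d,
# so after sorting, every "resample" is identical
s = np.array([np.sort(rng.choice(d, len(d), replace=False)) for _ in range(10)])
print((s == s[0]).all())  # True: no variance between ensemble members
```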
Thanks to Florian Lemmerich for working on this with me 🙂