Bum Chul Kwon, Janu Verma, et al.
IEEE CG&A
To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark show that our approach can increase machine learning accuracy and robustness in the face of evolving data.
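The exponential-decay idea behind a probabilistically sized sample can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the per-batch time step, and the constant batch size used to derive the acceptance probability are all assumptions made for illustration. At each step every item already in the sample survives with probability exp(-decay_rate), and every arriving item is admitted with a fixed probability, so an item that arrived k steps ago is present with probability accept_prob * exp(-decay_rate * k).

```python
import math
import random


def target_accept_prob(target_size, batch_size, decay_rate):
    """Acceptance probability whose steady-state expected sample size is
    target_size, assuming every batch has exactly batch_size items
    (an illustrative assumption, not a claim about the paper)."""
    q = target_size * (1.0 - math.exp(-decay_rate)) / batch_size
    if q > 1.0:
        raise ValueError("target size is unreachable for this batch size and decay rate")
    return q


def decay_sample_step(sample, batch, decay_rate, accept_prob, rng):
    """One time step: age the current sample, then admit new arrivals."""
    keep_prob = math.exp(-decay_rate)
    # Each retained item survives independently with probability exp(-decay_rate).
    survivors = [x for x in sample if rng.random() < keep_prob]
    # Each arriving item is admitted independently with probability accept_prob.
    arrivals = [x for x in batch if rng.random() < accept_prob]
    return survivors + arrivals


def run_stream(num_steps, batch_size, target_size, decay_rate, seed=0):
    """Feed a synthetic stream of (step, index) items and record sample sizes."""
    rng = random.Random(seed)
    q = target_accept_prob(target_size, batch_size, decay_rate)
    sample, sizes = [], []
    for t in range(num_steps):
        batch = [(t, i) for i in range(batch_size)]
        sample = decay_sample_step(sample, batch, decay_rate, q, rng)
        sizes.append(len(sample))
    return sample, sizes
```

In this sketch the sample size only hovers around the target and has no hard upper bound, which is the limitation a reservoir-based scheme is meant to remove. A model would be retrained periodically on the list returned by `decay_sample_step`.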
Rainer Gemulla, Peter J. Haas, et al.
VLDB Journal
Peter J. Haas, Yushan Liu, et al.
Biometrics
Anish Das Sarma, Ander de Keijzer, et al.
Dagstuhl Seminar Proceedings 2008