Large-Scale Machine Learning
In many domains, data now arrives faster than we are able to learn from it. To avoid wasting this data, we must switch from the traditional "one-shot" machine learning approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. We have identified a set of desiderata for such systems, and developed an approach to building stream mining algorithms that satisfies all of them. The approach is based on explicitly minimizing the number of examples used in each learning step, while guaranteeing that user-defined targets for predictive performance are met. So far, we have applied this approach to four major (and widely differing) types of learner: decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. Our versions of these algorithms are able to mine orders of magnitude more data than the best previous algorithms (e.g., our decision tree learner can mine on the order of a billion examples per day on an ordinary PC). We are currently applying our approach to the difficult problem of large-scale relational learning, and have already obtained an order-of-magnitude speedup on a Web prediction task. We have released a beta version of the VFML toolkit with our current suite of stream mining algorithms. Our ultimate goal is to develop a set of primitives (or, more generally, a language) such that any learning algorithm built using them scales automatically to arbitrarily large data streams.
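The idea of using only as many examples per learning step as a performance target requires can be illustrated with a small sketch. The summary above does not say which statistical test is used; the sketch assumes the Hoeffding bound, which is the bound used by the decision tree learner described in "Mining high-speed data streams". The function names and parameters (`value_range`, `delta`, `epsilon`) are illustrative assumptions and are not taken from the VFML toolkit.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable spanning value_range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_bound(value_range, delta, n) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


def can_decide(best_score: float, second_score: float,
               value_range: float, delta: float, n: int) -> bool:
    """True once the observed gap between the two best candidates exceeds the
    bound, i.e. more examples are unlikely to change which candidate wins."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)
```

In a stream learner built this way, each step (for example, choosing a split attribute) would call something like `can_decide` as examples arrive and commit as soon as it returns `True`, so a step consumes only the examples it needs. The user-defined target enters through `delta`: a smaller `delta` demands more examples before a decision is made.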
Publications
Abductive Markov Logic for Plan Recognition. Parag Singla and Raymond J. Mooney. AAAI Conference on Artificial Intelligence, 2011. Full Paper (PDF)
Sum-Product Networks: A New Deep Architecture. Hoifung Poon and Pedro Domingos. Uncertainty in Artificial Intelligence, 2011. Full Paper (PDF)
Mining Massive Relational Databases. Geoff Hulten, Pedro Domingos and Yeuhi Abe. International Workshop on Statistical Relational Learning, 2003. Workshop Paper (PDF)
Learning from Infinite Data in Finite Time. Pedro Domingos and Geoff Hulten. Annual Conference on Neural Information Processing Systems, 2002. Full Paper (PDF)
Mining Complex Models from Arbitrarily Large Databases in Constant Time. Geoff Hulten and Pedro Domingos. Knowledge Discovery and Data Mining, 2002. Full Paper (PDF)
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering. Pedro Domingos and Geoff Hulten. International Conference on Machine Learning, 2001. Full Paper (PDF)
Mining Time-Changing Data Streams. Geoff Hulten, Pedro Domingos and Laurie Spencer. Knowledge Discovery and Data Mining, 2001. Full Paper (PDF)
Mining High-Speed Data Streams. Geoff Hulten and Pedro Domingos. Knowledge Discovery and Data Mining, 2000. Full Paper (PDF)
