2012 Winner: A Declarative Approach for Machine Learning at Scale

Project Information
A Declarative Approach for Machine Learning at Scale
Engineering
Independent Study
In the Spring quarter of my Junior year, I began working with Prof. Neoklis Polyzotis on a research project related to large-scale data processing. This project grew into a collaboration with teams from UC Irvine and Yahoo! Research. This work has the potential to improve the productivity of researchers who analyze large volumes of data, such as biological data, massive social network graphs, or the Web, and may yield significant performance and efficiency improvements for existing data processing workflows.

My research focuses on developing large-scale data processing frameworks that support recursion and iteration. In many fields, there is an increasing need to process and analyze massive data sets using large computer clusters. Many interesting computations involve recursion or iteration, which are difficult to express using existing tools. One example is computing the transitive closure of a social network graph in order to discover communities or central nodes. Another example is the class of iterative machine learning algorithms, such as gradient descent, where each iteration reads the data along with a "model" of the data produced by the previous iteration, and outputs a refined model for the next iteration.
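To make the iterative pattern concrete, here is a minimal sketch (my own toy example, not the project's code) of gradient descent for a one-dimensional linear model: each pass reads the full data set together with the previous model and emits a refined model.

```python
# Toy illustration of the read-data-and-model, emit-refined-model pattern
# described above; a real deployment would spread the pass over the data
# across a cluster rather than a single Python loop.
def gradient_step(data, model, learning_rate=0.05):
    """One iteration: read the data and the previous model, return a refined model."""
    grad = sum((model * x - y) * x for x, y in data) / len(data)
    return model - learning_rate * grad

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # stand-in for a massive data set
model = 0.0
for _ in range(200):                           # the iteration existing tools struggle to express
    model = gradient_step(data, model)
print(model)                                   # approaches the least-squares slope (about 2)
```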

Existing systems, such as Apache Hadoop and Giraph, are meant to be industrial-strength but do a poor job of expressing these iterative and recursive algorithms. Currently, programmers must glue together several different tools and frameworks or use custom-built systems tailored to specific algorithms. This leads to long development cycles that limit researchers' productivity, and the resulting code may be inefficient because optimization opportunities are missed. This is especially problematic in machine learning: the high cost of implementing new ML algorithms on clusters means that algorithms are tested on single machines during early development phases, resulting in a slow feedback loop when testing with large data sets.

As a step towards addressing these issues, my research explores extensions to the Hyracks data-parallel computing platform to express recursive and iterative algorithms. Hyracks, developed at UC Irvine, is a framework for large-scale processing of massive data sets. Hyracks models distributed computations as networks of operators, where each operator performs some computation over its input and then passes its output to other operators. Hyracks distributes this data flow network and manages communication between operators that may run on different nodes.
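As a rough illustration of the operator-network model, the sketch below is my own simplified, single-process analogue in Python; the actual Hyracks API is in Java and handles partitioning, scheduling, and network transfer between nodes. Operators can be pictured as nodes that apply a function to each input and push the result to downstream consumers.

```python
# Highly simplified, single-process sketch of a data flow of operators;
# the class and method names here are illustrative, not the Hyracks API.
class Operator:
    def __init__(self, fn, consumers=()):
        self.fn = fn                 # computation applied to each input item
        self.consumers = consumers   # downstream operators in the data flow

    def push(self, item):
        result = self.fn(item)
        for consumer in self.consumers:
            consumer.push(result)    # Hyracks would ship this between cluster nodes

# A tiny two-operator pipeline: parse comma-separated records, then print them.
printer = Operator(fn=lambda record: print(record))
parser = Operator(fn=lambda line: line.split(","), consumers=[printer])

for line in ["a,1", "b,2"]:
    parser.push(line)
```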

For two quarters, I explored extensions to Hyracks to express recursive algorithms in an asynchronous fashion. This approach is useful because many recursive algorithms, including transitive closure, do not require their execution to be performed in discrete rounds. This research has uncovered several interesting problems, which I plan to investigate during my final quarter at UCSC.
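For instance, transitive closure can be computed with a worklist that processes newly discovered edges as soon as they appear, rather than in synchronized rounds. The sketch below is my own illustration of that idea, not the Hyracks extension itself.

```python
# Worklist-style transitive closure: no global iteration barrier is needed,
# because each newly derived edge can be extended as soon as it is discovered.
def transitive_closure(edges):
    reachable = set(edges)
    worklist = list(edges)
    while worklist:
        a, b = worklist.pop()
        for c, d in list(reachable):
            if b == c and (a, d) not in reachable:
                reachable.add((a, d))
                worklist.append((a, d))  # derived facts feed back immediately
    return reachable

print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
# prints all six reachable pairs, including the derived (1, 3), (2, 4), and (1, 4)
```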

For the past two quarters, I worked on supporting iteration within Hyracks. I developed a framework for expressing Iterative Map-Reduce-Update algorithms, which correspond to a class of machine learning tasks that are commonly used in practice, and benchmarked it against other systems using a Yahoo! Research cluster and real-world data sets. This research is described in two papers:
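In outline, an Iterative Map-Reduce-Update algorithm supplies three functions that a driver applies repeatedly until convergence. The sketch below is a hedged illustration of the programming model as described above, with hypothetical function names rather than the framework's actual API.

```python
# Illustrative driver for the Iterative Map-Reduce-Update pattern; in the real
# framework the map step runs in parallel across the cluster and the reduce
# step aggregates the partial results before the model is updated.
def iterative_map_reduce_update(data, model, map_fn, reduce_fn, update_fn,
                                max_iterations=50):
    for _ in range(max_iterations):
        statistics = [map_fn(model, record) for record in data]  # Map: per-record statistic
        aggregate = reduce_fn(statistics)                        # Reduce: combine statistics
        model, converged = update_fn(model, aggregate)           # Update: refine the model
        if converged:
            break
    return model
```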

"Scaling Datalog for Machine Learning on Big Data", by Yingyi Bu, Vinayak Borkar, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan, argues for the use of the Datalog query language as a logical layer for expressing machine learning algorithms. In a nutshell, the idea is to express a machine learning task as a logical query over the database of training data. In turn, it becomes possible to apply well-known query optimization techniques in order to generate an efficient execution plan for the machine learning task.

We demonstrate how two common classes of machine learning tasks, Pregel and Iterative Map-Reduce-Update, can be represented in Datalog. Using optimized execution plans derived from the Datalog queries, we benchmark our implementations by performing experiments on a Yahoo! Research cluster. Our approach provides a general framework that matches the performance of specialized machine learning tools, and outperforms other general frameworks by orders of magnitude. This paper was submitted to PVLDB; an extended technical report is available online at http://arxiv.org/abs/1203.0160.

"Iterative MapReduce for Large Scale Machine Learning", by Joshua Rosen, Neoklis Polyzotis, Vinayak Borkar, Yingyi Bu, Michael J. Carey, Markus Weimer, Tyson Condie, and Raghu Ramakrishnan, takes a deeper look at the execution plan for the Iterative Map-Reduce-Update programming model. It formally defines the class of Iterative Map-Reduce-Update algorithms, describes existing systems that support them, and describes the construction of an optimized Iterative Map-Reduce-Update execution plan in Hyracks. We derive theoretical optimal values for some parameters of the execution plan and empirically validate our theory's predictions. Using the optimized Iterative Map-Reduce-Update plan, we benchmark our system by running batch gradient descent on a real-world data set and demonstrate that our implementation is competitive against a specialized, state-of-the-art system. This paper was submitted to KDD 2012.
Students
  • Joshua Daniel Rosen (Crown)
Mentors