Wednesday 26 December 2012

Hadoop, the MapReduce Paradigm and the Corresponding Mindset Change

The latest rage in the Java universe is Hadoop, a platform for "Big Data" processing. A slew of O'Reilly books have been published on the subject. It implements the MapReduce paradigm popularized by Google.

MapReduce is inspired by LISP. A Map takes a function and a list, and applies the function to each element of the list to produce a new list. A Reduce operation combines the elements of a list into one value using a binary operation.

So a "MapReduce" operation consists of 1) applying a function to a list, to generate a new list, 2) combine the elements of the list, using some binary operator (which could be simple addition, or something more complex, like XOR) to produce a single value.  Many algorithms can be expressed using this paradigm.

You can see how this can speed up parallelisable tasks. For example, a word count on a huge file can be mapped out to different machines, and the results collated via Reduce.
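Word count is, in fact, the canonical Hadoop example. Here is a sketch along the lines of the Apache tutorial (exact job-setup details vary between Hadoop versions): the mapper emits a (word, 1) pair for every word, the framework groups the pairs by word, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            // The reducer doubles as a combiner, pre-summing on each node.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop splits the input among mappers running on different machines; the sorting and grouping between the map and reduce phases (the "shuffle") is handled entirely by the framework.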

It is a simple divide-and-conquer model for processing data in parallel, and it is the model Google uses to achieve massively parallel processing.

This type of computing, though, requires an altogether different mindset. Whereas previously we designed programs for single machines, or rather single-processor machines, we now need to create algorithms that work in a multiprocessor, multi-machine context: parallel algorithms.
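To make the point concrete, here is a hypothetical single-machine sketch using plain Java threads (not Hadoop): the combine step can only be split across workers because the operator is associative, so the partial results may be merged in any order. Designing around properties like associativity is exactly the kind of thinking parallel algorithms demand.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelReduce {
        public static void main(String[] args) throws Exception {
            final int[] data = new int[1000000];
            for (int i = 0; i < data.length; i++) data[i] = 1;

            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Long>> partials = new ArrayList<Future<Long>>();

            // "Map" each worker onto its own chunk of the data.
            int chunk = data.length / threads;
            for (int t = 0; t < threads; t++) {
                final int from = t * chunk;
                final int to = (t == threads - 1) ? data.length : from + chunk;
                partials.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long sum = 0;
                        for (int i = from; i < to; i++) sum += data[i];
                        return sum;
                    }
                }));
            }

            // Reduce: merge the partial sums. Correct only because addition
            // is associative - the order of merging does not matter.
            long total = 0;
            for (Future<Long> partial : partials) total += partial.get();
            pool.shutdown();

            System.out.println(total); // prints 1000000
        }
    }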