Wednesday, 26 December 2012

Hadoop, the MapReduce Paradigm and the Corresponding Mindset Change

The latest rage in the Java universe is Hadoop, a platform for "Big Data" processing. A slew of O'Reilly books has been published on the subject. It implements the MapReduce paradigm popularized by Google.

MapReduce is inspired by the map and reduce functions of LISP. A Map takes a function and a list of arguments and applies the function to each of the arguments. A Reduce combines the elements of a list into a single value using a binary operation.

So a "MapReduce" operation consists of 1) applying a function to a list to generate a new list, and 2) combining the elements of that list, using some binary operator (which could be simple addition, or something more complex, like XOR), to produce a single value. Many algorithms can be expressed using this paradigm.
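The two steps can be sketched in plain Java. This is only an illustration of the idea, not Hadoop's API; the class name MapReduceSketch and its helper methods are made up for this example, and it assumes Java 9 or later for List.of.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, written only to illustrate the two steps.
final class MapReduceSketch {

    // Step 1 (map): apply fn to every element, producing a new list.
    static <A, B> List<B> map(Function<A, B> fn, List<A> input) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    // Step 2 (reduce): fold the list into one value with a binary operator.
    static <B> B reduce(BinaryOperator<B> op, B identity, List<B> values) {
        B result = identity;
        for (B value : values) {
            result = op.apply(result, value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> squares = map(x -> x * x, List.of(1, 2, 3, 4));
        System.out.println(squares);                              // [1, 4, 9, 16]
        System.out.println(reduce(Integer::sum, 0, squares));     // 30 (addition)
        System.out.println(reduce((a, b) -> a ^ b, 0, squares));  // 28 (XOR)
    }
}
```

The identity argument (0 here) is the value that leaves the operator's result unchanged, which is what lets an empty list reduce to something sensible.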

You can see how this can speed up parallelisable tasks. For example, a word count on a huge file can be mapped onto different machines, with each machine counting its own portion of the file, and the partial results collated via Reduce.
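Here is a rough sketch of that word count, with the threads of a parallel stream standing in for separate machines. It is not Hadoop code; the class WordCountSketch and its methods are invented for the illustration, and the "file" is assumed to be already split into chunks.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class: a single-JVM stand-in for a distributed word count.
final class WordCountSketch {

    // Map step: count the words in one chunk of the file.
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce step: combine two partial results into a new map.
    // Neither input is modified, so the operation is safe to run in parallel.
    static Map<String, Integer> merge(Map<String, Integer> left, Map<String, Integer> right) {
        Map<String, Integer> merged = new TreeMap<>(left);
        right.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    // Each chunk is counted independently, then the partial counts are merged.
    static Map<String, Integer> wordCount(List<String> chunks) {
        return chunks.parallelStream()
                .map(WordCountSketch::countChunk)
                .reduce(new TreeMap<>(), WordCountSketch::merge);
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("the cat sat", "the dog sat", "the end");
        System.out.println(wordCount(chunks)); // {cat=1, dog=1, end=1, sat=2, the=3}
    }
}
```

The merge is associative and the empty map is its identity, which is what allows the partial results to be combined in any grouping.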

It is a simple divide and conquer model for processing data in a parallel fashion.

It is the model used by Google to achieve massively parallel processing.

This type of computing, though, requires an altogether different mindset. Whereas previously we were designing programs for single machines, or rather single-processor machines, we now need to create algorithms that work in a multiprocessor or multi-machine context: parallel algorithms.

Saturday, 14 July 2012

The Art of Backporting: Using Java 5 features in Java 1.4

Backporting is the application of a patch or fix to a version of a piece of software older than the one for which the patch was designed. One example would be the use of Java 1.5 features in Java 1.4. Basically, your 1.5 code is used to generate bytecode that is compatible with a 1.4 VM.

Take, for example, generics, which allow us to define objects with types as parameters, such as a List of Strings. Because Java implements generics using type erasure (type parameter information is wiped out during compilation), it is fairly easy to generate Java 1.4 compatible bytecode from 1.5 generics-based code. Victory to backporting!
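A small sketch of what erasure means in practice. The class ErasureSketch is made up for the illustration: it shows that two differently parameterised lists share one runtime class, and that generic code behaves like the raw-type-plus-cast code one would have written by hand in Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class illustrating type erasure; not a backporting tool.
final class ErasureSketch {

    // After compilation both lists are plain ArrayLists: the type arguments are gone.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    // Generics-based code: no cast in the source.
    static String firstGeneric() {
        List<String> names = new ArrayList<>();
        names.add("Ada");
        return names.get(0);
    }

    // Roughly the pre-generics equivalent: a raw List and an explicit cast.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String firstRaw() {
        List names = new ArrayList();
        names.add("Ada");
        return (String) names.get(0);
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());                  // true
        System.out.println(firstGeneric().equals(firstRaw()));   // true
    }
}
```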

Saturday, 16 June 2012

Adding External Jars (aka "Archives") to an Eclipse Project

Suppose you need to include some external libraries in your Java build path. Right-click the project, click on Build Path and then "Add External Archives". Presumably you've created a directory in your project folder that holds all these external JARs.

Awesome IO with Apache Commons

Awesome IO with Apache Commons starts with the org.apache.commons.io package. Among the most useful parts of this package are the static utility methods provided by IOUtils.

Friday, 15 June 2012

Hacking apache codecs

Apache Commons Codec was initially developed to provide a definitive implementation of a Base64 encoder in Java. Base64 represents binary data as ASCII text and is used in MIME attachments.
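To show what Base64 output looks like without pulling in an extra jar, this sketch uses the JDK's own java.util.Base64 (available since Java 8) rather than Commons Codec. The class name Base64Sketch is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical class; uses the JDK encoder, not the Commons Codec one.
final class Base64Sketch {

    // Turn text into bytes, then into a Base64 (ASCII-only) string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverse the process: Base64 string back to bytes, then to text.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("Hello");
        System.out.println(encoded);         // SGVsbG8=
        System.out.println(decode(encoded)); // Hello
    }
}
```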

Eclipse Is Great

Eclipse is a great Java IDE that must be celebrated. Among the cool features:

1. Automatic strikethrough of class names where the class is deprecated

Hacking apache collections

Once you get the jar added to your project, you can start cruising the javadoc and seeing what you can make of the Apache collections classes (and interfaces, mind you).

This will take you a few steps beyond the humble plateau of plain-vanilla Maps and Lists, by introducing variations of Sets and different types of Maps designed for special scenarios. But before you ascend to the Java data structure stratosphere, you might want to recap your knowledge of the basic interfaces in the Collections framework, such as the Map interface and the List interface. An interesting point of comparison is that while the Map interface has no superinterfaces, the List interface extends Collection and, by extension, Iterable, since every Collection is Iterable.
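The Map versus List comparison can be checked with a few lines of reflection. This is just a quick sketch; the class name InterfaceHierarchySketch is invented for the example.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical class that inspects the JDK collection interfaces via reflection.
final class InterfaceHierarchySketch {

    // True because List extends Collection.
    static boolean listIsCollection() {
        return Collection.class.isAssignableFrom(List.class);
    }

    // True because Collection in turn extends Iterable.
    static boolean listIsIterable() {
        return Iterable.class.isAssignableFrom(List.class);
    }

    // Map declares no superinterfaces at all.
    static int mapSuperinterfaceCount() {
        return Map.class.getInterfaces().length;
    }

    public static void main(String[] args) {
        System.out.println(listIsCollection());       // true
        System.out.println(listIsIterable());         // true
        System.out.println(mapSuperinterfaceCount()); // 0
    }
}
```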

Perhaps you are interested in the numerous implementations of Bag (a collection that allows repeated elements, e.g. two pairs of socks in a gym bag), and wander over to TreeBag or HashBag (now deprecated). Also fascinating are the various implementations of the BidiMap interface, such as TreeBidiMap, which allows lookup from keys to values and from values to keys with equal efficiency. It is more storage-efficient than using two separate TreeMaps, although DualTreeBidiMap does use that approach.
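To make the "two separate TreeMaps" idea concrete, here is a minimal sketch built only on the JDK. It is not the Commons Collections implementation; the class TwoMapBidiSketch and its method names are made up for the illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical two-map bidirectional map: one TreeMap per lookup direction.
final class TwoMapBidiSketch<K extends Comparable<K>, V extends Comparable<V>> {
    private final Map<K, V> forward = new TreeMap<>();
    private final Map<V, K> inverse = new TreeMap<>();

    void put(K key, V value) {
        // Keep the mapping one-to-one: drop any old pairing of this key or value.
        V oldValue = forward.remove(key);
        if (oldValue != null) {
            inverse.remove(oldValue);
        }
        K oldKey = inverse.remove(value);
        if (oldKey != null) {
            forward.remove(oldKey);
        }
        forward.put(key, value);
        inverse.put(value, key);
    }

    // Key-to-value lookup.
    V get(K key) {
        return forward.get(key);
    }

    // Value-to-key lookup, at the same cost as get.
    K getKey(V value) {
        return inverse.get(value);
    }

    int size() {
        return forward.size();
    }
}
```

Every entry is stored twice, once in each map, which is the storage cost the single-structure approach avoids.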

Always remember the distinction between interfaces and classes. Do not try to create a Bag by instantiating the Bag interface itself; instead, instantiate an implementing class such as TreeBag (but not a deprecated one such as HashBag).

Wednesday, 29 February 2012

The Road of JSR (Java Specification Request)

Unhappy with Java APIs as they stand? File a JSR (Java Specification Request). An example would be the Money and Currency API (JSR 354). Filing JSRs is part of the Java Community Process.