Hadoop MapReduce

Last time, I talked about three platforms for building highly scalable, concurrent applications. Today I'm going to do a deeper dive into Hadoop MapReduce.
MapReduce jobs are structured as a series of stages.
  • Input is read and split into Key/Value records. In Hadoop, this responsibility is handled by the InputFormat implementation
  • Each Key/Value pair is passed to an implementation of a Mapper. The Mapper accepts a single Key/Value pair of its input types and produces zero or more Key/Value pairs of its output types. Expressed as a Scala function:
        (INPUT_KEY, INPUT_VALUE) => Collection[(OUTPUT_KEY, OUTPUT_VALUE)]
  • The output from the mappers is grouped by key and passed to the reducers (this step is implemented by the platform)
  • For each key a reducer is run; it is passed the key and an Iterable of the values associated with that key. The reducer, expressed as a Scala function, would look something like:
        (INPUT_KEY, Iterable[INPUT_VALUE]) => Collection[(OUTPUT_KEY, OUTPUT_VALUE)]
  • The output from the Reducer is written to the output medium using an implementation of OutputFormat
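The stages above can be sketched in plain Scala using ordinary collections. This is a word-count example showing only the data flow, not the Hadoop API: the names `mapper`, `reducer`, and `run` are illustrative, and the grouping step done here with `groupBy` is what the framework performs for you in a real job.

```scala
// A minimal, self-contained sketch of the MapReduce data flow (word count).
// All names here are illustrative; in Hadoop, `mapper` and `reducer` would be
// Mapper/Reducer implementations and the shuffle is handled by the platform.
object MapReduceSketch {
  // Mapper: (INPUT_KEY, INPUT_VALUE) => zero or more (OUTPUT_KEY, OUTPUT_VALUE) pairs.
  // Here the input key is a (notional) byte offset and the value is a line of text.
  def mapper(offset: Long, line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reducer: (INPUT_KEY, Iterable[INPUT_VALUE]) => Collection[(OUTPUT_KEY, OUTPUT_VALUE)]
  def reducer(word: String, counts: Iterable[Int]): Seq[(String, Int)] =
    Seq((word, counts.sum))

  def run(lines: Seq[String]): Map[String, Int] = {
    // Map phase: apply the mapper to every input record.
    val mapped = lines.zipWithIndex.flatMap { case (line, i) => mapper(i.toLong, line) }
    // Shuffle: group mapper output by key (done by the framework in Hadoop).
    val grouped = mapped.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }
    // Reduce phase: run the reducer once per key.
    grouped.toSeq.flatMap { case (k, vs) => reducer(k, vs) }.toMap
  }
}
```

Running `MapReduceSketch.run(Seq("to be or not", "to be"))` would yield each word mapped to its count. The key point the sketch illustrates is that the mapper and reducer are the only pieces you write; splitting the input, grouping by key, and writing the output are all the platform's job.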
MapReduce jobs provide a mechanism for distributing the processing of large quantities of data across many systems. They are often chained together to break down a larger task into individual steps. The basic MapReduce API, however, requires a substantial amount of boilerplate. It also doesn't provide much of an abstraction over the internal behavior of the system. Several efforts have been made to provide a more natural way to make use of the platform: Pig, Cascading, Scalding, and Summingbird, to name a few.

MapReduce jobs provide a very effective means of processing large amounts of data where the latency of the results isn't critical. Their use in BigData applications is pretty ubiquitous; understanding how they work provides a useful point of comparison when looking at the other, more recent, additions to the BigData scene.

Next time I'll talk about Storm and its streaming model for dealing with large amounts of data.
