A MapReduce job is structured as a series of stages:
- Input is read and split into Key/Value records. In Hadoop this responsibility is handled by the InputFormat implementation.
- Each Key/Value pair is passed to an implementation of a Mapper. The Mapper accepts a single Key/Value pair of its input types and produces zero or more Key/Value pairs of its output types. Expressed as a Scala type:
type Mapper[INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE] = (INPUT_KEY, INPUT_VALUE) => Seq[(OUTPUT_KEY, OUTPUT_VALUE)]
- The output from the mappers is grouped by key and passed to the reducers (this step is implemented by the platform)
- For each key a Reducer is run; it is passed the key and an Iterable of the values associated with that key. The Reducer, expressed as a Scala type, would look something like:
type Reducer[INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE] = (INPUT_KEY, Iterable[INPUT_VALUE]) => Seq[(OUTPUT_KEY, OUTPUT_VALUE)]
- The output from the Reducer is written to the output medium using an implementation of OutputFormat. A worked example of all of these stages follows below.
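To make the stages concrete, here is a minimal, single-JVM sketch in Scala of a word-count job. It illustrates the model rather than the Hadoop API: the input Seq stands in for what an InputFormat would produce, groupBy simulates the shuffle the platform performs, and println stands in for an OutputFormat.

object WordCountSketch {

  type Mapper[INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE] =
    (INPUT_KEY, INPUT_VALUE) => Seq[(OUTPUT_KEY, OUTPUT_VALUE)]

  type Reducer[INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE] =
    (INPUT_KEY, Iterable[INPUT_VALUE]) => Seq[(OUTPUT_KEY, OUTPUT_VALUE)]

  // Mapper: for each (lineNumber, lineText) record, emit a (word, 1) pair per word.
  val mapper: Mapper[Long, String, String, Int] =
    (_, line) => line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reducer: for each word, sum the counts grouped under it.
  val reducer: Reducer[String, Int, String, Int] =
    (word, counts) => Seq((word, counts.sum))

  def main(args: Array[String]): Unit = {
    // Stand-in for an InputFormat: (key, value) records, here (line number, line text).
    val input: Seq[(Long, String)] = Seq(
      (0L, "the quick brown fox"),
      (1L, "the lazy dog chased the fox")
    )

    // Map phase: run the mapper over every input record.
    val mapped: Seq[(String, Int)] = input.flatMap { case (k, v) => mapper(k, v) }

    // Shuffle phase (done by the platform in Hadoop): group the mapper output by key.
    val grouped: Map[String, Iterable[Int]] =
      mapped.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }

    // Reduce phase: run the reducer once per key.
    val reduced: Seq[(String, Int)] = grouped.toSeq.flatMap { case (k, vs) => reducer(k, vs) }

    // Stand-in for an OutputFormat: write the results to stdout.
    reduced.sortBy(_._1).foreach { case (word, count) => println(s"$word\t$count") }
  }
}

Running this prints each word with its count. In a real Hadoop job the mapper and reducer would instead be written against Hadoop's own Mapper and Reducer classes, and the framework would take care of splitting the input, shuffling the intermediate data and writing the output.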
MapReduce jobs provide a very effective means of processing large amounts of data where the latency of the results isn't critical. Their use in BigData applications is nearly ubiquitous; understanding how they work provides a useful point of comparison when looking at the other, more recent additions to the BigData scene.
Next time I'll talk about Storm and its streaming model for dealing with large amounts of data.