Processing Data

Continuing from the previous post

Professor X and I ended up talking about "Big Data". Over the last few years, advances in hardware (particularly in storage) have made it possible, and more importantly practical, to store larger and larger quantities of data. Our ability to process data using the same old techniques has not kept up with that curve: memory size and CPU speed have increased, but not nearly at the rate of storage.

This has led to a pair of evolutions in software systems, one changing the way we store data, and the second affecting the way we process data. This article will talk about the latter.

In the last 6 months or so, I've been exposed to 3 different systems which could be used as the underpinnings of a massively parallel architecture: Hadoop MapReduce, Apache Storm, and Akka.

These systems are interesting, both in their similarities and their differences.

A really brief overview:
  • MapReduce jobs use a large number of independent tasks called mappers to process individual pieces of data; the output from the mappers is then grouped by some key and fed to reducers, which process the set of records for a given key.
  • Storm provides an idiom where a component (a bolt) subscribes to streams of tuples from other components (bolts or spouts), and can produce streams of tuples for others to subscribe to.
  • Akka is an actor framework, providing semantics for passing messages between components (actors).
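To make the MapReduce idiom concrete, here is a toy word count in plain Python (not Hadoop's API): the mapper, the group-by-key shuffle, and the reducer are all illustrative stand-ins, but the shape mirrors a real job.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Each mapper sees one record and emits (key, value) pairs.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Each reducer sees all values for one key.
    return (word, sum(counts))

def run_job(lines):
    # Map phase: apply the mapper to each record independently.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: group the mapper output by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(run_job(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The point of the idiom is that every `mapper` call and every `reducer` call is independent, so the framework is free to run thousands of them in parallel across machines.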
The similarities

All these systems effectively break down your application into components, each performing a relatively limited task on one small piece of the data at a time. The systems also handle communication between your components, providing an abstraction over the location of those components. 

The individual components don't have to worry about parallelism, since each one only ever executes on a single thread at a time.

On to the differences:

MapReduce jobs assume that all of the data required for the job is available from the start; this allows calculations that require a view of the entire data set.

Storm assumes that the data is a never-ending stream, which allows results to reflect the current state of the data, but makes it difficult to perform calculations which depend on knowing when all the data has arrived.
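The streaming difference can be sketched with a toy stand-in for a bolt (this is illustrative Python, not Storm's API): a running word count that has no "final" answer, only a current one that updates as each tuple arrives.

```python
from collections import Counter

class WordCountBolt:
    """Toy stream processor: state reflects everything seen so far."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        # Called once per incoming tuple; emit the current count.
        self.counts[word] += 1
        return (word, self.counts[word])

bolt = WordCountBolt()
for word in ["a", "b", "a"]:
    print(bolt.execute(word))
# ('a', 1)
# ('b', 1)
# ('a', 2)
```

Contrast this with the batch model: a question like "which word was most frequent?" only has a definitive answer if you know the stream has ended, which the streaming model deliberately does not assume.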

Akka could be used to implement either of these models, or something else entirely. Of the three, it is the only one that supports two-way communication. Of course, that flexibility comes at the cost of implementing many of the details which the others abstract away.
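A minimal actor sketch makes the two-way communication point concrete. This is plain Python with threads and queues, not Akka's API: each actor owns a mailbox and handles one message at a time on its own thread (so the handler needs no locks), and the caller supplies a reply queue to receive the response.

```python
import queue
import threading

class EchoActor:
    """Toy actor: single mailbox, single processing thread."""

    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            message, reply_to = self.mailbox.get()
            if message is None:  # poison pill: stop the actor
                break
            # One message at a time, so no locking is needed here.
            reply_to.put(message.upper())

    def tell(self, message, reply_to):
        # Asynchronous send; the reply arrives on reply_to later.
        self.mailbox.put((message, reply_to))

actor = EchoActor()
replies = queue.Queue()
actor.tell("hello", replies)
print(replies.get())  # HELLO
actor.tell(None, replies)  # shut the actor down
```

Neither a mapper nor a bolt can do this: data flows one way through those systems, whereas an actor can reply to, or even start a conversation with, any other actor it holds a reference to.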

My thoughts:

In the end, I think we've started down the right path. The average developer could never keep a multi-threaded system in their head as a state machine; I know I certainly couldn't. Separating out the concern of managing interaction between threads, and providing simple, clean abstractions for the communication between tasks, is a much more natural way to approach the problem of concurrency.

Over the coming weeks I hope to post in more detail about each of these tools, and my experiences with them.
