Last time we talked about batch processing of data with Hadoop and MapReduce. This time we'll look at Storm, a system more geared towards real-time processing of continuous streams of data.
Storm is ideally suited to processing an ongoing stream of data, for example the Twitter Firehose. The data is represented as streams of tuples (objects with a fixed list of named fields of potentially different types).
An application in Storm primarily consists of the following (a minimal sketch of wiring these pieces together follows the list):
- One or more spouts, which provide a stream of tuples from an external source.
- Zero or more bolts, which subscribe to one or more streams and in turn provide zero or more streams as output.
- A topology, which describes what components (spouts and bolts) are present, the degree to which each component is parallelized, what streams each bolt is subscribed to, and how those subscriptions are distributed (or grouped) across the instances of that bolt.
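A topology is wired up in code. Here is a minimal sketch using Storm's Java API (the package is org.apache.storm in current releases; older releases used backtype.storm). The component names and the MySpout, MyTransformBolt, and MyAggregateBolt classes are hypothetical placeholders, not part of Storm itself:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout named "source", run as 2 parallel instances.
        builder.setSpout("source", new MySpout(), 2);

        // A bolt subscribed to "source"; tuples are distributed randomly
        // (shuffle grouping) across its 4 instances.
        builder.setBolt("transform", new MyTransformBolt(), 4)
               .shuffleGrouping("source");

        // A second bolt whose subscription is grouped by the "key" field, so
        // all tuples with the same key reach the same instance.
        builder.setBolt("aggregate", new MyAggregateBolt(), 4)
               .fieldsGrouping("transform", new Fields("key"));

        // Run the topology in-process for local testing.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("example", new Config(), builder.createTopology());
    }
}
```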
A Simple Example
Continuing with the Twitter Firehose example, the spout might emit each tweet as a tuple, with the tweet's list of hashtags included as one of its fields.
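A sketch of what such a spout might look like, with canned tweets standing in for the real firehose; TweetSpout and the field names are my own placeholders rather than anything provided by Storm or Twitter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Stand-in for a real firehose client: emits canned tweets so the sketch is
// self-contained. A production spout would read from the streaming API instead.
public class TweetSpout extends BaseRichSpout {
    private static final String[] SAMPLE_TWEETS = {
        "storm is neat #storm #bigdata",
        "counting hashtags in real time #storm",
        "batch vs streaming #hadoop #bigdata"
    };

    private SpoutOutputCollector collector;
    private final Random random = new Random();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String text = SAMPLE_TWEETS[random.nextInt(SAMPLE_TWEETS.length)];
        List<String> hashtags = new ArrayList<String>();
        for (String word : text.split("\\s+")) {
            if (word.startsWith("#")) {
                hashtags.add(word);
            }
        }
        // One tuple per tweet: the raw text plus its hashtags as named fields.
        collector.emit(new Values(text, hashtags));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("text", "hashtags"));
    }
}
```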
A bolt might then separate out the individual hashtags and emit a stream of tuples, one per hashtag tweeted. Since emitting the hashtags for a given tweet requires no knowledge of any other tweet, that bolt's subscription to the spout can be randomly (shuffle) grouped.
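A sketch of such a bolt, assuming the upstream tuple carries its hashtag list in a field named "hashtags" as in the spout sketch above:

```java
import java.util.List;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Stateless: any instance can handle any tweet, so the subscription to the
// spout can use shuffleGrouping.
public class HashtagSplitterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        @SuppressWarnings("unchecked")
        List<String> hashtags = (List<String>) tuple.getValueByField("hashtags");
        // Flatten the per-tweet list into one tuple per hashtag.
        for (String hashtag : hashtags) {
            collector.emit(new Values(hashtag));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag"));
    }
}
```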
A second bolt might receive the hashtags and emit a running count of the number of times each hashtag has been tweeted in the last 5 minutes. Since this bolt would need to "remember" the count for each hashtag, its subscription could be grouped by field, where the field is the hashtag. This allows the number of bolt instances to grow to handle increased load while ensuring that any given hashtag is always sent to the same instance.
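A sketch of the counting bolt, keeping per-hashtag totals in local memory and assuming the field names from the earlier sketches. For brevity it keeps an ever-growing total; a true 5-minute window would need time-bucketed counts that are periodically expired:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Stateful: the counts live in this instance's memory, so the subscription
// must use fieldsGrouping on "hashtag" to route each hashtag consistently.
public class HashtagCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String hashtag = tuple.getStringByField("hashtag");
        Long current = counts.get(hashtag);
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(hashtag, updated);
        // Emit the hashtag together with its latest count.
        collector.emit(new Values(hashtag, updated));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag", "count"));
    }
}
```

In the topology, this bolt's subscription would look something like `.fieldsGrouping("hashtags", new Fields("hashtag"))`.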
A final bolt might keep a list of the most-tweeted hashtags and store that list in a database. It would have to run as a single instance, as it would need to see all the counts emitted by the previous bolt in order to decide whether a new count changes the current top list, but its memory consumption would be limited to the top 10 hashtags and their latest counts.
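A rough sketch of that final bolt, assuming the fields emitted by the counting sketch above. It would be wired with a parallelism of 1 and a globalGrouping subscription so a single instance sees every count; the database write is left as a hypothetical stub:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Memory use stays bounded: only the current top 10 hashtags and their latest
// counts are kept. Because counts only ever grow, re-inserting each incoming
// count and evicting the smallest keeps the map at the running top 10.
public class TopHashtagsBolt extends BaseBasicBolt {
    private static final int TOP_N = 10;
    private final Map<String, Long> top = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String hashtag = tuple.getStringByField("hashtag");
        long count = tuple.getLongByField("count");
        top.put(hashtag, count);
        if (top.size() > TOP_N) {
            // Evict the hashtag with the smallest count to stay at 10 entries.
            String smallest = null;
            for (Map.Entry<String, Long> e : top.entrySet()) {
                if (smallest == null || e.getValue() < top.get(smallest)) {
                    smallest = e.getKey();
                }
            }
            top.remove(smallest);
        }
        // saveToDatabase(top);  // hypothetical persistence step
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: it writes to a store rather than emitting a stream.
    }
}
```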
Notes: This type of infrastructure would be quite scalable, and with some minor optimizations could be made more so. There are, however, some limitations. It is difficult to trigger some piece of work when a given tweet has completely finished processing; this can be done via coordination, but that can be a fairly complex beast.
Conclusions
Storm provides a great way to analyze never-ending data, particularly where the results of the analysis must be available at low latency. Looking back at my own experiences, I can see several places, particularly in fraud detection systems, where having infrastructure like Storm could have been a serious boon to our work. Storm, like Hadoop, has also been used to build abstractions which are more natural for the problem spaces where it is frequently used. Trident is built and maintained by the Storm team, and Summingbird provides a layer of abstraction over both Storm and Hadoop, allowing an appropriate algorithm to be run in both streaming and batch modes via the different infrastructures. Both Hadoop and Storm provide strengths and efficiencies where the problem fits their model, and limitations where it doesn't.
Next time we'll have a look at Akka, an actor framework written in Scala.