streaming-demo
streaming-demo copied to clipboard
A Spark Streaming demo framework that implements and improves the functions of Twitter Rainbird
streaming-demo
A Spark Streaming demo framework is a framework for streaming data counting and aggregating, like Twitter Rainbird, but has more feature compared to it.
- it has several kinds of operators, not only counter operator (like in Rainbird).
- it can process different topics of Kafka message with topic specific parser in one framework.
- it is highly configurable with XML file.
The whole architecture of streaming demo is like this:

The UML class chart is:

Users want to use this framework should configure the XML file conf/properties.xml, like clickstream examples
<applications>
<application>
<category>clickstream</category>
<parser>example.clickstream.ClickEventParser</parser>
<items>
<item>source_ip</item>
<item>dest_url</item>
<item>visit_date</item>
<item>ad_revenue</item>
<item>user_agent</item>
<item>c_code</item>
<item>l_code</item>
<item>s_keyword</item>
<item>avg_time_onsite</item>
</items>
<properties>
<property window="30" slide="10" hierarchy="/" type="count">
<key>dest_url</key>
<output>stream.framework.output.TachyonStrToLongOutput</output>
</property>
<property window="30" slide="10" hierarchy="/" type="aggregate">
<key>dest_url</key>
<value>source_ip</value>
<output>stream.framework.output.TachyonStrToStrOutput</output>
</property>
<property window="30" slide="10" hierarchy="/" type="distinct_aggregate_count">
<key>dest_url</key>
<value>source_ip</value>
<output>stream.framework.output.TachyonStrToLongOutput</output>
</property>
</properties>
</application>
</applications>
Here several application can co-exists in one applications, user can configure application specific parameter like above:
categoryis the category of Kafka messge, also use this as the topic.parseris the user defined parser class class, user should extendsAbstractParserto self-defined ones.itemsis the schema of input message, in case input message has several items, this is the name of each item.propertiesis the one you want to operate, you could specify several kinds of operators with some properties
Currently framework supports 3 operators for user:
CountOperator:CountOperatorwill count the occurrence of specifickey, like PV (page views), here are several parameters related toCountOperator,-
window: specify the timing window for Spark Streaming to collect data and calculate -
slide: specify the sliding parameter for this window, take10as example, Spark Streaming will process data in each 10 seconds for 30 seconds window data. -
hierarchy:hierarchymeans if you want to delimit data using specified delimiter, like/. As for page view analysis, each url should split like this:http://xyz.com/aaa/bbb/ccc => xyz.com/aaa/bbb/ccc xyz.com/aaa/bbb xyz.com/aaa xyz.com -
type: specify the operator you choose. -
output: specify the output class you implemented to store the output data in this kind of data.outputshould extendAbstractEventOutput.
-
AggregateOperator:AggregateOperatorwill aggregate thevaluebykeywhich you specified, and the parameters of this operator is the same asCountOperator.DistinctAggregateCountOperator:DistinctAggregateCountOperatorwill count the distinctvaluewith specifiedkey, also parameters is the same as above.- besides, user can create their own
Operatorby extendsAbstractOperator, which is easy and obvious.