streaming-demo
streaming-demo is a Spark Streaming framework for counting and aggregating streaming data, similar to Twitter Rainbird, but with more features:
- it offers several kinds of operators, not only the counter operator that Rainbird provides.
- it can process different Kafka message topics, each with its own topic-specific parser, within one framework.
- it is highly configurable through an XML file.
The overall architecture of streaming-demo looks like this:
The UML class diagram is:
Users who want to use this framework should configure the XML file conf/properties.xml, as in the clickstream example:
```xml
<applications>
  <application>
    <category>clickstream</category>
    <parser>example.clickstream.ClickEventParser</parser>
    <items>
      <item>source_ip</item>
      <item>dest_url</item>
      <item>visit_date</item>
      <item>ad_revenue</item>
      <item>user_agent</item>
      <item>c_code</item>
      <item>l_code</item>
      <item>s_keyword</item>
      <item>avg_time_onsite</item>
    </items>
    <properties>
      <property window="30" slide="10" hierarchy="/" type="count">
        <key>dest_url</key>
        <output>stream.framework.output.TachyonStrToLongOutput</output>
      </property>
      <property window="30" slide="10" hierarchy="/" type="aggregate">
        <key>dest_url</key>
        <value>source_ip</value>
        <output>stream.framework.output.TachyonStrToStrOutput</output>
      </property>
      <property window="30" slide="10" hierarchy="/" type="distinct_aggregate_count">
        <key>dest_url</key>
        <value>source_ip</value>
        <output>stream.framework.output.TachyonStrToLongOutput</output>
      </property>
    </properties>
  </application>
</applications>
```
Several `application` elements can co-exist in one `applications` element, and users can configure application-specific parameters as above:

- `category` is the category of the Kafka message, which is also used as the topic.
- `parser` is the user-defined parser class; users should extend `AbstractParser` to define their own (a minimal sketch follows this list).
- `items` is the schema of the input message; when a message contains several items, these are the names of each item.
- `properties` are the operations you want to perform; you can specify several kinds of operators, each with its own properties.
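For illustration, here is a minimal sketch of what a topic-specific parser such as `example.clickstream.ClickEventParser` might look like. The exact `AbstractParser` interface is not spelled out in this README, so the import path, method name, field delimiter, and sample record below are assumptions, not the framework's actual API.

```scala
package example.clickstream

// Package path of AbstractParser is assumed; adjust to the framework's real layout.
import stream.framework.parser.AbstractParser

class ClickEventParser extends AbstractParser {

  // Item names must match the <items> declared in conf/properties.xml.
  private val itemNames = Seq(
    "source_ip", "dest_url", "visit_date", "ad_revenue", "user_agent",
    "c_code", "l_code", "s_keyword", "avg_time_onsite")

  // Assumed contract: turn one raw Kafka message into item-name -> value pairs,
  // e.g. "10.0.0.1|http://xyz.com/aaa/bbb|2013-05-01|0.5|Mozilla|US|en|spark|30".
  def parse(line: String): Map[String, String] =
    itemNames.zip(line.split('|')).toMap
}
```

Messages arriving on the `clickstream` topic would then be routed through this parser before the configured operators are applied.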
Currently the framework supports 3 operators:

- `CountOperator`: counts the occurrences of a specific `key`, like PV (page views). Several parameters relate to `CountOperator`:
  - `window`: the time window over which Spark Streaming collects data and computes results.
  - `slide`: the sliding interval of the window; taking `10` as an example, Spark Streaming processes the 30-second window of data every 10 seconds.
  - `hierarchy`: whether to split the data on the specified delimiter, such as `/`. For page view analysis, each URL is split like this: `http://xyz.com/aaa/bbb/ccc => xyz.com/aaa/bbb/ccc xyz.com/aaa/bbb xyz.com/aaa xyz.com`
  - `type`: the operator you choose.
  - `output`: the output class you implemented to store this kind of output data; `output` should extend `AbstractEventOutput`.
- `AggregateOperator`: aggregates the `value` by the `key` you specify; its parameters are the same as `CountOperator`'s.
- `DistinctAggregateCountOperator`: counts the distinct `value`s for the specified `key`; its parameters are also the same as above.
- Besides, users can create their own `Operator` by extending `AbstractOperator`, which is straightforward (see the sketch after this list).
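As a rough sketch of how such extensions might look, a hypothetical operator that keeps the per-key maximum, plus a console output, could be written roughly as below. The actual `AbstractOperator` and `AbstractEventOutput` contracts, package paths, and method names are not documented here, so everything framework-specific in this snippet is an assumption.

```scala
// Package paths and method names are assumed, not taken from the framework.
import stream.framework.operator.AbstractOperator
import stream.framework.output.AbstractEventOutput

import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations on older Spark
import org.apache.spark.streaming.dstream.DStream

// Hypothetical operator: keeps the maximum value observed per key in the window.
class MaxOperator extends AbstractOperator {
  def process(stream: DStream[(String, Long)]): DStream[(String, Long)] =
    stream.reduceByKey((a, b) => math.max(a, b))
}

// Hypothetical output: prints results instead of writing them to Tachyon.
class ConsoleStrToLongOutput extends AbstractEventOutput {
  def output(results: DStream[(String, Long)]): Unit =
    results.print()
}
```

Presumably a new operator and output are then wired in through the `type` attribute and the `output` element in conf/properties.xml, just like the built-in ones.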