
Spark streaming support

Open · Indu-sharma opened this issue 6 years ago · 9 comments

Sparklens seems to wait for the job to complete; however, it doesn't seem to work with Spark streaming jobs. Do you have any plans to fix this?

Indu-sharma avatar Jun 13 '18 11:06 Indu-sharma

Here is one way to get it working with a streaming job. I haven't tried it with streaming myself yet. Let me know if this serves your purpose.

  1. Start your application with --packages qubole:sparklens:0.1.2-s_2.11, but don't specify the spark.extraListeners config.
  2. As part of your application, create a listener (note that this is the Notebook listener, not the JobListener) and register it:

import com.qubole.sparklens.QuboleNotebookListener

val QNL = new QuboleNotebookListener(sc.getConf)
sc.addSparkListener(QNL)

  3. Within your streaming function (whatever is called repeatedly), wrap your code in the following (see the end-to-end sketch after this list):

QNL.profileIt {
    //Your code here
}

Alternatively, if you need more control:

if (QNL.estimateSize() > QNL.getMaxDataSize()) {
  QNL.purgeJobsAndStages()
}
val startTime = System.currentTimeMillis()
// <-- your Scala code here -->
val endTime = System.currentTimeMillis()
// wait for some time to let all the events accumulate
Thread.sleep(QNL.getWaiTimeInSeconds())
println(QNL.getStats(startTime, endTime))
  4. Check out https://github.com/qubole/sparklens/blob/master/src/main/scala/com/qubole/sparklens/QuboleNotebookListener.scala for more information.
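For context, here is a minimal end-to-end sketch of how the steps above might be wired into a classic DStream job. It is an untested illustration under assumptions: the input source, batch interval, and the word-count work inside profileIt are placeholders, not anything from Sparklens itself.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.qubole.sparklens.QuboleNotebookListener

object StreamingSparklensSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sparklens-sketch")
    val ssc  = new StreamingContext(conf, Seconds(30))   // placeholder batch interval
    val sc   = ssc.sparkContext

    // Step 2: create and register the Notebook listener (not the JobListener)
    val QNL = new QuboleNotebookListener(sc.getConf)
    sc.addSparkListener(QNL)

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder input source

    lines.foreachRDD { rdd =>
      // Step 3: wrap the per-batch work so stats are reported per batch
      QNL.profileIt {
        val counts = rdd.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
        counts.take(10).foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}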

thanks!

iamrohit avatar Jun 13 '18 13:06 iamrohit

We have tried using QuboleJobListener for Structured Streaming, but as @Indu-sharma and others mentioned, it only provides a report after the streaming query is terminated, and the report covers all jobs together (not batch-wise).

But in general, since these Structured Streaming applications run continuously, users/developers will be interested in seeing stats every few batches.

A detailed proposal is attached below. Please review and provide your inputs.

Structured_streaming_sparklens.pdf
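The attached PDF has the actual proposal; purely to make the "stats every few batches" idea concrete, here is a rough sketch using Spark's public StreamingQueryListener API. The class name, the choice of k, and the fields it prints are illustrative assumptions, not what the proposal or Sparklens would actually report.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Illustrative only: print a small summary every k micro-batches instead of
// waiting for the streaming query to terminate.
class EveryKBatchesListener(k: Int) extends StreamingQueryListener {
  private var batchesSeen = 0L
  private var rowsSinceLastReport = 0L

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    batchesSeen += 1
    rowsSinceLastReport += p.numInputRows
    if (batchesSeen % k == 0) {
      println(s"[every-$k-batches] batchId=${p.batchId} " +
        s"rowsSinceLastReport=$rowsSinceLastReport " +
        s"processedRowsPerSecond=${p.processedRowsPerSecond}")
      rowsSinceLastReport = 0L
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registration, assuming an existing SparkSession named spark:
// spark.streams.addListener(new EveryKBatchesListener(k = 10))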

akumarb2010 avatar Jun 14 '18 21:06 akumarb2010

@akumarb2010: I took a look at the proposal. Yes, you are perfectly right in saying that we need to emit stats every k batches; that helps identify a streaming job's bottlenecks over time (across batches). However, I'm not sure if you are considering the sizing estimates after every emission of stats.

Indu-sharma avatar Jun 15 '18 03:06 Indu-sharma

@akumarb2010 This is a good start! I believe one of the key questions for streaming jobs would be: if my input data rate increases, can I still meet the SLA by adding more compute resources? One question: how much overlap/difference is there between SparkListener and StreamingListener? Do we need to combine information from both, or is one of them sufficient? It would be nice to capture what data is available, what needs to be aggregated and when, a bit of detail on the main computations, and how to control the frequency of output. The other concern is how much data we need to keep in memory for all this to work; we need to keep a bound, or provide a tradeoff between memory and accuracy. Once again, thanks for taking this up. CC: @itsvikramagr
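On the memory-bound concern, one simple shape for the tradeoff (a sketch with made-up names, not anything from Sparklens) is a fixed-size history that keeps only the most recent N per-batch summaries, so memory stays constant at the cost of dropping older history:

import scala.collection.mutable

// Placeholder per-batch summary; the real fields would come from the
// combined SparkListener / streaming listener data.
case class BatchSummary(batchId: Long, numInputRows: Long, processingTimeMs: Long)

// Keep only the most recent maxBatches summaries so memory stays bounded,
// trading long-window accuracy for a fixed footprint.
class BoundedBatchHistory(maxBatches: Int) {
  private val buffer = mutable.Queue.empty[BatchSummary]

  def add(summary: BatchSummary): Unit = {
    buffer.enqueue(summary)
    while (buffer.size > maxBatches) buffer.dequeue()
  }

  def recent: Seq[BatchSummary] = buffer.toSeq

  def avgProcessingTimeMs: Double =
    if (buffer.isEmpty) 0.0
    else buffer.map(_.processingTimeMs).sum.toDouble / buffer.size
}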

iamrohit avatar Jun 15 '18 04:06 iamrohit

@iamrohit, @akumarb2010 - Yes, you are right. Apart from better handling of cluster resources, the most important task is to manage the streaming pipeline. One of the key metrics to consider is batch latency and its correlation with input rows. We cannot afford to have batch latency consistently exceed the batch trigger duration, and some corrective measures are required when it does.

StreamingQueryListener is very specific to Structured Streaming and there is no overlap with SparkListener. We would need to combine information from both.
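To make the latency-vs-trigger-duration check concrete, here is a rough sketch, again with illustrative names rather than actual Sparklens code: it reads the triggerExecution duration from each Structured Streaming progress event and flags micro-batches whose processing time exceeds the configured trigger interval.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Flags micro-batches whose processing time exceeds the trigger interval,
// i.e. the situation where the query starts falling behind its input.
class LatencyVsTriggerListener(triggerIntervalMs: Long) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // durationMs is a java.util.Map[String, java.lang.Long];
    // "triggerExecution" covers the whole micro-batch.
    val processingMs = Option(p.durationMs.get("triggerExecution")).map(_.toLong).getOrElse(0L)
    if (processingMs > triggerIntervalMs) {
      println(s"WARN batch ${p.batchId}: processing took $processingMs ms, " +
        s"exceeding the $triggerIntervalMs ms trigger interval " +
        s"(inputRowsPerSecond=${p.inputRowsPerSecond})")
    }
  }
}

It could be registered on an existing SparkSession, e.g. spark.streams.addListener(new LatencyVsTriggerListener(60000L)) for a 1-minute trigger.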

itsvikramagr avatar Jun 15 '18 04:06 itsvikramagr

Awesome! Thanks for taking this up. Let's discuss more once you are ready with the PR.

iamrohit avatar Jun 18 '18 04:06 iamrohit

@iamrohit, @itsvikramagr & @Indu-sharma Thanks for your inputs.

Started working on this feature. Will update you once we are done with the initial PR.

akumarb2010 avatar Jun 19 '18 22:06 akumarb2010

@akumarb2010: Could you please share an update on this, if any? Thank you very much.

Indu-sharma avatar Sep 24 '18 01:09 Indu-sharma

@Indu-sharma @akumarb2010 You can check out our new project Streaminglens if you plan to use Sparklens for Streaming applications.

abhishekd0907 avatar Jan 27 '20 03:01 abhishekd0907