Streamz Benchmarking

Open CJ-Wright opened this issue 4 years ago • 19 comments

It would be great to have some benchmarks for streamz so we can understand our performance. The custreamz group has done some awesome work on this front. @chinmaychandak any interest in putting in a PR with your benchmarks?

CJ-Wright avatar Aug 14 '20 14:08 CJ-Wright

Sure, I would love to add some benchmarks. It’s an awesome idea to get more people interested and try out streamz/custreamz.

Any specific ideas as to how we want to do this? Do we want CPU-only benchmarks or both CPU and GPU benchmarks? Do we want just the numbers, or the actual dataset samples and the streamz processing code too?

cc: @jsmaupin, @satishvarmadandu

chinmaychandak avatar Aug 14 '20 16:08 chinmaychandak

I think in an ideal universe we could use asv, but I'm not certain if I (personally) want to go through the effort to set it up. Short of that I think a bunch of scripts that can be run and output somewhere the time it takes them to run would be helpful. Maybe we can cobble the results together into a table-like thing so they can all be compared, potentially {streamz, streamz+dask, custreamz (as appropriate)}.

I'm happy to have both streamz and custreamz benchmarks, idk how much of the streamz API is covered by custreamz so there may be some streamz benchmarks without a custreamz equivalent and that's fine.

I think it would be valuable to have the numbers (which one of us runs every once in a while) and the pieces so users can generate it on their own, since I expect the numbers to be coupled to the exact compute env being used.
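A minimal sketch of what one such script could look like, assuming a plain `time.perf_counter` harness rather than asv. The workload names (`plain_python`, `streamz_map_sink`), the input size, and the table layout are all hypothetical and only for illustration; the streamz calls used are `Stream`, `map`, `sink`, and `emit`.

```python
import statistics
import time


def time_callable(func, repeats=5):
    """Return the wall-clock durations (seconds) of `repeats` calls to func."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def format_table(results):
    """Render {name: [durations]} as a plain-text table, one row per benchmark."""
    header = f"{'benchmark':<24}{'min (s)':>12}{'median (s)':>12}"
    rows = [header, "-" * len(header)]
    for name, durations in results.items():
        rows.append(
            f"{name:<24}{min(durations):>12.6f}"
            f"{statistics.median(durations):>12.6f}"
        )
    return "\n".join(rows)


def plain_python(n=10_000):
    # Baseline: the same arithmetic without any streaming library.
    return sum(x + 1 for x in range(n))


def streamz_map_sink(n=10_000):
    # Hypothetical workload: push n integers through map -> sink.
    from streamz import Stream

    out = []
    source = Stream()
    source.map(lambda x: x + 1).sink(out.append)
    for i in range(n):
        source.emit(i)
    return sum(out)


if __name__ == "__main__":
    benchmarks = {"plain_python": plain_python}
    try:
        import streamz  # noqa: F401
        benchmarks["streamz_map_sink"] = streamz_map_sink
    except ImportError:
        pass  # streamz not installed; only the baseline is timed
    results = {name: time_callable(fn) for name, fn in benchmarks.items()}
    print(format_table(results))
```

Each additional workload (streamz+dask, custreamz) would be one more entry in `benchmarks`, so the printed table grows by one row per variant and anyone can regenerate it on their own hardware.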

CJ-Wright avatar Aug 14 '20 17:08 CJ-Wright

> I think a bunch of scripts that can be run and output somewhere the time it takes them to run would be helpful.

It would be difficult to publish the scripts and data for our existing benchmarks, since they are being used in production here at NVIDIA. But we can definitely spin up something similar with dummy data and put the numbers here; streamz + Dask vs. custreamz + Dask should be good IMO.

> how much of the streamz API is covered by custreamz

custreamz is just using streamz + Dask with RAPIDS cuDF instead of Pandas. So functionality-wise, they are equally capable. A more important point is that there are use cases where GPUs excel over CPUs (refer to the chart in the Medium blog); we need to somehow showcase that here.

> exact compute env

Currently, we are using NVIDIA Tesla T4 GPUs to benchmark custreamz. But, yes, we can use a standard AWS instance and put it alongside the benchmarks.

chinmaychandak avatar Aug 14 '20 18:08 chinmaychandak

@CJ-Wright are you aware of any open streaming datasets we could use to put together streamz vs. custreamz benchmarks? The Medium blog post uses NVIDIA prod data, as mentioned by @chinmaychandak, and I'm not sure if we can use that raw data here.

satishvarmadandu avatar Aug 14 '20 18:08 satishvarmadandu

It would be enough, I think, to run the benchmarks as one-offs at the time of each release (can github actions do something with a tag??) and update numbers in some MD file of the docs. Just being able to run any benchmark locally would be nice, so we can refer to values in a PR that seems to have a performance impact.

martindurant avatar Aug 14 '20 18:08 martindurant

btw: should we have a get together to talk about future plans for streamz in general?

martindurant avatar Aug 14 '20 18:08 martindurant

that would be great!

CJ-Wright avatar Aug 14 '20 18:08 CJ-Wright

> should we have a get together to talk about future plans for streamz in general?

This is an absolutely great idea!

chinmaychandak avatar Aug 14 '20 18:08 chinmaychandak

(sorry, try again)

martindurant avatar Aug 14 '20 18:08 martindurant

OK, so the poll thing looks really bad; may I offer the coming Monday morning, Tuesday afternoon or Wednesday morning? I am pretty flexible on time (but I am in North America, East).

martindurant avatar Aug 14 '20 18:08 martindurant

Concerning the GitHub Actions: yes, that is possible. I have done so for tsfresh using pytest-benchmark, and I have also written a small GitHub Actions plugin to compare different versions. However, we have noticed that the running time fluctuates a fair amount, as the actions run "somewhere" in the cloud and we might share resources with other runs. If we had access to dedicated hardware, that would be easier (as e.g. pandas does it). I am happy to help set up the GitHub pipeline if you want!
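To illustrate the fluctuation problem, here is a rough sketch of how two sets of timings could be compared with some slack for noise. This is not how the tsfresh setup or pytest-benchmark does it; the function names and the 25% tolerance are assumptions chosen only for the example.

```python
import statistics


def is_regression(baseline, candidate, tolerance=0.25):
    """Flag a slowdown only if the candidate's best time exceeds the
    baseline's best time by more than `tolerance` (a fraction).

    The minimum is used because noise from shared runners can only
    make a run slower, never faster.
    """
    best_base = min(baseline)
    best_cand = min(candidate)
    return best_cand > best_base * (1.0 + tolerance)


def relative_spread(durations):
    """Standard deviation divided by mean: a rough noise indicator."""
    return statistics.stdev(durations) / statistics.mean(durations)


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measured results.
    old = [1.00, 1.20, 1.10]
    new = [1.10, 1.30, 1.20]
    print("regression:", is_regression(old, new))
    print("spread of old run:", round(relative_spread(old), 3))
```

A large `relative_spread` on the baseline itself would suggest that the runner is too noisy for a tight tolerance to mean much.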

nils-braun avatar Aug 15 '20 05:08 nils-braun

> may I offer the coming Monday morning, Tuesday afternoon or Wednesday morning.

@martindurant, @CJ-Wright, apologies for the delayed response. Can we plan to meet some time next week? Please let us know if any of the below times work for you.

Tuesday: 8-8:30am PT
Tuesday: 8:30-9am PT
Wednesday: 8-8:30am PT
Wednesday: 8:30-9am PT

chinmaychandak avatar Aug 17 '20 17:08 chinmaychandak

The Tuesday times work for me.

jsmaupin avatar Aug 17 '20 18:08 jsmaupin

Would it be possible to meet this week? Sooner is better than later for me. I'm otherwise flexible for the day/time.

CJ-Wright avatar Aug 17 '20 18:08 CJ-Wright

> Would it be possible to meet this week? Sooner is better than later for me. I'm otherwise flexible for the day/time.

I totally agree, the sooner the better, but we are currently working on moving custreamz into production here, so we thought next week would be a little better for us. :)

chinmaychandak avatar Aug 17 '20 18:08 chinmaychandak

^ In that case, I agree with @chinmaychandak, since it sounds like you may have more to talk about if we wait the extra few days.

Tuesday works for me - I think I can make any of those times.

martindurant avatar Aug 17 '20 18:08 martindurant

Great, here's the invite!

streamz Meetup

Tuesday, Aug 25, 2020 8:30 am | 30 minutes | (UTC-07:00) Pacific Time (US & Canada)
Meeting number: 145 662 5639
Password: streamz_nv_meet
https://nvmeet.webex.com/nvmeet/j.php?MTID=m72d21b0fd416cc3a5690a5ccbee866e1

Join by video system
Dial [email protected]
You can also dial 173.243.2.68 and enter your meeting number.

Join by phone
+1-415-655-0003 US Toll
Access code: 145 662 5639

chinmaychandak avatar Aug 17 '20 19:08 chinmaychandak

@martindurant, @CJ-Wright Are we on track for tomorrow's meeting? Wanted to make sure everyone can make it.

chinmaychandak avatar Aug 24 '20 21:08 chinmaychandak

Will be there! We can discuss some of the PRs in flight that I've not yet had a chance to read.

martindurant avatar Aug 24 '20 22:08 martindurant