Etienne Chauchot

Results: 15 comments by Etienne Chauchot

> I took a glance at this change and it LGTM. Taking into account that this PR really improves the performance of some transforms while running it on Spark...

> I agree, that leaves room for potential new confusion. Giving this a 2nd thought I suppose you're right and `SparkDatasetRunner` is the better name with less ambiguity ... nevertheless...

@mosche: did you rebase this PR on top of the previously merged code about the Encoders? I have the impression it contains the same changes.

> oh, I remember ... you mean this one #22157? Yes, that's rebased ... but obviously this one here contains lots of changes to encoders to use encoders that are...

> ![results](https://user-images.githubusercontent.com/1401430/184098877-4972debd-4eba-4ade-a613-ace1d464a4fe.png) @aromanenko-dev I think you should also run the TPCDS suite on this PR (ask @aromanenko-dev) because when we compared the 2 Spark runners in the past we've...

> We can run it on Jenkins against this PR, if needed. @mosche did you manage to run the TPCDS suite on this PR?

I see that Nexmark queries 5 and 7 have improved quite a lot. They are mainly based on combiners and windows. Nice!

> Alternatively, you could run the load tests for combiners and GBK available in sdk/testing; they are per transform.

I agree with @kennknowles. I took a look as well, and it seems that the DStream Spark runner uses only a global watermark, updated when the microbatch ends instead of when...