Etienne Chauchot
> I took a glance at this change and it LGTM. Taking into account that this PR really improves the performance of some transforms while running it on Spark...
> I agree, that leaves room for potential new confusion. Giving this a second thought, I suppose you're right and `SparkDatasetRunner` is the better name with less ambiguity ... nevertheless...
@mosche reviewing ... cc: @aromanenko-dev
@mosche: did you rebase this PR on top of the previously merged code about the Encoders? I have the impression it contains the same changes?
> oh, I remember ... you mean this one #22157? Yes, that's rebased ... but obviously this one here contains lots of changes to encoders to use encoders that are...
> @aromanenko-dev I think you should also run the TPCDS suite on this PR (ask @aromanenko-dev) because when we compared the two Spark runners in the past we've...
> We can run it on Jenkins against this PR, if needed. @mosche did you manage to run the TPCDS suite on this PR?
I see that Nexmark queries 5 and 7 have improved quite a lot. They are mainly based on combiners and windows. Nice!
> Alternatively, you could run the load tests for combiners and GBK available in sdk/testing; they are per transform
I agree with @kennknowles. I took a look as well and it seems that the DStream Spark runner uses only a global watermark, updated when the microbatch ends instead of when...