Florian Jetter

376 comments by Florian Jetter

We've done a similar thing with DataFrames where we're tracking shuffle operations. That's much easier to do once the expression system is live

> Using scatter is generally not a good idea anymore and doesn't have any effect if the active memory manager is enabled.

That's only partially true. What doesn't have an...

> Is delayed generally better or is that incorrect?

In 9 out of 10 cases it is better. The difference between the two approaches is that scatter can take a...

I ran a couple of pyspy benchmarks on pure `pq.read_table` downloading from S3. I ran two tests, one with column projection and one with bulk reading. Both show basically the...

Sorry, I just realized that my comment is also slightly off topic. The OP discusses pure deserialization without S3 in between.

FWIW I slightly modified the above script to run each operation N times since I noticed quite some variance on my machine (M1 2020 MacBook) ```python # Create dataset import...
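The idea of running each operation N times to smooth out variance can be sketched roughly as follows. This is a minimal illustration and not the script from the comment: `run_repeated`, `summarize`, and `example_operation` are hypothetical names, and the workload is a stand-in for whatever read is being measured.

```python
import statistics
import time


def run_repeated(func, n=10):
    """Call func() n times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce the timings to a few numbers that make variance visible."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


def example_operation():
    # Hypothetical stand-in workload; replace with the operation under test.
    sum(i * i for i in range(100_000))


timings = run_repeated(example_operation, n=5)
stats = summarize(timings)
print(stats)
```

Reporting the minimum and median alongside the standard deviation makes it easier to tell a real difference between two setups from run-to-run noise.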

@jorisvandenbossche did you use the same conda-forge build for your measurements, or did you build it yourself? It would be nice to rule out any build differences.

Ok, fun experiment. I wrapped the above script in a function `run_benchmark` and ran this on my machine... ![image](https://github.com/apache/arrow/assets/8629629/a69c8e8b-d005-4b94-a224-75f01200cf4a) Looks like the simple fact that we're running this in the...

Is pyarrow using any of `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, `OPENBLAS_NUM_THREADS` to infer how large the threadpool is allowed to be? Edit: Looking at the code base, I see references and documentation...

> Yes, it seems we are using OMP_NUM_THREADS (and otherwise check std::thread::hardware_concurrency(), which I think also doesn't always give the correct number, eg in a container), see the relevant [code](https://github.com/apache/arrow/blob/37935604bf168a3b2d52f3cc5b0edf83b5783309/cpp/src/arrow/util/thread_pool.cc#L705-L721)....
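The lookup order described in the quote (use `OMP_NUM_THREADS` if set, otherwise fall back to the hardware thread count) can be approximated in Python as below. This is only a sketch of that description, not Arrow's actual implementation: `infer_pool_size` is a hypothetical helper, and `os.cpu_count()` stands in for `std::thread::hardware_concurrency()`.

```python
import os


def infer_pool_size(environ=None):
    """Return a thread pool size: OMP_NUM_THREADS if valid, else the CPU count."""
    env = os.environ if environ is None else environ
    raw = env.get("OMP_NUM_THREADS", "")
    try:
        requested = int(raw)
    except ValueError:
        # Unset or non-numeric value: ignore it and fall through.
        requested = 0
    if requested > 0:
        return requested
    # Like hardware_concurrency(), this may not reflect container CPU limits.
    return os.cpu_count() or 1


print(infer_pool_size({"OMP_NUM_THREADS": "4"}))
print(infer_pool_size({}))
```

In pyarrow itself, the resulting pool size can be inspected with `pyarrow.cpu_count()` and overridden with `pyarrow.set_cpu_count()`.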