Florian Jetter comments

Results: 376 comments by Florian Jetter

Track unknown shapes

We've done a similar thing with DataFrames, where we're tracking shuffle operations. That's much easier to do once the expression system is live.

Warn (and eventually raise) when client.scatter is used with Active Memory Manager enabled

> Using scatter is generally not a good idea anymore and doesn't have any effect if the active memory manager is enabled.

That's only partially true. What doesn't have an...

Warn (and eventually raise) when client.scatter is used with Active Memory Manager enabled

> Is delayed generally better or is that incorrect?

Nine times out of ten it is better. The difference between the two approaches is that scatter can take a...

[Python][Parquet] Parquet deserialization speeds slower on Linux

I ran a couple of py-spy benchmarks on pure `pq.read_table`, downloading from S3. I ran two tests, one with column projection and one with bulk reading. Both show basically the...

[Python][Parquet] Parquet deserialization speeds slower on Linux

Sorry, I just realized that my comment is also slightly off-topic. The OP discusses pure deserialization without S3 in between.

[Python][Parquet] Parquet deserialization speeds slower on Linux

FWIW I slightly modified the above script to run each operation N times, since I noticed quite some variance on my machine (M1 2020 MacBook):

```python
# Create dataset
import...
```
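The script referenced in this comment is truncated here, so the following is only a minimal sketch of the general idea of running an operation N times and looking at the spread rather than a single measurement. The helper name `time_repeated` and its return keys are hypothetical and are not taken from the original script.

```python
import statistics
import time


def time_repeated(func, n=10):
    """Call `func` n times and summarize the wall-clock timings.

    Hypothetical helper for illustration; not the script from the comment.
    """
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "runs": n,
        "min": min(timings),
        "mean": statistics.mean(timings),
        # stdev needs at least two samples
        "stdev": statistics.stdev(timings) if n > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in workload; in the thread this would be the Parquet read.
    print(time_repeated(lambda: sum(range(100_000)), n=5))
```

Reporting the minimum alongside the mean and standard deviation makes it easier to tell a noisy machine apart from a genuinely slower code path.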

[Python][Parquet] Parquet deserialization speeds slower on Linux

@jorisvandenbossche have you used the same conda-forge build for your measurements, or did you build it yourself? It would be nice to rule out any build differences.

[Python][Parquet] Parquet deserialization speeds slower on Linux

Ok, fun experiment. I wrapped the above script in a function `run_benchmark` and ran this on my machine...

![image](https://github.com/apache/arrow/assets/8629629/a69c8e8b-d005-4b94-a224-75f01200cf4a)

Looks like the simple fact that we're running this in the...

[Python][Parquet] Parquet deserialization speeds slower on Linux

Is pyarrow using any of `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, or `OPENBLAS_NUM_THREADS` to infer how large the thread pool is allowed to be?

Edit: Looking at the code base, I see references and documentation...

[Python][Parquet] Parquet deserialization speeds slower on Linux

> Yes, it seems we are using `OMP_NUM_THREADS` (and otherwise check `std::thread::hardware_concurrency()`, which I think also doesn't always give the correct number, e.g. in a container), see the relevant [code](https://github.com/apache/arrow/blob/37935604bf168a3b2d52f3cc5b0edf83b5783309/cpp/src/arrow/util/thread_pool.cc#L705-L721)....
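To make the lookup order described in the quote concrete, here is a rough Python sketch: prefer `OMP_NUM_THREADS` when it is set to a positive integer, otherwise fall back to the CPU count reported by the OS. This is an illustration only, not Arrow's actual C++ implementation; the helper names `infer_pool_size` and `usable_cpus` are hypothetical. The second helper shows one reason the reported count can be off in a container: the process may be restricted to fewer CPUs than the machine has.

```python
import os


def infer_pool_size(env=None):
    """Hypothetical helper mirroring the lookup order from the quote."""
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "")
    try:
        requested = int(raw)
    except ValueError:
        requested = 0  # unset or not an integer: ignore it
    if requested > 0:
        return requested
    # Fallback: logical CPUs as seen by the OS (may overcount in containers).
    return os.cpu_count() or 1


def usable_cpus():
    """CPUs this process may actually run on, where the platform exposes it."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux, not macOS/Windows
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("inferred pool size:", infer_pool_size())
    print("usable CPUs:", usable_cpus())
```

Note that CPU affinity does not capture cgroup CPU quotas, so even `usable_cpus()` can overstate what a container is allowed to use.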