Ian Rose comments

Results: 305 comments of Ian Rose

Provide `DataFrame.to_pickle`

Dask makes the (often, but not always, accurate) assumption that an "object" dtype is a string when writing parquet datasets. If you are doing tricky things like serializing dicts and...

Provide `DataFrame.to_pickle`

I should also note: if you want to just let pyarrow take a crack at serializing things, throwing schema-matching caution to the wind, you can pass `to_parquet(schema=None)`

Chunks with size literals (`"20 MiB"`) can result in significantly different chunk sizes than requested

What is the threshold at which it currently makes this choice? Is it relative or absolute? From my perspective this cleverness is absolutely not worth it, and I'd much rather...

Chunks with size literals (`"20 MiB"`) can result in significantly different chunk sizes than requested

Okay, so it's a factor of two, and relative. I'm just restating what you've already implied, but that factor of two can be extremely consequential for an NDarray, since it...
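To put rough numbers on a relative factor of two, here is a minimal arithmetic sketch. It calls no Dask code. The chunk shapes, the 8-byte element size, and the helper names `chunk_nbytes` and `ratio_to_request` are assumptions made for illustration, and the thread excerpt does not say in which direction the factor applies, so both are shown.

```python
import math

MIB = 2**20  # bytes per mebibyte


def chunk_nbytes(shape, itemsize):
    """Return the number of bytes in one chunk of the given shape."""
    return math.prod(shape) * itemsize


def ratio_to_request(shape, itemsize, requested_bytes):
    """Return the actual chunk size divided by the requested size."""
    return chunk_nbytes(shape, itemsize) / requested_bytes


requested = 20 * MIB  # the "20 MiB" literal from the issue title
itemsize = 8  # assumption: 8-byte (float64) elements

# Hypothetical 3-D chunk shapes, not taken from the issue.
for shape in [(128, 128, 160), (128, 128, 320), (128, 128, 80)]:
    size_mib = chunk_nbytes(shape, itemsize) / MIB
    ratio = ratio_to_request(shape, itemsize, requested)
    print(f"{shape}: {size_mib:.0f} MiB, {ratio:.1f}x the request")
```

Doubling or halving a single axis is enough to move a chunk from the requested 20 MiB to 40 MiB or 10 MiB, which is the scale of deviation a factor-of-two tolerance permits.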

Chunks with size literals (`"20 MiB"`) can result in significantly different chunk sizes than requested

(I also have my knives out for `normalize_chunks` so this fits in nicely)

Local scheduler parameter `chunksize`

@jakirkham is this something you are still interested in pursuing? I wasn't even aware this was an option, though I agree with @GenevieveBuckley that `chunksize` is a bit overloaded and...

Failure loading Parquet dataset with None values in partitioned column

Thanks for the report @Andreas5739738! I can confirm your issue. I believe what is happening here is that spark replaces null values in a partitioned column with the magic string...

Cannot slice a string index

Huh, strangely I'm having a hard time reproducing this (though my traceback suggests I wasn't imagining things). But with a tweak I start seeing the issue again: ```python import dask.dataframe...

Backend library dispatching for IO in Dask-Array and Dask-DataFrame

> (2) move all array-creation logic into a dedicated module - like dask.array.creation.

This seems okay to me. To clarify: most of the important array creation modules are in `dask.array.core`....

Backend library dispatching for IO in Dask-Array and Dask-DataFrame

> Yes, that is pretty much what I am proposing. However, I'd probably prefer to do it in follow-up work (to keep this PR as simple as possible).

Makes sense. I...