Ian Rose

Results: 305 comments by Ian Rose

Dask makes the (often, but not always, accurate) assumption that an "object" dtype is a string when writing parquet datasets. If you are doing tricky things like serializing dicts and...

I should also note: if you just want to let pyarrow take a crack at serializing things, throwing schema-matching caution to the wind, you can pass `to_parquet(schema=None)`.

What is the threshold at which it currently makes this choice? Is it relative or absolute? From my perspective this cleverness is absolutely not worth it, and I'd much rather...

Okay, so it's a factor of two, and relative. I'm just restating what you've already implied, but that factor of two can be extremely consequential for an N-dimensional array, since it...

(I also have my knives out for `normalize_chunks` so this fits in nicely)

@jakirkham is this something you are still interested in pursuing? I wasn't even aware this was an option, though I agree with @GenevieveBuckley that `chunksize` is a bit overloaded and...

Thanks for the report @Andreas5739738! I can confirm your issue. I believe what is happening here is that Spark replaces null values in a partitioned column with the magic string...

Huh, strangely I'm having a hard time reproducing this (though my traceback suggests I wasn't imagining things). But with a tweak I start seeing the issue again: ```python import dask.dataframe...

> (2) move all array-creation logic into a dedicated module - like dask.array.creation.

This seems okay to me. To clarify: most of the important array creation modules are in `dask.array.core`....

> Yes, that is pretty much what I am proposing. However, I'd probably prefer to do it in follow-up work (to keep this PR as simple as possible).

Makes sense. I...