Nic Crane
@westonpace Reckon this is a bug?
Here's the query plan (the dataset has a lot of columns): ``` ExecPlan with 3 nodes: 2:SinkNode{} 1:ProjectNode{projection=[SPORDER, RT, SERIALNO, PUMA, ST, ADJUST, PWGTP, AGEP, CIT, COW, DDRS, DEYE, DOUT,...
Hmm, could be an R bug or something already solved actually; I ran the following (different query, but similarly problematic in R) with pyarrow: ``` import pyarrow as pa import...
OK, this is the actual plan: ``` ExecPlan with 3 nodes: 2:ConsumingSinkNode{} 1:ProjectNode{projection=[SPORDER, RT, SERIALNO, PUMA, ST, ADJUST, PWGTP, AGEP, CIT, COW, DDRS, DEYE, DOUT, DPHY, DREM, DWRK, ENG, FER,...
It would be nice if the ConsumingSinkNode printed the values of the WriteNodeOptions so we could compare with pyarrow. But glancing at the defaults, they look the same (more or...
> How were you measuring RAM? Were you looking at the RSS of the process? Or were you looking at the amount of free/available memory? I was just looking at...
Thanks! And when you say "increase without bound", how would I know that's happening?
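One rough way to check for this, sketched below: sample the process's RSS at intervals while the write runs and see whether it only ever climbs. This is just an illustration, not anything Arrow provides. The helper names `current_rss_bytes` and `looks_unbounded` are made up for this example, the `/proc/self/statm` read assumes Linux (the fallback reports peak rather than current RSS), and the "never shrinks and grew by more than a threshold" rule is a heuristic, not a definition of unbounded growth.

```python
import os
import resource
import sys


def current_rss_bytes():
    """Return this process's resident set size in bytes.

    Assumes Linux, where /proc/self/statm reports resident pages in its
    second field. If that file is unavailable, fall back to the *peak*
    RSS from getrusage, which can only be used as an upper bound.
    """
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError):
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
        return peak if sys.platform == "darwin" else peak * 1024


def looks_unbounded(samples, min_growth_bytes):
    """Heuristic: RSS never drops between samples and total growth
    from first to last sample is at least min_growth_bytes."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and (samples[-1] - samples[0]) >= min_growth_bytes
```

In practice you would append `current_rss_bytes()` to a list every few seconds from a background thread during the dataset write, then call `looks_unbounded(samples, some_threshold)` afterwards. If RSS plateaus well below total RAM, the growth is bounded; if it keeps climbing until the process is killed, it is not.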
OK, so I've been experimenting with various combinations of this and have found that it happens with both Python and R, so it looks like a C++ issue. I'm running this...
I think we might be rehashing part of a conversation we already had a long time ago in https://github.com/apache/arrow/issues/18944#issuecomment-1377665189
I tried it with `mta_tax` which has 385 distinct values, and it also crashes. But I'd expect that, seeing as the data isn't already partitioned on that variable and it'd...