DrMaphuse
Thanks for the thorough response! I totally understand that it doesn't make sense if it adds too much complexity. The suggested workaround is neat, but it relies on schema inference,...
I can't reproduce the problem easily myself now, because we switched to another filesystem for other reasons, and that made the problem go away. I suspect, however, that it had...
> `is_unique` should only have one answer, if a column value is unique

How can I replicate the `subset=` argument of `unique()` using `is_first`, i.e. evaluate uniqueness across multiple columns?...
In my head, it would make sense to add `pl.is_first()`, analogous to `pl.sum()`. An implementation of `is_first` for the `pl.List` dtype could also give us a potential solution.
@ritchie46 Thanks, that is perfect. The original motivation for opening this issue was my feeling that implicitly omitting values based on the values of other columns is not ideal....
There was a join / select with a missing column somewhere in the middle of the script and it appears that this caused a chain of subsequent errors to pile...
@ritchie46 I actually managed to produce a minimal example, see updated OP. The issue does not appear to be related to the parquet file as originally thought.
I can confirm that this limit makes incremental updates unfeasible for larger datasets. I am trying to insert a delta of about 1GB every day, and uploading that in chunks...