New Binary/String type
The goal is to replace the current Arrow (Large)String type with a string type in which each entry is a union of either an inlined small string or an offset to a string that is allocated somewhere else.
This would avoid the terrible performance we currently have when filtering/gathering large string data, as those operations force a copy of all string bytes. Second, this type also allows string interning: a duplicate only needs to be stored once in the buffer, and we can then point to that string multiple times.
Relevant arrow discussion here: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
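
To make the idea above concrete, here is a minimal Python sketch of such a view-based layout. It is not the polars-arrow implementation. The 16-byte view, the 12-byte inline threshold, and the 4-byte prefix are assumptions modelled on the Arrow view layout from the linked discussion, and `ViewArray`, `push`, `get` and `take` are hypothetical names used only for illustration.

```python
import struct

INLINE_MAX = 12  # assumed inline threshold in bytes


class ViewArray:
    """Toy string array: fixed-size 16-byte views plus one shared data buffer."""

    def __init__(self, views=None, buffer=None, interned=None):
        self.views = [] if views is None else views
        self.buffer = bytearray() if buffer is None else buffer
        # Maps already-stored bytes to their buffer offset (string interning).
        self.interned = {} if interned is None else interned

    def push(self, s):
        data = s.encode("utf-8")
        n = len(data)
        if n <= INLINE_MAX:
            # Small string: length + bytes inlined in the view (zero padded).
            view = struct.pack("<I12s", n, data)
        else:
            offset = self.interned.get(data)
            if offset is None:
                # First occurrence: append the bytes to the buffer once.
                offset = len(self.buffer)
                self.buffer += data
                self.interned[data] = offset
            # Long string: length + 4-byte prefix + buffer index + offset.
            view = struct.pack("<I4sII", n, data[:4], 0, offset)
        self.views.append(view)

    def get(self, i):
        view = self.views[i]
        (n,) = struct.unpack_from("<I", view)
        if n <= INLINE_MAX:
            return view[4:4 + n].decode("utf-8")
        _prefix, _buf_idx, offset = struct.unpack_from("<4sII", view, 4)
        return bytes(self.buffer[offset:offset + n]).decode("utf-8")

    def take(self, indices):
        # Gather/filter only copies the 16-byte views; the data buffer is shared.
        return ViewArray([self.views[i] for i in indices], self.buffer, self.interned)
```

In this sketch, pushing the same long string twice grows `buffer` only once, and `take` returns a new array whose `buffer` is the same object as the original, so no string bytes are copied. A real implementation would use immutable, reference-counted buffers rather than a shared mutable `bytearray`.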
- Implement in polars-arrow #13243
- Use prefix in equality #13715 (prefix not used)
- Implement IPC #13464
- Implement Parquet #13489
- Implement opt-in Polars flavor of IPC (until officially supported in Arrow and all issues are resolved).
- Use the polars IPC flavor in pickle
- Use the polars IPC flavor in OOC
- Elide utf8-validation in OOC
- Optional: change avro to new type (otherwise pay conversion cost until implemented)
- Migrate polars and all compute
- Check dataframe protocol correctness.