Concatenating many dataframes and reordering string columns may fail if they are not large_string
Related arrow issue: https://issues.apache.org/jira/browse/ARROW-10799
While converting the Gaia data, the issue showed up when converting many HDF5 files to one and sorting it in one go, which led to:
~/github/apache/arrow/python/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.take()
~/github/apache/arrow/python/pyarrow/compute.py in take(data, indices, boundscheck, memory_pool)
421 """
422 options = TakeOptions(boundscheck=boundscheck)
--> 423 return call_function('take', [data, indices], options, memory_pool)
424
425
~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.call_function()
~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: offset overflow while concatenating arrays
The failure occurs in ColumnIndexed, at https://github.com/vaexio/vaex/blob/5a9fab638c832506901887ba28452d36913e630a/packages/vaex-core/vaex/column.py#L218.
To reproduce:
vaex convert @input-hdf5.txt --sort=source_id gaia-edr3-sort-by-source_id.hdf5
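For context on why the sort triggers this: Arrow's default string type stores all character data in one buffer indexed by signed 32-bit offsets, so a single combined array can hold at most 2**31 - 1 bytes of text, while large_string uses 64-bit offsets. A rough back-of-the-envelope check (the average length and row count below are illustrative, not taken from the Gaia dataset):

```python
# Arrow's default `string` type indexes its character buffer with signed
# 32-bit offsets, so one (chunk-combined) array can hold at most
# 2**31 - 1 bytes of string data; `large_string` uses 64-bit offsets.
INT32_MAX = 2**31 - 1

avg_len = 20          # hypothetical average string length in bytes
n_rows = 150_000_000  # hypothetical row count after concatenation

total_bytes = avg_len * n_rows   # 3_000_000_000 bytes
assert total_bytes > INT32_MAX   # take() must first combine the chunks,
                                 # which overflows the int32 offsets
```

The sort-based `take` has to materialize one combined array from all the chunks, which is where the offsets overflow.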
The workaround involves casting to large_string, which cannot yet be done using df.s.astype(...) (cc @Ben-Epstein):
import vaex
df = vaex.from_arrays(s=['aap', 'noot', None])
df['sl'] = df['astype(s, "large_string")']
df.schema_arrow()
s: string
sl: large_string
Hello! Any updates on this?
Is this still an issue? I think upgrading pyarrow should fix it; otherwise, use the workaround I posted.
Hey @maartenbreddels, I get the same error message while joining 2 dataframes and I'm quite sure it's the same issue.
When joining the 2 dataframes, I get the same error if the strings are long (max length = 12,000), but not if they are short (max length = 50).
I tried resolving the issue by converting both dataframes to large_string:
for k in df.columns:
    if k.startswith('__'):
        k = k[2:]
    df[k] = df[f'astype({k}, "large_string")']
but the same issue persists.
Any suggestions?
Also, do you have any idea how converting to large_string impacts runtime? If it's significant, I will need to check which columns do not need to be converted.