
Concatenating many dataframes and reordering with string columns might fail if they are not large_string

maartenbreddels opened this issue · 3 comments

Related arrow issue: https://issues.apache.org/jira/browse/ARROW-10799

While converting the Gaia data, the issue showed up when converting many HDF5 files into one and sorting it in one go, which led to:

~/github/apache/arrow/python/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.take()

~/github/apache/arrow/python/pyarrow/compute.py in take(data, indices, boundscheck, memory_pool)
    421     """
    422     options = TakeOptions(boundscheck=boundscheck)
--> 423     return call_function('take', [data, indices], options, memory_pool)
    424 
    425 

~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.call_function()

~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays

The error is raised from ColumnIndexed, at https://github.com/vaexio/vaex/blob/5a9fab638c832506901887ba28452d36913e630a/packages/vaex-core/vaex/column.py#L218.
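The underlying limitation is that Arrow's string type stores its character data behind 32-bit offsets, so operations that need to concatenate the chunks of a string column into one contiguous array (which take on a ChunkedArray does internally) overflow once the combined character data exceeds roughly 2 GiB; large_string uses 64-bit offsets and does not have this limit. A minimal illustration of the two types (the data here is tiny, only the types matter):

import pyarrow as pa

# string (utf8) carries int32 offsets, large_string carries int64 offsets
small = pa.array(['aap', 'noot', None], type=pa.string())
large = small.cast(pa.large_string())
print(small.type, large.type)  # string large_string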

To reproduce:

vaex convert @input-hdf5.txt --sort=source_id gaia-edr3-sort-by-source_id.hdf5

maartenbreddels avatar Dec 03 '20 18:12 maartenbreddels

The workaround involves casting to large_string, which cannot yet be done using df.s.astype(..), cc @Ben-Epstein:

import vaex

df = vaex.from_arrays(s=['aap', 'noot', None])
df['sl'] = df['astype(s, "large_string")']
df.schema_arrow()

which gives:

s: string
sl: large_string
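To apply this before concatenating, something along these lines should work (a sketch with illustrative names, not tested at Gaia scale):

import vaex

df1 = vaex.from_arrays(s=['aap', 'noot'])
df2 = vaex.from_arrays(s=['mies', None])

for df in (df1, df2):
    # add a 64-bit-offset copy of the column next to the original
    df['sl'] = df['astype(s, "large_string")']

combined = vaex.concat([df1, df2])
combined.schema_arrow()  # s: string, sl: large_string

Sorting or taking on the sl column then stays within 64-bit offsets.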

maartenbreddels avatar Dec 22 '21 14:12 maartenbreddels

Hello! Any updates on this?

BohdanBilonoh avatar Jul 12 '22 20:07 BohdanBilonoh

Is this still an issue? I think upgrading pyarrow should fix it; otherwise, use the workaround I posted.

maartenbreddels avatar Jul 26 '22 16:07 maartenbreddels

Hey @maartenbreddels, I get the same error message while joining two dataframes, and I'm quite sure it's the same issue.

I tried joining two dataframes: if the strings are large (max length = 12,000) I get the same error, but not if they are small (max length = 50).

I tried resolving the issue by converting all columns of both dataframes to large_string:

for k in df.columns:
    if k.startswith('__'):
        # strip vaex's hidden-column prefix so the expression uses the visible name
        k = k[2:]
    df[k] = df[f'astype({k}, "large_string")']

but the same issue persists.

Any suggestions?

Also, do you have any idea how converting to large_string impacts runtime? If it's significant, I will need to check which columns do not need to be converted.
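For instance, something like this untested sketch (assuming schema_arrow() returns a pyarrow schema) would cast only the string-typed columns:

import pyarrow as pa

schema = df.schema_arrow()
for name, dtype in zip(schema.names, schema.types):
    # only 32-bit-offset string columns need the cast
    if pa.types.is_string(dtype):
        df[name] = df[f'astype({name}, "large_string")']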

SohamTamba avatar Nov 11 '22 08:11 SohamTamba