
Error loading a large parquet file

Open peiliu0408 opened this issue 1 year ago • 13 comments

As mentioned here, I failed to load the LA_RLHF.parquet file (about 22 GB), which was downloaded from the shared OneDrive.

Error message: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2368257792.

Is there a special way, or a particular Python package, required to load this large (base64-encoded image) parquet file?

peiliu0408 avatar Nov 10 '23 07:11 peiliu0408

Hmm, I am not sure why this happens. My pandas version is 2.1.2, and I use pandas.read_parquet to open parquet files.
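For context, 2147483646 bytes is Arrow's ~2 GiB cap on a single binary array (it uses 32-bit offsets by default), so a base64 image column whose data lands in one oversized chunk overflows it. A minimal sketch of the kind of call that triggers the error, with the file path as a placeholder:

import pandas as pd

# pandas delegates to pyarrow here; a single column chunk holding more
# than ~2 GiB of base64 strings overflows Arrow's 32-bit offsets and
# raises ArrowCapacityError.
df = pd.read_parquet("LA_RLHF.parquet", engine="pyarrow")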

Luodian avatar Nov 10 '23 07:11 Luodian

The same error message shows up again. I tried on Ubuntu 20.04 and got the same error.

[Screenshot 2023-11-10 16:27:31: the same ArrowCapacityError] I am wondering whether some error happened during downloading.
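If a corrupted download is the suspicion, one way to check is to compare checksums with the uploader. A hedged sketch, assuming a reference hash could be shared (none is published in this thread):

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so the 22 GB file never has to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("LA_RLHF.parquet"))  # compare against the uploader's digest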

peiliu0408 avatar Nov 10 '23 08:11 peiliu0408

Or, could you share a download link for the LLaVA-RLHF dataset that is consistent with your OneDrive file? I could just create a new image parquet in the same format.

peiliu0408 avatar Nov 10 '23 08:11 peiliu0408

Can you try with LLAVAR or LRV? I uploaded them as well.

Luodian avatar Nov 10 '23 08:11 Luodian

LLAVA-RLHF seems correct on my side. I will update it, but the previous version should be correct. It's weird.

Luodian avatar Nov 10 '23 08:11 Luodian

Thanks a lot.

peiliu0408 avatar Nov 13 '23 01:11 peiliu0408

> Can you try with LLAVAR or LRV? I uploaded them as well.

These two parquet files open correctly.

peiliu0408 avatar Nov 13 '23 01:11 peiliu0408

> LLAVA-RLHF seems correct on my side. I will update it, but the previous version should be correct. It's weird.

I am sure that the LLaVA-RLHF file shared on OneDrive is damaged, while all the rest load correctly.

peiliu0408 avatar Nov 13 '23 08:11 peiliu0408

I am uploading an updated LA_RLHF.parquet file from our server (this one supposedly works correctly for our runs) to OneDrive. It may take a few hours; stay tuned, maybe tomorrow. Thanks!

Luodian avatar Nov 13 '23 08:11 Luodian

> I am uploading an updated LA_RLHF.parquet file from our server (this one supposedly works correctly for our runs) to OneDrive. It may take a few hours; stay tuned, maybe tomorrow. Thanks!

Thanks a lot.

peiliu0408 avatar Nov 13 '23 11:11 peiliu0408

> The same error message shows up again. I tried on Ubuntu 20.04 and got the same error.
>
> [Screenshot 2023-11-10 16:27:31: the same ArrowCapacityError] I am wondering whether some error happened during downloading.

Still not correct

311dada avatar Nov 24 '23 15:11 311dada

You need to use dask:


import dask.dataframe as dd
import json
import pandas as pd

# Load the JSON data (image id -> base64 string)
json_file_path = "LA.json"
with open(json_file_path, "r") as f:
    data_dict = json.load(f)

# Convert the dictionary to a Dask DataFrame; splitting into partitions
# keeps each underlying Arrow array well under the ~2 GiB limit
ddf = dd.from_pandas(pd.DataFrame.from_dict(data_dict, orient="index", columns=["base64"]), npartitions=10)

# Convert to Parquet (one piece per partition)
parquet_file_path = "LA.parquet"
ddf.to_parquet(parquet_file_path, engine="pyarrow")

# Read back lazily and look up a single image by index
ddf = dd.read_parquet(parquet_file_path, engine="pyarrow")
search_value = "LA_IMG_000000377944"
filtered_ddf = ddf.loc[search_value].compute()

This solved the problem for me.
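(The repartitioning is what matters here: with npartitions=10 the data is written as several smaller Parquet pieces, so reading it back never has to materialize a single base64 column chunk larger than Arrow's ~2 GiB limit.)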

tensorboy avatar Nov 25 '23 00:11 tensorboy

> As mentioned here, I failed to load the LA_RLHF.parquet file (about 22 GB), which was downloaded from the shared OneDrive.
>
> Error message: pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2368257792.
>
> Is there a special way, or a particular Python package, required to load this large (base64-encoded image) parquet file?

You can check the current code to see if it helps; we changed it to load iteratively: https://github.com/Luodian/Otter/blob/3a746889cb0d774659f67a49c874100b226c9c94/pipeline/mimicit_utils/mimicit_dataset.py#L222-L229
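For illustration, a minimal sketch of the iterative approach with pyarrow (a simplification, not the exact code behind the link above; the path and batch size are placeholders):

import pandas as pd
import pyarrow.parquet as pq

# Stream the file in row batches instead of materializing one giant
# Arrow table, so no single column chunk hits the ~2 GiB cap.
parquet_file = pq.ParquetFile("LA_RLHF.parquet")
frames = [batch.to_pandas() for batch in parquet_file.iter_batches(batch_size=1000)]
df = pd.concat(frames)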

Previously, on both my 2*A100 and 8*A100 instances, I could directly load >100 GB parquet files. But it's weird that I can't do it on another 8*A100-40G instance...

Luodian avatar Dec 10 '23 13:12 Luodian