google-cloud-python

Error when trying to upload a DataFrame with an Arrow-backed list of large_strings

Open · cvm-a opened this issue 5 months ago · 2 comments

pandas-gbq version 0.29.1, pyarrow version 20.0.0

Steps to reproduce

Try to upload a DataFrame with a column of PyArrow-backed lists of large strings (large_string is needed because the total data in the column exceeds 2 GiB).

Code example

import pandas as pd
import pyarrow as pa
from google.cloud import bigquery as gbq

client = gbq.Client(
    project=<project_id>,
    credentials=<credentials>,
    location=<location>,
)
# With Arrow-backed series of lists of strings, pandas fails to perform joins once
# the total memory usage of the column exceeds 2 GiB, so the column has to be typed
# as list<large_string> for those pandas operations to complete.
df = pd.DataFrame({
    "x": pa.array(
        [["some_string"]] * 200_000_000, pa.list_(pa.large_string())
    ).to_pandas(types_mapper=pd.ArrowDtype)
})
ljob = client.load_table_from_dataframe(df, 'temporary_tables.large_stringlist')

Stack trace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<path>/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 2838, in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
  File "<path>/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 722, in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<path>/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 665, in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
  File "<path>/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 377, in bq_to_arrow_array
    return pyarrow.ListArray.from_pandas(series, type=arrow_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 1226, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 124, in pyarrow.lib._handle_arrow_array_protocol
  File "pyarrow/array.pxi", line 1102, in pyarrow.lib.Array.cast
  File "<path>/lib/python3.11/site-packages/pyarrow/compute.py", line 410, in cast
    return call_function("cast", [arr], options, memory_pool)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_compute.pyx", line 612, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 407, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed casting from large_string to string: input array too large

The problem is that the bq_to_arrow_array function unnecessarily tries to cast the list<large_string> column to list<string>.
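
The failing cast can be reproduced with pyarrow alone, independent of BigQuery. A minimal sketch (the array is just illustrative; any list<large_string> array whose string data exceeds 2 GiB should trigger it):

import pyarrow as pa

# Build a list<large_string> array whose string child holds more than 2 GiB,
# then attempt the same cast that bq_to_arrow_array performs.
arr = pa.array([["some_string"]] * 200_000_000, type=pa.list_(pa.large_string()))
arr.cast(pa.list_(pa.string()))  # raises ArrowInvalid: input array too large

Since the pandas column is already backed by list<large_string>, the load path could pass the existing Arrow data through instead of narrowing it to string.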

cvm-a commented Jul 15 '25

If you have > 2 GB of data in a single string, you won't be able to write it to BigQuery anyway. The row size limit is 100 MB.

tswast commented Aug 02 '25

Marking this as a feature request for large_string support.

tswast commented Aug 02 '25
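
Until large_string is supported, one possible workaround is to bypass dataframe_to_parquet entirely: serialize the DataFrame to Parquet yourself and load the file directly, which should sidestep the in-memory cast. An untested sketch, reusing df and the destination table from the example above (the local file name is arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery as gbq

client = gbq.Client()  # same project/credentials/location as above

# Convert the DataFrame to an Arrow table without narrowing large_string,
# write it to a local Parquet file, and load that file instead of the DataFrame.
table = pa.Table.from_pandas(df)
pq.write_table(table, "large_stringlist.parquet")

job_config = gbq.LoadJobConfig(source_format=gbq.SourceFormat.PARQUET)
with open("large_stringlist.parquet", "rb") as f:
    job = client.load_table_from_file(
        f, "temporary_tables.large_stringlist", job_config=job_config
    )
job.result()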