cudf
cudf copied to clipboard
[FEA] Increase maximum characters in strings columns
In libcudf, strings columns have child columns containing character data and offsets, and the offsets child column uses a 32-bit signed size type. This limits strings columns to containing ~2.1 billion characters. For LLM training, documents have up to 1M characters, and a median around 3K characters. Due to the size type limit, LLM training pipelines have to carefully batch the data down to a few thousand rows to stay comfortably within the size type limit. We have a general issue open to explore a 64-bit size type in libcudf (#13159). For size issues with LLM training pipelines, we should consider a targeted change to only address the size limit for strings columns.
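To make the limit concrete, here is a quick back-of-the-envelope calculation (my own sketch, not figures from the issue) showing how few median-sized documents fit under the 32-bit offset limit:

```python
# Rough capacity check: how many ~3K-character documents fit in one strings column?
INT32_MAX = 2**31 - 1        # largest byte offset representable by the signed 32-bit size type
median_doc_chars = 3_000     # median document length cited above

print(f"{INT32_MAX // median_doc_chars:,} documents per column")
# -> roughly 715,000 documents before the offsets overflow
```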
Requirements
- We must maintain or improve throughput for functions processing strings columns with <2.1 billion characters. This requirement prevents us from using 64-bit offsets for all strings columns. It does not prevent us from using 64-bit offsets for strings columns with >2.1 billion characters.
- We must not introduce a new data type or otherwise increase compile times significantly. This requirement prevents us from dispatching between "strings" types and "large strings" types.
Proposed solution
One idea that satisfies these requirements would be to represent the character data as an int64-typed column instead of an int8-typed column. This would allow us to store 8x more bytes of character data. To access the character bytes, we would use an offset-normalizing iterator (inspired by the "indexalator") to identify byte positions using an int64 iterator output. Please note that the 32-bit size type limit on row count would still apply to the proposed "large strings" columns.
We should also consider an "unbounded" character data allocation that is not typed, but rather a single buffer up to 2^64 bytes in size. The 64-bit offset type would be able to index into much larger allocations.
Please note that this solution will not impact the offsets for list columns. We believe that the best design to allow for more than 2.1B elements in lists will be to use 64-bit size type in libcudf as discussed in #13159.
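For intuition on the "8x more bytes" claim, here is a small capacity comparison (my own arithmetic, bounded in both cases by the 32-bit element count on the chars child column):

```python
# Capacity of the chars child column under the 32-bit size type,
# depending on the element type used to store character data.
SIZE_TYPE_MAX = 2**31 - 1                 # max element count per column

int8_chars_bytes = SIZE_TYPE_MAX * 1      # one byte per element   -> ~2.1 GB
int64_chars_bytes = SIZE_TYPE_MAX * 8     # eight bytes per element -> ~17.2 GB

print(f"int8 chars column:  {int8_chars_bytes / 1e9:.1f} GB of character data")
print(f"int64 chars column: {int64_chars_bytes / 1e9:.1f} GB of character data")
# The untyped "unbounded" buffer would remove even this bound (up to 2^64 bytes).
```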
Creating strings columns
Strings column factories would choose child column types at column creation time, based on the size of the character data: int32 offsets with int8 character data, or int64 offsets with int64 character data. This change would impact the strings column factories, as well as algorithms that use strings column utilities or generate their own offsets buffers. Any function that calls make_offsets_child_column will need to be aware of the alternate child column types for large strings.
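A minimal Python sketch of the selection a factory might make (hypothetical helper; numpy dtypes stand in for the libcudf types):

```python
import numpy as np

def choose_child_dtypes(total_char_bytes: int):
    """Hypothetical sketch: pick offset/char dtypes from the size of the character data."""
    if total_char_bytes <= np.iinfo(np.int32).max:
        return np.int32, np.int8     # regular strings column
    return np.int64, np.int64        # "large strings" column

print(choose_child_dtypes(1_000_000))      # int32 offsets, int8 chars
print(choose_child_dtypes(3_000_000_000))  # int64 offsets, int64 chars
```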
Accessing strings data
The offset-normalizing iterator would always return int64 values so that strings column consumers would not need to support both int32 and int64 offset types. See cudf::detail::sizes_to_offsets_iterator for an example of how an iterator operating on int32 data can output int64 data.
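A toy illustration of the idea in Python (my own sketch, not libcudf code): consumers see 64-bit byte positions regardless of how the offsets are physically stored.

```python
import numpy as np

def offsetalator(offsets: np.ndarray):
    """Yield byte offsets as int64 regardless of the stored offset dtype,
    so consumers never need to branch on int32 vs int64 offsets."""
    for off in offsets:
        yield np.int64(off)

small = np.array([0, 5, 12], dtype=np.int32)   # regular strings offsets
large = np.array([0, 5, 12], dtype=np.int64)   # large strings offsets
assert list(offsetalator(small)) == list(offsetalator(large))
```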
Interoperability with Arrow
The new strings column variant, with int64 offsets and int64 character data, may already be Arrow-compatible. This requires more testing and some changes to our Arrow interop utilities.
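For context, Arrow's large_string type already stores character data with 64-bit offsets, so a natural interop target exists (a quick pyarrow illustration, not from the issue):

```python
import pyarrow as pa

small = pa.array(["abc", "de"])           # Arrow "string": int32 offsets
large = small.cast(pa.large_string())     # Arrow "large_string": int64 offsets
print(small.type, large.type)             # string large_string
# The values round-trip unchanged; only the offsets buffer widens to 64-bit.
assert small.to_pylist() == large.to_pylist()
```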
Part 1: libcudf changes to support large strings columns
Definitions:
- "strings column": int8 character data and int32 offset data (2.1B characters)
- "large strings column": int8 character data up to 2^64 bytes and int64 offset data (18400T characters)
Step | PR | Notes |
---|---|---|
Replace `offset_type` references with `size_type` | ✅ #13788 | offsets generated by the offset-normalizing iterator will have type `int64_t` |
`cudf::column_view`, `cudf::mutable_column_view` and `cudf::column_device_view` | ❌ #14031 | solution for character counts greater than int32 |
Create an offset-normalizing iterator over character data that always outputs 64-bit offsets | ✅ #14206 ✅ #14234 | First step in #14043 |
Add the character data buffer to the parent strings column, rather than as a child column; refactor algorithms such as concat, contiguous split and gather which access character data; update code in cuDF-python and cudf-java that interacts with character child columns | ✅ #14202 | See performance blocker resolved in ✅ #14540 |
Deprecate unneeded factories and use strings column factories consistently | ✅ | #14461, #14771, #14695, #14612, +one more |
Introduce an environment variable to control the threshold for converting to 64-bit indices, to enable testing on smaller strings columns | ✅ `LIBCUDF_LARGE_STRINGS_THRESHOLD` added | part of #14612 |
Transition strings APIs to use the offset-normalizing iterator ("offsetalator") | ✅ | See #14611, #14700, #14744, #14745, #14757, #14783, #14824 |
Remove references to `strings_column_view::offsets_begin()` in libcudf since it hardcodes the return type as int32 | ✅ | See #15112, #15077 |
Remove references to `create_chars_child_column` in libcudf since it wraps a column around chars data | ✅ | |
Change the current `make_strings_children` to return a uvector for chars instead of a column | ✅ | See #15171 |
Introduce an environment variable `LIBCUDF_LARGE_STRINGS_ENABLED` to let users force libcudf to throw rather than start using 64-bit offsets, to allow try-catch-repartitioning instead | ✅ | |
Introduce an environment variable `LIBCUDF_LARGE_STRINGS_THRESHOLD` | ✅ | |
Rework concatenate to produce large strings when `LIBCUDF_LARGE_STRINGS_ENABLED` is set and the character count is above `LIBCUDF_LARGE_STRINGS_THRESHOLD` | ✅ | See #15195 |
cuDF-python testing: use concat to create a large string column. We should be able to operate on this column as long as we aren't creating a large string. Can we: (1) call APIs that return int/bool results, like re_contains, (2) slice, (3) call APIs that return smaller strings columns? | 🔄 | |
Add an experimental version of `make_strings_children` that generates 64-bit offsets when the total character length exceeds the threshold | 🔄 | See #15363 |
Add a large strings test fixture that stores large columns between unit tests and controls the environment variables | 🔄 | |
Add the experimental namespace to call sites of `make_strings_children`, ensuring that: (1) cudf tests pass with `LIBCUDF_LARGE_STRINGS_THRESHOLD` at zero, (2) benchmark regressions are analyzed and approved, (3) Spark-RAPIDS tests pass | | |
Live session with cuDF-python expert to start producing and operating on large strings | | |
Ensure that we can interop strings columns with 64-bit offsets to Arrow as LARGE_STRING type | | Also see #15093 about large_strings compatibility for pandas-2.2 |
Part 2: cuIO changes to read and write large strings columns
Step | PR | Notes |
---|---|---|
Add functionality to JSON reader to construct large string columns | | Could require building a chunked JSON reader |
Add functionality to Parquet reader to construct large string columns | | |
to be continued...
👏 praise: Great descriptive issue!
💡 suggestion: I think the table in the description could be a GH task list so each "step" can be made into an issue.
❓ question:

> "large strings column": int64 character data and int64 offset data

int64 data with int64 offsets sounds larger than:

> "unbounded strings column": int8 character data up to 2^64 bytes and int64 offset data

And this sounds bounded, but large. Not boundless. Can you clarify or give an example of the structure?
This feature would be a great help for us. We use cudf::table as CPU-memory storage to enable zero-copy. In theory, CPU memory has more capacity to hold larger tables, but we constantly run into the character limits.
Following up on @gaohao95's comment, our use case is storing TPC-H data in a cudf::table. At scale factor 100, the l_comment and ps_comment string columns overflow the 32-bit offset.
Now that #15195 is merged, I did some testing via cuDF-python
>>> df2['char'].str.slice(0,2)
OK
>>> df2['char'].str.contains(0,2)
OK
>>> df2['char'].str.contains('p', regex=True)
OK
>>> df2['char'].nunique()
57
# add a column, df2['c'] = 1
>>> df2.groupby('char').sum()['c'].reset_index()
OK
>>> df2['char'] + 'b'
OverflowError: CUDF failure at: /nfs/repo/cudf24.06/cpp/include/cudf/strings/
detail/strings_children.cuh:82: Size of output exceeds the column size limit
>>> df2['char'].str.upper()
RuntimeError: THRUST_INDEX_TYPE_DISPATCH 64-bit count is unsupported in libcudf
So far so good! Many APIs can successfully consume large strings, and only concat can produce them for now. 🎉
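For anyone who wants to reproduce this kind of testing, one rough way to build a large strings column today is via concat (my own sketch, not the exact setup used above; assumes large strings are enabled and the GPU has a few GB of free memory):

```python
import os
os.environ["LIBCUDF_LARGE_STRINGS_ENABLED"] = "1"  # set before importing cudf to be safe

import cudf

# Concatenate until the character data exceeds the ~2.1B (int32) limit,
# which should produce a column with 64-bit offsets.
base = cudf.Series(["x" * 1000]).repeat(1_000_000)          # ~1B characters
big = cudf.concat([base, base, base], ignore_index=True)    # ~3B characters
print(len(big), big.str.len().sum())
```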
This is incredibly exciting! More than any individual string operation, one of the most common pain points I see in workflows is the inability to bring strings along as a payload during joins (now that concat works):
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[11], line 13
6 N = 6000
8 df1 = cudf.DataFrame({
9 "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
10 "key": [0, 1]*N
11 })
---> 13 res = df1.merge(df1, on="key")
14 print(f"{res.val_x.str.len().sum():,} characters in string column")
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116
...
File copying.pyx:151, in cudf._lib.copying.gather()
File copying.pyx:34, in cudf._lib.pylibcudf.copying.gather()
File copying.pyx:66, in cudf._lib.pylibcudf.copying.gather()
OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
If I only have numeric data, this works smoothly as the output dataframe is only 72M rows.
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": [10, 100]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{len(res):,} rows in dataframe")
72,000,000 rows in dataframe
I'd love to be able to complete this (contrived) example, because I think it's representative of something we see often: this limit causing failures in workflows where users expect things to work smoothly.
As a reference, the self-join works with N=5000, as it only ends up with 1.7B total characters (the arithmetic is spelled out after the output below).
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 5000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
env: LIBCUDF_LARGE_STRINGS_ENABLED=1
1,675,000,000 characters in string column
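The character counts follow directly from the join cardinality (a quick arithmetic check, mine):

```python
# Each key (0 and 1) appears N times, so the self-join yields N*N rows per key.
# Key-0 rows carry the 29-character string, key-1 rows the 38-character string.
for N in (5_000, 6_000):
    rows = 2 * N * N
    chars = N * N * (29 + 38)
    print(f"N={N}: {rows:,} rows, {chars:,} characters")
# N=5,000: 50,000,000 rows, 1,675,000,000 characters (fits under the ~2.1B limit)
# N=6,000: 72,000,000 rows, 2,412,000,000 characters (overflows it)
```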
The issue described here https://github.com/rapidsai/cudf/issues/13733#issuecomment-2079656314 should be fixed with https://github.com/rapidsai/cudf/pull/15621
And now the issue described here https://github.com/rapidsai/cudf/issues/13733#issuecomment-2060290747 should be fixed with #15721
CSV: This uses factories that are already enabled. Still needs testing.
Gave this a quick spin -- the CSV reader itself works but we fail ~~if we need to copy a libcudf column/table from device to host (e.g., if we call `.head()`)~~ if we need to go down the `to_pandas` codepath for device-to-host copies.
So I'd say:
- CSV reader appears to work
- `to_pandas` device-to-host transfer fails (will file an issue), while `to_arrow` works (which makes sense given "part 3" above)
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
N = int(5e7)
df = cudf.DataFrame({
"val": ["this is a short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
df.to_csv("large_string_df.csv", chunksize=1000000, index=False)
del df
df = cudf.read_csv("large_string_df.csv")
print(len(df))
print(df.iloc[0])
100000000
val key
0 this is a short string 0
df.head()
---------------------------------------------------------------------------
ArrowException Traceback (most recent call last)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1973, in DataFrame.__repr__(self)
1970 @_cudf_nvtx_annotate
1971 def __repr__(self):
1972 output = self._get_renderable_dataframe()
-> 1973 return self._clean_renderable_dataframe(output)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1835, in DataFrame._clean_renderable_dataframe(self, output)
1832 else:
1833 width = None
-> 1835 output = output.to_pandas().to_string(
1836 max_rows=max_rows,
1837 min_rows=min_rows,
1838 max_cols=max_cols,
1839 line_width=width,
1840 max_colwidth=max_colwidth,
1841 show_dimensions=show_dimensions,
1842 )
1844 lines = output.split("\n")
1846 if lines[-1].startswith("["):
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5324, in DataFrame.to_pandas(self, nullable, arrow_type)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
-> 5324 out_data = {
5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5325, in <dictcomp>(.0)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
5324 out_data = {
-> 5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/string.py:5802, in StringColumn.to_pandas(self, index, nullable, arrow_type)
5800 return pd.Series(pandas_array, copy=False, index=index)
5801 else:
-> 5802 return super().to_pandas(index=index, nullable=nullable)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/column.py:215, in ColumnBase.to_pandas(self, index, nullable, arrow_type)
211 return pd.Series(
212 pd.arrays.ArrowExtensionArray(pa_array), index=index
213 )
214 else:
--> 215 pd_series = pa_array.to_pandas()
217 if index is not None:
218 pd_series.index = index
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1517, in pyarrow.lib.Array._to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1916, in pyarrow.lib._array_like_to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowException: Unknown error: Wrapping
From a local run of our Python test suite with large strings enabled, it seems like:
- Enabling this doesn't break things when we've got < 2.1B characters
- We don't test scenarios with >= 2.1B characters except when we intentionally trigger and catch a large string OverflowError (which we correctly fail with this turned on), so passing doesn't give a strong signal about readiness
@GregoryKimball @vyasr @davidwendt, what do you think is the right balance of exhaustiveness vs. practicality? I don't think we want to add a large string scenario to every Python unit test.
(Running the test suite with large strings enabled in https://github.com/rapidsai/cudf/pull/15932 so folks can review more easily)
I thought about this a little bit today, but haven't come up with any very concrete guidelines yet. At a high level, a reasonable balance would be testing each API that supports large strings once, so that we skip exhaustive sweeps over parameters but still exercise all of the code paths at least once from Python and ensure that all of the basic logic around dtype handling works. Perhaps @davidwendt also has some ideas around common failure modes for large strings, or properties of the large strings data type that are different from normal strings. If there are particular dimensions along which we expect that type to differ, then we would benefit from emphasizing testing along them. I'll keep thinking though.
Finding the right balance has been a challenge. I've tried to isolate the large-strings decision logic to a few utilities and use those as much as possible, hoping that only a few tests would be needed to catch all the common usages. Of course, this coverage requires internal knowledge of how the APIs are implemented. For libcudf, I created a set of large strings unit tests here: https://github.com/rapidsai/cudf/tree/branch-24.08/cpp/tests/large_strings that should cover the known usages. Additional coverage from the Python layer is welcome, of course. It would be good to consider the data size requirements (and scope) as well as runtime, so as not to overload the CI systems.
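To make the Python-side coverage concrete, a test fixture along these lines might be enough to exercise the large-strings path on small data (purely a sketch with hypothetical names; the actual fixture and the exact environment-variable semantics are defined in the PRs above, and whether libcudf re-reads the variables at call time is an assumption here):

```python
import pytest

@pytest.fixture
def force_large_strings(monkeypatch):
    """Hypothetical fixture: push even tiny columns down the 64-bit offsets path."""
    monkeypatch.setenv("LIBCUDF_LARGE_STRINGS_ENABLED", "1")
    monkeypatch.setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", "0")
    yield

def test_concat_roundtrip_with_large_strings(force_large_strings):
    import cudf
    s = cudf.Series(["abc", "de"])
    result = cudf.concat([s, s], ignore_index=True)
    assert result.to_arrow().to_pylist() == ["abc", "de", "abc", "de"]
```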
Congratulations! Starting in 24.08, large strings are enabled by default in libcudf and cudf! 🎉🎉🎉🎉
We will track further development in separate issues.