cudf
cudf copied to clipboard
[FEA] Increase maximum characters in strings columns
In libcudf, strings columns have child columns containing character data and offsets, and the offsets child column uses a 32-bit signed size type. This limits strings columns to containing ~2.1 billion characters. For LLM training, documents have up to 1M characters, and a median around 3K characters. Due to the size type limit, LLM training pipelines have to carefully batch the data down to a few thousand rows to stay comfortably within the size type limit. We have a general issue open to explore a 64-bit size type in libcudf (#13159). For size issues with LLM training pipelines, we should consider a targeted change to only address the size limit for strings columns.
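To make the limit concrete, here is a quick back-of-the-envelope calculation (my own sketch, not figures from the issue) showing how few median-sized documents fit under the 32-bit offset limit:

```python
# Rough capacity check: how many ~3K-character documents fit in one strings column?
INT32_MAX = 2**31 - 1        # largest byte offset representable by the signed 32-bit size type
median_doc_chars = 3_000     # median document length cited above

print(f"{INT32_MAX // median_doc_chars:,} documents per column")
# -> roughly 715,000 documents before the offsets overflow
```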
Requirements
- We must maintain or improve throughput for functions processing strings columns with <2.1 billion characters. This requirement prevents us from using 64-bit offsets for all strings columns. It does not prevent us from using 64-bit offsets for strings columns with >2.1 billion characters.
- We must not introduce a new data type or otherwise increase compile times significantly. This requirement prevents us from dispatching between "strings" types and "large strings" types.
Proposed solution
One idea that satisfies these requirements would be to represent the character data as an int64-typed column instead of an int8-typed column. This would allow us to store 8x more bytes of character data. To access the character bytes, we would use an offset-normalizing iterator (inspired by the "indexalator") to identify byte positions using an int64 iterator output. Please note that the 32-bit size type limit on row count would still apply to the proposed "large strings" columns.
We should also consider an "unbounded" character data allocation that is not typed, but rather a single buffer up to 2^64 bytes in size. The 64-bit offset type would be able to index into much larger allocations.
Please note that this solution will not impact the offsets for list columns. We believe that the best design to allow for more than 2.1B elements in lists will be to use 64-bit size type in libcudf as discussed in #13159.
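For intuition on the "8x more bytes" claim, here is a small capacity comparison (my own arithmetic, bounded in both cases by the 32-bit element count on the chars child column):

```python
# Capacity of the chars child column under the 32-bit size type,
# depending on the element type used to store character data.
SIZE_TYPE_MAX = 2**31 - 1                 # max element count per column

int8_chars_bytes = SIZE_TYPE_MAX * 1      # one byte per element   -> ~2.1 GB
int64_chars_bytes = SIZE_TYPE_MAX * 8     # eight bytes per element -> ~17.2 GB

print(f"int8 chars column:  {int8_chars_bytes / 1e9:.1f} GB of character data")
print(f"int64 chars column: {int64_chars_bytes / 1e9:.1f} GB of character data")
# The untyped "unbounded" buffer would remove even this bound (up to 2^64 bytes).
```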
Creating strings columns
Strings column factories would choose child column types at column creation time, based on the size of the character data: int32 offsets with int8 character data, or int64 offsets with int64 character data. This change would impact the strings column factories, as well as algorithms that use strings column utilities or generate their own offsets buffers. Any function that calls make_offsets_child_column will need to be aware of the alternate child column types for large strings.
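A minimal Python sketch of the selection a factory might make (hypothetical helper; numpy dtypes stand in for the libcudf types):

```python
import numpy as np

def choose_child_dtypes(total_char_bytes: int):
    """Hypothetical sketch: pick offset/char dtypes from the size of the character data."""
    if total_char_bytes <= np.iinfo(np.int32).max:
        return np.int32, np.int8     # regular strings column
    return np.int64, np.int64        # "large strings" column

print(choose_child_dtypes(1_000_000))      # int32 offsets, int8 chars
print(choose_child_dtypes(3_000_000_000))  # int64 offsets, int64 chars
```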
Accessing strings data
The offset-normalizing iterator would always return int64 values so that strings column consumers would not need to support both int32 and int64 offset types. See cudf::detail::sizes_to_offsets_iterator for an example of how an iterator operating on int32 data can output int64 data.
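A toy illustration of the idea in Python (my own sketch, not libcudf code): consumers see 64-bit byte positions regardless of how the offsets are physically stored.

```python
import numpy as np

def offsetalator(offsets: np.ndarray):
    """Yield byte offsets as int64 regardless of the stored offset dtype,
    so consumers never need to branch on int32 vs int64 offsets."""
    for off in offsets:
        yield np.int64(off)

small = np.array([0, 5, 12], dtype=np.int32)   # regular strings offsets
large = np.array([0, 5, 12], dtype=np.int64)   # large strings offsets
assert list(offsetalator(small)) == list(offsetalator(large))
```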
Interoperability with Arrow
The new strings column variant, with int64 offsets and int64 character data, may already be Arrow-compatible. This requires more testing and some changes to our Arrow interop utilities.
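For context, Arrow's large_string type already stores character data with 64-bit offsets, so a natural interop target exists (a quick pyarrow illustration, not from the issue):

```python
import pyarrow as pa

small = pa.array(["abc", "de"])           # Arrow "string": int32 offsets
large = small.cast(pa.large_string())     # Arrow "large_string": int64 offsets
print(small.type, large.type)             # string large_string
# The values round-trip unchanged; only the offsets buffer widens to 64-bit.
assert small.to_pylist() == large.to_pylist()
```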
Part 1: libcudf changes to support large strings columns
Definitions:
- "strings column": int8 character data and int32 offset data (2.1B characters)
- "large strings column": int8 character data up to 2^64 bytes and int64 offset data (18400T characters)
Step | PR | Notes |
---|---|---|
Replace `offset_type` references with `size_type` | ✅ #13788 | offsets generated by the offset-normalizing iterator will have type `int64_t` |
`cudf::column_view`, `cudf::mutable_column_view` and `cudf::column_device_view` | ❌ #14031 | solution for character counts greater than int32 |
Create an offset-normalizing iterator over character data that always outputs 64-bit offsets | ✅ #14206 ✅ #14234 | First step in #14043 |
Add the character data buffer to the parent strings column, rather than as a child column; refactor algorithms such as concat, contiguous split and gather which access character data; update code in cuDF-python and cudf-java that interacts with character child columns | ✅ #14202 | See performance blocker resolved in ✅ #14540 |
Deprecate unneeded factories and use strings column factories consistently | ✅ | #14461, #14771, #14695, #14612, +one more |
Introduce an environment variable to control the threshold for converting to 64-bit indices, to enable testing on smaller strings columns | ✅ `LIBCUDF_LARGE_STRINGS_THRESHOLD` added | part of #14612 |
Transition strings APIs to use the offset-normalizing iterator ("offsetalator") | ✅ | See #14611, #14700, #14744, #14745, #14757, #14783, #14824 |
Remove references to `strings_column_view::offsets_begin()` in libcudf since it hardcodes the return type as int32 | ✅ | See #15112, #15077 |
Remove references to `create_chars_child_column` in libcudf since it wraps a column around chars data | ✅ | |
Change the current `make_strings_children` to return a uvector for chars instead of a column | ✅ | See #15171 |
Introduce an environment variable `LIBCUDF_LARGE_STRINGS_ENABLED` to let users force libcudf to throw rather than start using 64-bit offsets, to allow try-catch-repartitioning instead | ✅ | |
Introduce an environment variable `LIBCUDF_LARGE_STRINGS_THRESHOLD` | ✅ | |
Rework concatenate to produce large strings when `LIBCUDF_LARGE_STRINGS_ENABLED` is set and the character count is above `LIBCUDF_LARGE_STRINGS_THRESHOLD` | ✅ | See #15195 |
cuDF-python testing: use concat to create a large string column. We should be able to operate on this column as long as we aren't creating a large string. Can we: (1) call APIs that return int/bool results, like re_contains, (2) slice, (3) call APIs that return smaller strings columns? | 🔄 | |
Add an experimental version of `make_strings_children` that generates 64-bit offsets when the total character length exceeds the threshold | 🔄 | See #15363 |
Add a large strings test fixture that stores large columns between unit tests and controls the environment variables | 🔄 | |
Add the experimental namespace to call sites of `make_strings_children`, ensuring that: (1) cudf tests pass with `LIBCUDF_LARGE_STRINGS_THRESHOLD` at zero, (2) benchmark regressions are analyzed and approved, (3) Spark-RAPIDS tests pass | | |
Live session with cuDF-python expert to start producing and operating on large strings | | |
Ensure that we can interop strings columns with 64-bit offsets to Arrow as LARGE_STRING type | | Also see #15093 about large_strings compatibility for pandas-2.2 |
Part 2: cuIO changes to read and write large strings columns
Step | PR | Notes |
---|---|---|
Add functionality to JSON reader to construct large string columns | | Could require building a chunked JSON reader |
Add functionality to Parquet reader to construct large string columns | | |
to be continued...
👏 praise: Great descriptive issue!
💡 suggestion: I think the table in the description could be a GH task list so each "step" can be made into an issue.
❓ question:

> "large strings column": int64 character data and int64 offset data

int64 data with int64 offsets sounds larger than:

> "unbounded strings column": int8 character data up to 2^64 bytes and int64 offset data

And this sounds bounded, but large. Not boundless. Can you clarify or give an example of the structure?
This feature would be a great help for us. We use cudf::table as CPU-memory storage to enable zero-copy. In theory, CPU memory has more capacity to hold larger tables, but we constantly run into the character limits.
Following up on @gaohao95's comment, our use case is storing TPC-H data in a cudf::table. At scale factor 100, the l_comment and ps_comment string columns overflow the 32-bit offset.
Now that #15195 is merged, I did some testing via cuDF-python
>>> df2['char'].str.slice(0,2)
OK
>>> df2['char'].str.contains(0,2)
OK
>>> df2['char'].str.contains('p', regex=True)
OK
>>> df2['char'].nunique()
57
# add a column, df2['c'] = 1
>>> df2.groupby('char').sum()['c'].reset_index()
OK
>>> df2['char'] + 'b'
OverflowError: CUDF failure at: /nfs/repo/cudf24.06/cpp/include/cudf/strings/
detail/strings_children.cuh:82: Size of output exceeds the column size limit
>>> df2['char'].str.upper()
RuntimeError: THRUST_INDEX_TYPE_DISPATCH 64-bit count is unsupported in libcudf
So far so good! Many APIs can successfully consume large strings, and only concat can produce them for now. 🎉
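For anyone who wants to reproduce this kind of testing, one rough way to build a large strings column today is via concat (my own sketch, not the exact setup used above; assumes large strings are enabled and the GPU has a few GB of free memory):

```python
import os
os.environ["LIBCUDF_LARGE_STRINGS_ENABLED"] = "1"  # set before importing cudf to be safe

import cudf

# Concatenate until the character data exceeds the ~2.1B (int32) limit,
# which should produce a column with 64-bit offsets.
base = cudf.Series(["x" * 1000]).repeat(1_000_000)          # ~1B characters
big = cudf.concat([base, base, base], ignore_index=True)    # ~3B characters
print(len(big), big.str.len().sum())
```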
This is incredibly exciting! More than any individual string operation, one of the most common pain points I see in workflows is the inability to bring strings along as a payload during joins (now that concat works):
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[11], line 13
6 N = 6000
8 df1 = cudf.DataFrame({
9 "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
10 "key": [0, 1]*N
11 })
---> 13 res = df1.merge(df1, on="key")
14 print(f"{res.val_x.str.len().sum():,} characters in string column")
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116
...
File copying.pyx:151, in cudf._lib.copying.gather()
File copying.pyx:34, in cudf._lib.pylibcudf.copying.gather()
File copying.pyx:66, in cudf._lib.pylibcudf.copying.gather()
OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
If I only have numeric data, this works smoothly as the output dataframe is only 72M rows.
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": [10, 100]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{len(res):,} rows in dataframe")
72,000,000 rows in dataframe
I'd love to be able to complete this (contrived) example, because I think it's representative of something we see often: this limit causing failures in workflows where users expect things to work smoothly.
As a reference, the self-join works with N=5000, as it only ends up with 1.7B total characters (the arithmetic is spelled out after the output below).
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 5000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
env: LIBCUDF_LARGE_STRINGS_ENABLED=1
1,675,000,000 characters in string column
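The character counts follow directly from the join cardinality (a quick arithmetic check, mine):

```python
# Each key (0 and 1) appears N times, so the self-join yields N*N rows per key.
# Key-0 rows carry the 29-character string, key-1 rows the 38-character string.
for N in (5_000, 6_000):
    rows = 2 * N * N
    chars = N * N * (29 + 38)
    print(f"N={N}: {rows:,} rows, {chars:,} characters")
# N=5,000: 50,000,000 rows, 1,675,000,000 characters (fits under the ~2.1B limit)
# N=6,000: 72,000,000 rows, 2,412,000,000 characters (overflows it)
```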
The issue described here https://github.com/rapidsai/cudf/issues/13733#issuecomment-2079656314 should be fixed with https://github.com/rapidsai/cudf/pull/15621
And now the issue described here https://github.com/rapidsai/cudf/issues/13733#issuecomment-2060290747 should be fixed with #15721
CSV: This uses factories that are already enabled. Still needs testing.
Gave this a quick spin -- the CSV reader itself works but we fail ~~if we need to copy a libcudf column/table from device to host (e.g., if we call `.head()`)~~ if we need to go down the `to_pandas` codepath for device-to-host copies.
So I'd say:
- CSV reader appears to work
- `to_pandas` device-to-host transfer fails (will file an issue), while `to_arrow` works (which makes sense given "part 3" above)
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
N = int(5e7)
df = cudf.DataFrame({
"val": ["this is a short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
df.to_csv("large_string_df.csv", chunksize=1000000, index=False)
del df
df = cudf.read_csv("large_string_df.csv")
print(len(df))
print(df.iloc[0])
100000000
val key
0 this is a short string 0
df.head()
---------------------------------------------------------------------------
ArrowException Traceback (most recent call last)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1973, in DataFrame.__repr__(self)
1970 @_cudf_nvtx_annotate
1971 def __repr__(self):
1972 output = self._get_renderable_dataframe()
-> 1973 return self._clean_renderable_dataframe(output)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1835, in DataFrame._clean_renderable_dataframe(self, output)
1832 else:
1833 width = None
-> 1835 output = output.to_pandas().to_string(
1836 max_rows=max_rows,
1837 min_rows=min_rows,
1838 max_cols=max_cols,
1839 line_width=width,
1840 max_colwidth=max_colwidth,
1841 show_dimensions=show_dimensions,
1842 )
1844 lines = output.split("\n")
1846 if lines[-1].startswith("["):
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5324, in DataFrame.to_pandas(self, nullable, arrow_type)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
-> 5324 out_data = {
5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5325, in <dictcomp>(.0)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
5324 out_data = {
-> 5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/string.py:5802, in StringColumn.to_pandas(self, index, nullable, arrow_type)
5800 return pd.Series(pandas_array, copy=False, index=index)
5801 else:
-> 5802 return super().to_pandas(index=index, nullable=nullable)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/column.py:215, in ColumnBase.to_pandas(self, index, nullable, arrow_type)
211 return pd.Series(
212 pd.arrays.ArrowExtensionArray(pa_array), index=index
213 )
214 else:
--> 215 pd_series = pa_array.to_pandas()
217 if index is not None:
218 pd_series.index = index
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1517, in pyarrow.lib.Array._to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1916, in pyarrow.lib._array_like_to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowException: Unknown error: Wrapping
From a local run of our Python test suite with large strings enabled, it seems like:
- Enabling this doesn't break things when we've got < 2.1B characters
- We don't test scenarios with >= 2.1B characters except when we intentionally trigger and catch a large string OverflowError (which we correctly fail with this turned on), so passing doesn't give a strong signal about readiness
@GregoryKimball @vyasr @davidwendt, what do you think is the right balance of exhaustiveness vs. practicality? I don't think we want to add a large string scenario to every Python unit test.
(Running the test suite with large strings enabled in https://github.com/rapidsai/cudf/pull/15932 so folks can review more easily)
I thought about this a little bit today, but haven't come up with any very concrete guidelines yet. At a high level, a reasonable balance would be testing each API that supports large strings once, so that we skip exhaustive sweeps over parameters but still exercise all of the code paths at least once from Python and ensure that all of the basic logic around dtype handling works. Perhaps @davidwendt also has some ideas around common failure modes for large strings, or properties of the large strings data type that are different from normal strings. If there are particular dimensions along which we expect that type to differ, then we would benefit from emphasizing testing along them. I'll keep thinking though.
Finding the right balance has been a challenge. I've tried to isolate the large-strings decision logic to a few utilities and use those as much as possible, hoping that only a few tests would be needed to catch all the common usages. Of course, this coverage requires internal knowledge of how the APIs are implemented. For libcudf, I created a set of large strings unit tests here: https://github.com/rapidsai/cudf/tree/branch-24.08/cpp/tests/large_strings that should cover the known usages. Additional coverage from the Python layer is welcome, of course. It would be good to consider the data size requirements (and scope) as well as runtime, so as not to overload the CI systems.
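To make the Python-side coverage concrete, a test fixture along these lines might be enough to exercise the large-strings path on small data (purely a sketch with hypothetical names; the actual fixture and the exact environment-variable semantics are defined in the PRs above, and whether libcudf re-reads the variables at call time is an assumption here):

```python
import pytest

@pytest.fixture
def force_large_strings(monkeypatch):
    """Hypothetical fixture: push even tiny columns down the 64-bit offsets path."""
    monkeypatch.setenv("LIBCUDF_LARGE_STRINGS_ENABLED", "1")
    monkeypatch.setenv("LIBCUDF_LARGE_STRINGS_THRESHOLD", "0")
    yield

def test_concat_roundtrip_with_large_strings(force_large_strings):
    import cudf
    s = cudf.Series(["abc", "de"])
    result = cudf.concat([s, s], ignore_index=True)
    assert result.to_arrow().to_pylist() == ["abc", "de", "abc", "de"]
```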
Congratulations! Starting in 24.08, large strings are enabled by default in libcudf and cudf! 🎉🎉🎉🎉
We will track further development in separate issues.