Gregory Kimball
**Is your feature request related to a problem? Please describe.** Based on testing and discussion in #9395 and follow-on testing of `cudf.DataFrame.to_orc`, I propose that we remove the "experimental" warning...
**Is your feature request related to a problem? Please describe.** Fuzz testing support for nested struct columns in ORC is incomplete. Some patches were required, as shown in the #9395 discussion...
**Is your feature request related to a problem? Please describe.** The current implementation of [multibyte_split](https://github.com/rapidsai/cudf/blob/e2ff00f665472a477f7c5b90bed8045c9d0d40a4/cpp/include/cudf/io/text/multibyte_split.hpp#L70) supports a byte range input for reading limited portions of large files. However, even...
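As a rough illustration of the byte-range semantics behind `multibyte_split` (a toy pure-Python sketch, not the libcudf implementation): each range claims exactly the records whose first byte falls inside it, so processing every range once reproduces the whole file with no duplicated or dropped records.

```python
def split_byte_range(data: bytes, delim: bytes, offset: int, size: int) -> list[bytes]:
    """Toy sketch: return the delimited records whose first byte lies in
    [offset, offset + size). A record that merely *ends* inside the range
    belongs to the previous range, mirroring the byte-range convention
    described in the issue."""
    end = offset + size
    records = []
    i = 0
    while i < len(data):
        j = data.find(delim, i)
        # A record runs up to and including its delimiter (or to EOF).
        stop = len(data) if j < 0 else j + len(delim)
        if offset <= i < end:  # record starts inside this byte range
            records.append(data[i:stop])
        i = stop
    return records
```

Because ownership is decided by the record's first byte, two adjacent ranges partition the records cleanly even when a record straddles the range boundary.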
**Is your feature request related to a problem? Please describe.** While benchmarking cuDF-python, I noticed that [bench_isin](https://github.com/rapidsai/cudf/blob/65a782112f4b76941483adf17f9a30a6824f6164/python/cudf/benchmarks/API/bench_dataframe.py#L50) has low end-to-end data throughput (
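The end-to-end throughput figure discussed here is just bytes processed over wall time; a minimal stdlib sketch of that metric (a hedged stand-in, not the cuDF benchmark harness) looks like:

```python
import time

def measure_throughput(fn, payload_bytes: int) -> float:
    """Toy throughput metric: run fn once and report bytes per second
    of wall time, the same shape of number the benchmark exposes."""
    t0 = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - t0
    return payload_bytes / elapsed
```

A low number from this metric means the operation's wall time is large relative to the data it touches, which is what the issue flags for `bench_isin`.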
**Is your feature request related to a problem? Please describe.** The Parquet format in Apache Spark supports many compression codecs ([link](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration)), including: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd....
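Snappy, zstd, and the other Parquet codecs need third-party packages, but the size trade-off they represent can be illustrated with Python's stdlib codecs (a toy stand-in, not how Spark or cuDF wires codecs into Parquet pages):

```python
import bz2
import gzip
import lzma

def compressed_sizes(payload: bytes) -> dict[str, int]:
    """Toy comparison of codec output sizes on one payload; 'none'
    mirrors Parquet's uncompressed option."""
    return {
        "none": len(payload),
        "gzip": len(gzip.compress(payload)),
        "bz2": len(bz2.compress(payload)),
        "lzma": len(lzma.compress(payload)),
    }

sizes = compressed_sizes(b"abc" * 10_000)
```

On repetitive data like this, every codec beats `none` by a wide margin; the codecs differ mainly in where they sit on the ratio-versus-speed curve.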
**Describe the bug** In the libcudf benchmarks `PARQUET_READER_NVBENCH`, the STRUCT data type shows surprisingly high `peak_memory_usage`. For a 536 MB table, the INTEGRAL data type shows a 597 MiB peak...
**Is your feature request related to a problem? Please describe.** The `BM_parquet_read_chunks` benchmark in `benchmarks/io/parquet/parquet_reader_input.cpp` includes a `byte_limit` nvbench axis. This axis controls the `chunk_read_limit`. With the new features added...
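A toy analogue of what a `chunk_read_limit` controls (pure Python, not the libcudf chunked reader): group incoming records so that each emitted chunk stays at or below a byte budget, with a single oversized record still forming its own chunk.

```python
def chunked_read(records, chunk_read_limit: int):
    """Toy sketch of byte-limited chunking: yield lists of records whose
    combined size is <= chunk_read_limit (one record may exceed it alone)."""
    chunk, used = [], 0
    for rec in records:
        if chunk and used + len(rec) > chunk_read_limit:
            yield chunk  # budget exceeded: flush the current chunk
            chunk, used = [], 0
        chunk.append(rec)
        used += len(rec)
    if chunk:
        yield chunk
```

Sweeping the limit, as the nvbench axis does, trades peak memory per chunk against the number of chunks produced.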
Parquet V1 format supports three types of page encodings: PLAIN, DICTIONARY, and RLE (run-length encoded) ([reference from Spark Jira](https://issues.apache.org/jira/browse/SPARK-36879)). The newer and evolving Parquet V2 specification adds support for several...
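The DICTIONARY and RLE encodings named above can be sketched in a few lines of pure Python (a toy illustration of the ideas, not the Parquet on-disk format): dictionary encoding replaces repeated values with small integer indices, and RLE then collapses runs of equal indices.

```python
def dictionary_encode(values):
    """Toy DICTIONARY encoding: map each distinct value to an index
    into a per-page dictionary."""
    dictionary, indices, lookup = [], [], {}
    for v in values:
        if v not in lookup:
            lookup[v] = len(dictionary)
            dictionary.append(v)
        indices.append(lookup[v])
    return dictionary, indices

def rle_encode(indices):
    """Toy run-length encoding: collapse runs of equal indices into
    (value, run_length) pairs."""
    runs = []
    for i in indices:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return [(v, n) for v, n in runs]
```

Low-cardinality columns compress well under this pair because long runs of the same dictionary index shrink to a single pair, which is why encoding choice matters for reader and writer performance.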
-- this is a draft, please do not comment yet -- The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source....
During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included: * Rare failure with page size estimator (PQ writer, [Report](https://github.com/rapidsai/cudf/issues/13250),...