Issues authored by Gregory Kimball (42 results)

**Is your feature request related to a problem? Please describe.** Based on testing and discussion in #9395 and follow-on testing of `cudf.DataFrame.to_orc`, I propose that we remove the "experimental" warning...

code quality
cuIO
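
The change proposed here concerns the Python writer's warning; a minimal sketch of the `to_orc` round trip it refers to, assuming a hypothetical local path `example.orc`:

```python
import cudf

# Small illustrative frame; any mixed-type table exercises the same path.
df = cudf.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "score": [0.5, 1.5, 2.5]})

# Write with the GPU ORC writer, then read back and compare.
df.to_orc("example.orc")
assert cudf.read_orc("example.orc").equals(df)
```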

**Is your feature request related to a problem? Please describe.** Fuzz testing support for nested struct columns in ORC is incomplete. Some patches were required, as shown in the #9395 discussion...

feature request
code quality
cuDF (Python)
cuIO
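
A minimal sketch of the nested-struct round trip such fuzz tests would exercise; the fixed values stand in for the randomized inputs a fuzzer would generate, and the file name is hypothetical:

```python
import cudf

# Struct column nested inside another struct; a fuzzer would randomize
# field names, nesting depth, value types, and null placement.
df = cudf.DataFrame(
    {
        "rec": [
            {"a": 1, "inner": {"x": 1.5, "y": "p"}},
            {"a": 2, "inner": {"x": 2.5, "y": "q"}},
        ]
    }
)

df.to_orc("nested_struct.orc")
roundtrip = cudf.read_orc("nested_struct.orc")
print(roundtrip.dtypes)  # expect a StructDtype mirroring the input schema
```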

**Is your feature request related to a problem? Please describe.** The current implementation of [multibyte_split](https://github.com/rapidsai/cudf/blob/e2ff00f665472a477f7c5b90bed8045c9d0d40a4/cpp/include/cudf/io/text/multibyte_split.hpp#L70) supports a byte-range input for reading limited portions of large files. However, even...

feature request
cuIO
Performance
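
`multibyte_split` is the libcudf C++ entry point; a rough sketch of the same byte-range idea through its Python wrapper, `cudf.read_text`, assuming a hypothetical newline-delimited input file:

```python
import cudf

# Read only a 1 MiB window of a large delimited file. Rows are attributed
# to the byte range in which they begin, so adjacent, non-overlapping
# ranges should cover the file exactly once.
offset, size = 0, 1 << 20
chunk = cudf.read_text("large_input.txt", delimiter="\n", byte_range=(offset, size))
print(len(chunk))
```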

**Is your feature request related to a problem? Please describe.** While benchmarking cuDF-python, I noticed that [bench_isin](https://github.com/rapidsai/cudf/blob/65a782112f4b76941483adf17f9a30a6824f6164/python/cudf/benchmarks/API/bench_dataframe.py#L50) has low end-to-end data throughput (...

cuDF (Python)
Performance
inactive-30d
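
A standalone sketch of the measurement behind this observation: time `DataFrame.isin` on a moderately sized frame and derive an approximate end-to-end throughput (sizes are illustrative):

```python
import time
import cupy as cp
import cudf

nrows = 10_000_000
df = cudf.DataFrame({"a": cp.random.randint(0, 1000, nrows),
                     "b": cp.random.randint(0, 1000, nrows)})
values = list(range(0, 1000, 7))

cp.cuda.runtime.deviceSynchronize()
start = time.perf_counter()
mask = df.isin(values)
cp.cuda.runtime.deviceSynchronize()  # wait for GPU work before stopping the clock
elapsed = time.perf_counter() - start

nbytes = int(df.memory_usage().sum())
print(f"{nbytes / elapsed / 1e9:.2f} GB/s")
```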

**Is your feature request related to a problem? Please describe.** The Parquet format in Apache Spark supports many compression codecs ([link](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration)), including: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd....

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
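
For reference, a small sketch of the codecs the cuDF Python writer currently exposes; the accepted strings vary by release, and "snappy" plus "ZSTD" are the ones assumed here:

```python
import os
import numpy as np
import cudf

df = cudf.DataFrame({"x": np.random.randint(0, 100, 1_000_000),
                     "y": np.random.rand(1_000_000)})

# Compare on-disk sizes across the writer's compression options.
for codec in [None, "snappy", "ZSTD"]:
    path = f"data_{codec}.parquet"
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```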

**Describe the bug** In the libcudf benchmarks `PARQUET_READER_NVBENCH`, the STRUCT data type shows surprisingly high `peak_memory_usage`. For a 536 MB table, the INTEGRAL data type shows a 597 MiB peak...

bug
0 - Backlog
libcudf
cuIO
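
A rough Python-side way to reproduce the comparison, assuming RMM's statistics adaptor as a stand-in for the nvbench `peak_memory_usage` counter and a hypothetical input file:

```python
import rmm
import cudf

# Route cuDF's allocations through a statistics adaptor so the peak
# device-memory footprint of the read can be inspected afterwards.
base = rmm.mr.CudaMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)

table = cudf.read_parquet("struct_table.parquet")
print(stats.allocation_counts)  # current/peak/total allocation counters
```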

**Is your feature request related to a problem? Please describe.** The `BM_parquet_read_chunks` benchmark in `benchmarks/io/parquet/parquet_reader_input.cpp` includes a `byte_limit` nvbench axis. This axis controls the `chunk_read_limit`. With the new features added...

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
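
`chunk_read_limit` is a knob on the libcudf C++ chunked reader and is not exposed directly in Python; as a loose analogue of reading a file in bounded pieces, one can iterate over row groups (the file name is hypothetical, and this does not exercise the same code path as the benchmark):

```python
import pyarrow.parquet as pq
import cudf

path = "large_table.parquet"
num_row_groups = pq.ParquetFile(path).num_row_groups

# Read one row group at a time instead of the whole file at once,
# bounding the working set to roughly one row group per step.
for rg in range(num_row_groups):
    piece = cudf.read_parquet(path, row_groups=[rg])
    _ = len(piece)  # placeholder for real per-chunk processing
```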

The Parquet V1 format supports three page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding) ([reference from the Spark Jira](https://issues.apache.org/jira/browse/SPARK-36879)). The newer and evolving Parquet V2 specification adds support for several...

feature request
2 - In Progress
libcudf
cuIO
helps: Spark
improvement
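
A quick way to see which page encodings a cuDF-written file actually contains is to inspect the footer metadata with pyarrow (file and column names are illustrative):

```python
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"category": ["a", "b", "a", "c"] * 1000})
df.to_parquet("encodings.parquet")

meta = pq.ParquetFile("encodings.parquet").metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, col.encodings)
```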

-- this is a draft, please do not comment yet -- The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source....

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
helps: Python
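
One simple way to put a number on the end-to-end throughput in question: time a full-file read and divide by the on-disk size (the path is a placeholder, and the figure includes decompression and decoding, not just I/O):

```python
import os
import time
import cudf

path = "large_table.parquet"

start = time.perf_counter()
df = cudf.read_parquet(path)
elapsed = time.perf_counter() - start

file_bytes = os.path.getsize(path)
print(f"read {file_bytes / 1e9:.2f} GB in {elapsed:.2f} s "
      f"({file_bytes / elapsed / 1e9:.2f} GB/s end-to-end)")
```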

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:
* Rare failure with page size estimator (PQ writer, [Report](https://github.com/rapidsai/cudf/issues/13250),...

feature request
0 - Backlog
tests
libcudf
cuIO
helps: Spark
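
The kind of write-then-verify check implied here, sketched at the Python level; a real test matrix would sweep data types, row counts, and writer options rather than this single fixed table:

```python
import numpy as np
import cudf

n = 100_000
df = cudf.DataFrame({
    "ints": np.random.randint(-1_000, 1_000, n),
    "floats": np.random.rand(n),
    "strings": [f"s{i % 10}" for i in range(n)],
})

# Round-trip through both GPU writers and confirm the data survives intact.
df.to_parquet("check.parquet")
assert cudf.read_parquet("check.parquet").equals(df)

df.to_orc("check.orc")
assert cudf.read_orc("check.orc").equals(df)
```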