Issues authored by Gregory Kimball (42 results)

**Is your feature request related to a problem? Please describe.** Based on testing and discussion in #9395 and follow-on testing of `cudf.DataFrame.to_orc`, I propose that we remove the "experimental" warning...

code quality
cuIO
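
The change proposed here concerns the Python writer's warning; a minimal sketch of the `to_orc` round trip it refers to, assuming a hypothetical local path `example.orc`:

```python
import cudf

# Small illustrative frame; any mixed-type table exercises the same path.
df = cudf.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"], "score": [0.5, 1.5, 2.5]})

# Write with the GPU ORC writer, then read back and compare.
df.to_orc("example.orc")
assert cudf.read_orc("example.orc").equals(df)
```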

**Is your feature request related to a problem? Please describe.** Fuzz testing support for nested struct columns in ORC is incomplete. Some patches were required, as shown in the #9395 discussion...

feature request
code quality
cuDF (Python)
cuIO
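
A minimal sketch of the nested-struct round trip such fuzz tests would exercise; the fixed values stand in for the randomized inputs a fuzzer would generate, and the file name is hypothetical:

```python
import cudf

# Struct column nested inside another struct; a fuzzer would randomize
# field names, nesting depth, value types, and null placement.
df = cudf.DataFrame(
    {
        "rec": [
            {"a": 1, "inner": {"x": 1.5, "y": "p"}},
            {"a": 2, "inner": {"x": 2.5, "y": "q"}},
        ]
    }
)

df.to_orc("nested_struct.orc")
roundtrip = cudf.read_orc("nested_struct.orc")
print(roundtrip.dtypes)  # expect a StructDtype mirroring the input schema
```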

**Is your feature request related to a problem? Please describe.** The current implementation of [multibyte_split](https://github.com/rapidsai/cudf/blob/e2ff00f665472a477f7c5b90bed8045c9d0d40a4/cpp/include/cudf/io/text/multibyte_split.hpp#L70) supports a byte-range input for reading limited portions of large files. However, even...

feature request
cuIO
Performance
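
`multibyte_split` is the libcudf C++ entry point; a rough sketch of the same byte-range idea through its Python wrapper, `cudf.read_text`, assuming a hypothetical newline-delimited input file:

```python
import cudf

# Read only a 1 MiB window of a large delimited file. Rows are attributed
# to the byte range in which they begin, so adjacent, non-overlapping
# ranges should cover the file exactly once.
offset, size = 0, 1 << 20
chunk = cudf.read_text("large_input.txt", delimiter="\n", byte_range=(offset, size))
print(len(chunk))
```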

**Is your feature request related to a problem? Please describe.** While benchmarking cuDF-python, I noticed that [bench_isin](https://github.com/rapidsai/cudf/blob/65a782112f4b76941483adf17f9a30a6824f6164/python/cudf/benchmarks/API/bench_dataframe.py#L50) has low end-to-end data throughput (...

cuDF (Python)
Performance
inactive-30d
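
A standalone sketch of the measurement behind this observation: time `DataFrame.isin` on a moderately sized frame and derive an approximate end-to-end throughput (sizes are illustrative):

```python
import time
import cupy as cp
import cudf

nrows = 10_000_000
df = cudf.DataFrame({"a": cp.random.randint(0, 1000, nrows),
                     "b": cp.random.randint(0, 1000, nrows)})
values = list(range(0, 1000, 7))

cp.cuda.runtime.deviceSynchronize()
start = time.perf_counter()
mask = df.isin(values)
cp.cuda.runtime.deviceSynchronize()  # wait for GPU work before stopping the clock
elapsed = time.perf_counter() - start

nbytes = int(df.memory_usage().sum())
print(f"{nbytes / elapsed / 1e9:.2f} GB/s")
```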

**Is your feature request related to a problem? Please describe.** The Parquet format in Apache Spark supports many compression codecs ([link](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration)), including: none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd....

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
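
For reference, a small sketch of the codecs the cuDF Python writer currently exposes; the accepted strings vary by release, and "snappy" plus "ZSTD" are the ones assumed here:

```python
import os
import numpy as np
import cudf

df = cudf.DataFrame({"x": np.random.randint(0, 100, 1_000_000),
                     "y": np.random.rand(1_000_000)})

# Compare on-disk sizes across the writer's compression options.
for codec in [None, "snappy", "ZSTD"]:
    path = f"data_{codec}.parquet"
    df.to_parquet(path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```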

**Describe the bug** In the libcudf benchmarks `PARQUET_READER_NVBENCH`, the STRUCT data type shows surprisingly high `peak_memory_usage`. For a 536 MB table, the INTEGRAL data type shows a 597 MiB peak...

bug
0 - Backlog
libcudf
cuIO
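
A rough Python-side way to reproduce the comparison, assuming RMM's statistics adaptor as a stand-in for the nvbench `peak_memory_usage` counter and a hypothetical input file:

```python
import rmm
import cudf

# Route cuDF's allocations through a statistics adaptor so the peak
# device-memory footprint of the read can be inspected afterwards.
base = rmm.mr.CudaMemoryResource()
stats = rmm.mr.StatisticsResourceAdaptor(base)
rmm.mr.set_current_device_resource(stats)

table = cudf.read_parquet("struct_table.parquet")
print(stats.allocation_counts)  # current/peak/total allocation counters
```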

**Is your feature request related to a problem? Please describe.** The `BM_parquet_read_chunks` benchmark in `benchmarks/io/parquet/parquet_reader_input.cpp` includes a `byte_limit` nvbench axis. This axis controls the `chunk_read_limit`. With the new features added...

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
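
`chunk_read_limit` is a knob on the libcudf C++ chunked reader and is not exposed directly in Python; as a loose analogue of reading a file in bounded pieces, one can iterate over row groups (the file name is hypothetical, and this does not exercise the same code path as the benchmark):

```python
import pyarrow.parquet as pq
import cudf

path = "large_table.parquet"
num_row_groups = pq.ParquetFile(path).num_row_groups

# Read one row group at a time instead of the whole file at once,
# bounding the working set to roughly one row group per step.
for rg in range(num_row_groups):
    piece = cudf.read_parquet(path, row_groups=[rg])
    _ = len(piece)  # placeholder for real per-chunk processing
```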

The Parquet V1 format supports three page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding) ([reference from the Spark Jira](https://issues.apache.org/jira/browse/SPARK-36879)). The newer and evolving Parquet V2 specification adds support for several...

feature request
2 - In Progress
libcudf
cuIO
helps: Spark
improvement
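
A quick way to see which page encodings a cuDF-written file actually contains is to inspect the footer metadata with pyarrow (file and column names are illustrative):

```python
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"category": ["a", "b", "a", "c"] * 1000})
df.to_parquet("encodings.parquet")

meta = pq.ParquetFile("encodings.parquet").metadata
for i in range(meta.num_columns):
    col = meta.row_group(0).column(i)
    print(col.path_in_schema, col.encodings)
```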

-- this is a draft, please do not comment yet -- The end-to-end throughput of a file reader is limited by the sequential read speed of the underlying data source....

feature request
0 - Backlog
libcudf
cuIO
helps: Spark
helps: Python
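
One simple way to put a number on the end-to-end throughput in question: time a full-file read and divide by the on-disk size (the path is a placeholder, and the figure includes decompression and decoding, not just I/O):

```python
import os
import time
import cudf

path = "large_table.parquet"

start = time.perf_counter()
df = cudf.read_parquet(path)
elapsed = time.perf_counter() - start

file_bytes = os.path.getsize(path)
print(f"read {file_bytes / 1e9:.2f} GB in {elapsed:.2f} s "
      f"({file_bytes / elapsed / 1e9:.2f} GB/s end-to-end)")
```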

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:
* Rare failure with page size estimator (PQ writer, [Report](https://github.com/rapidsai/cudf/issues/13250),...

feature request
0 - Backlog
tests
libcudf
cuIO
helps: Spark
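
The kind of write-then-verify check implied here, sketched at the Python level; a real test matrix would sweep data types, row counts, and writer options rather than this single fixed table:

```python
import numpy as np
import cudf

n = 100_000
df = cudf.DataFrame({
    "ints": np.random.randint(-1_000, 1_000, n),
    "floats": np.random.rand(n),
    "strings": [f"s{i % 10}" for i in range(n)],
})

# Round-trip through both GPU writers and confirm the data survives intact.
df.to_parquet("check.parquet")
assert cudf.read_parquet("check.parquet").equals(df)

df.to_orc("check.orc")
assert cudf.read_orc("check.orc").equals(df)
```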