[FEA] Fix case insensitive match on native parquet column pruning

Open revans2 opened this issue 3 years ago • 6 comments

After https://github.com/NVIDIA/spark-rapids-jni/pull/199 and https://github.com/NVIDIA/spark-rapids/pull/5310 we will have an option to use native code for parquet column pruning and footer parsing. One of the issues is that C++ does not have built-in APIs to convert a Unicode string to lower case. It can do it a single character at a time, which works most of the time but gives wrong results in some cases. This issue is to find a better way to make the strings lowercase.
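To illustrate the kind of gap described above, here is a minimal sketch of a byte-at-a-time lowercase comparison on UTF-8 column names. The helper names `ascii_lower` and `names_match_ascii_ci` are hypothetical and are not taken from spark-rapids-jni or libcudf. The sketch assumes the names are UTF-8 encoded and only folds the ASCII range, so names that differ only in the case of a non-ASCII letter do not match.

```cpp
#include <string>

// Hypothetical helper: lower-cases ASCII letters only, one byte at a time.
// Bytes of multi-byte UTF-8 sequences are all >= 0x80 and are left untouched.
std::string ascii_lower(const std::string& s) {
    std::string out = s;
    for (char& c : out) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return out;
}

// Hypothetical helper: case-insensitive comparison that is only
// correct when the case differences are confined to ASCII letters.
bool names_match_ascii_ci(const std::string& a, const std::string& b) {
    return ascii_lower(a) == ascii_lower(b);
}
```

With this sketch, "ColA" matches "cola", but "École" does not match "école", because the two-byte encoding of "É" is never folded. Full Unicode case mapping can also change the number of code points (for example, U+0130 lowercases to U+0069 followed by U+0307), which a one-character-in, one-character-out conversion cannot express.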

revans2 avatar Apr 27 '22 13:04 revans2

Is this feature requested in the cuIO reader? Is case insensitivity part of the parquet spec? I believe we can do this one layer above libcudf.

devavret avatar May 18 '22 08:05 devavret

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jun 17 '22 09:06 github-actions[bot]

@davidwendt, does strings::case::to_lower work with Unicode?

GregoryKimball avatar Jun 28 '22 23:06 GregoryKimball

For reference: https://docs.rapids.ai/api/libcudf/stable/group__strings__case.html#ga8ec672aad6467cc71f37b1a3ac8179eb
There is no case namespace, just cudf::strings::to_lower. There is no Unicode support anywhere in libcudf. All strings in libcudf are expected to be UTF-8.

davidwendt avatar Jun 29 '22 13:06 davidwendt

@revans2 Is this still needed? Also, is this a parquet project or a strings project?

GregoryKimball avatar Feb 16 '24 23:02 GregoryKimball

This is not needed. It was a nice-to-have even when it was filed. Feel free to close it.

revans2 avatar Feb 20 '24 20:02 revans2