cudf
[FEA] Fix case insensitive match on native parquet column pruning
After https://github.com/NVIDIA/spark-rapids-jni/pull/199 and https://github.com/NVIDIA/spark-rapids/pull/5310 we will have an option to use native code for column pruning and footer parsing for Parquet. One issue is that C++ has no built-in API to convert a Unicode string to lowercase. It can convert a single character at a time, which works most of the time but can give wrong results in some cases. This issue is to find a better way to lowercase the strings.
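To illustrate the kind of problem described above, here is a minimal sketch, not the code from either PR. It assumes column names are UTF-8 and lowercases them one byte at a time, touching only ASCII letters. The helper names ascii_lower, equals_ignore_ascii_case, and demo_column_match are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>

// Lowercase only the ASCII letters A-Z, one byte at a time.
// Bytes that belong to multi-byte UTF-8 sequences are left untouched.
std::string ascii_lower(std::string s) {
  for (char& ch : s) {
    if (ch >= 'A' && ch <= 'Z') {
      ch = static_cast<char>(ch - 'A' + 'a');
    }
  }
  return s;
}

// Case-insensitive comparison that is only correct for ASCII names.
bool equals_ignore_ascii_case(const std::string& a, const std::string& b) {
  return ascii_lower(a) == ascii_lower(b);
}

// Usage: ASCII names match, but a non-ASCII pair that differs only by
// case does not, because the multi-byte code point is never lowered.
void demo_column_match() {
  assert(equals_ignore_ascii_case("UserID", "userid"));

  const std::string upper_e = "\xC3\x89";  // U+00C9, "É" encoded as UTF-8
  const std::string lower_e = "\xC3\xA9";  // U+00E9, "é" encoded as UTF-8
  assert(!equals_ignore_ascii_case(upper_e, lower_e));
}
```

A matcher like this handles names that are pure ASCII, which is probably the common case, but it misses any name whose case difference lies in a non-ASCII character. Closing that gap would require Unicode-aware case mapping, which the C++ standard library does not provide for UTF-8 strings.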
Is this feature requested in the cuIO reader? Is case insensitivity part of the parquet spec? I believe we can do this one layer above libcudf.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@davidwendt, does strings::case::to_lower work with Unicode?
For reference: https://docs.rapids.ai/api/libcudf/stable/group__strings__case.html#ga8ec672aad6467cc71f37b1a3ac8179eb
There is no case namespace, just cudf::strings::to_lower.
There is no unicode support anywhere in libcudf. All strings in libcudf are expected to be UTF-8.
@revans2 Is this still needed? Also, is this a Parquet project or a strings project?
This is not needed. It was a nice-to-have even when it was filed. Feel free to close it.