Replace usages of `thrust::optional` with `std::optional`
We want to get rid of thrust types in API boundaries so replace them by the better suited std types
@miscco could you revert the changes to the three pyproject.toml files (I don't think you modified them yourself, maybe a pre-commit hook did?)? Perhaps after merging trunk. Those changes are currently blocking the check-style check.
@wence- Done
I discussed with @miscco and we decided:
- We'll change host-only uses of `thrust::optional` to `std::optional` in this PR
- Device uses of `thrust::optional` can be replaced with `cuda::std::optional` in a future PR, once we upgrade to CCCL 2.3 or newer
Is there any reason that this didn't get merged? Should we get this updated and pushed through? Apologies if it simply fell through the cracks.
I believe we decided to push this back until rapids moves to cccl 2.3
Got it, thanks!
We'll revisit once #15327 is addressed then.
@miscco we've updated to 12.5 now, could you please rebase this PR onto the latest 24.08?
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
@vyasr I have force-pushed the branch since it was old, and rebased the PR on top of 24.10.
Shout if you want it to go into 24.08
/ok to test
/ok to test
@miscco thanks! Nope, 24.10 is fine. I don't think this is urgent, but I was clearing out the backlog of open PRs and realized this one ought to be good to go now and figured better not to let it go more stale.
/ok to test
@miscco Looks like there are legitimate test failures:
[ RUN ] TransformTest.IsNull
/opt/conda/conda-bld/work/cpp/tests/utilities/column_utilities.cu:262: Failure
Expected equality of these values:
lhs.null_count()
Which is: 0
rhs.null_count()
Which is: 4
Google Test trace:
/opt/conda/conda-bld/work/cpp/tests/ast/transform_tests.cpp:134: <-- line of failure
/opt/conda/conda-bld/work/cpp/tests/utilities/column_utilities.cu:262: Failure
Expected equality of these values:
lhs.null_count()
Which is: 0
rhs.null_count()
Which is: 4
Google Test trace:
/opt/conda/conda-bld/work/cpp/tests/ast/transform_tests.cpp:140: <-- line of failure
[ FAILED ] TransformTest.IsNull (2 ms)
OK, we now have a few runs here showing the same error. It's a single Python test that's failing in the same way in both wheel and conda CI, and it has recurred after merging in the latest changes a few times, so I'm guessing it's not a fluke. It's a test of the DataFrame.eval method, which would be affected by changes to the evaluator. There are also the earlier C++ failures that @bdice pointed out above. Those aren't failing anymore, so maybe this is a flaky test? It could be accessing garbage memory via an optional, and therefore not reproduce consistently. It might be meaningful that it's consistently failing on CUDA 11.8, but that could be a red herring too.
I'll try one last merge of the latest 24.10 to get one more data point, just to be sure.
Yeah sorry about that, I do not have capacity to investigate this currently
No problem. I pushed on this a little since the original blocker was resolved and I was hoping to help you wrap it up, but since there is more actual work to be done we can put a pin in it for the moment. Not urgent.
I was able to recreate this with CUDA 11.8. The `cuda::std::optional` change in cpp/include/cudf/ast/detail/operators.hpp seems to be the cause. I used the following to show that eval seems to be getting random garbage when using `cuda::std::optional` vs. `thrust::optional`:
>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3, None, 5]})
>>> df.eval("isnull(a)")
0 True
1 <NA>
2 <NA>
3 <NA>
4 <NA>
dtype: bool
>>> df.eval("isnull(a)")
0 True
1 False
2 False
3 <NA>
4 False
dtype: bool
The above shows there are very different results when calling eval twice on the same input.
The correct output should be:
>>> df.eval("isnull(a)")
0 False
1 False
2 False
3 True
4 False
dtype: bool
So there is some subtle difference between `thrust::optional` and `cuda::std::optional`.
Or perhaps there is an error in the expression evaluator that `cuda::std::optional` happens to expose, given that the AST_TEST was failing in earlier CI builds as well.
Thanks for the analysis @davidwendt.
Should we consider merging all the changes except those in cpp/include/cudf/ast/detail/operators.hpp for now? Or do we think this error is a showstopper for adoption of cuda::std::optional more generally?
I have a hard time believing there is an issue with `cuda::std::optional`, so I'm inclined to undo the changes to operators.hpp and open an issue to figure out what is going on with AST, in order to merge this PR sooner rather than later.
Since @miscco said he doesn't have bandwidth to finish this PR right now, I am planning to push some changes (reverting changes in AST code paths) that will let us complete this. I will address @davidwendt's feedback to remove the use of an optional in AST code paths in a follow-up PR.
@davidwendt Tests are passing for me locally as of 0418c91. I opened https://github.com/rapidsai/cudf/pull/16604 to explore the possibility of removing an optional from the AST code paths. I will also use that PR to investigate whether changing to cuda::std::optional in the AST operators shows this problem -- or if it's just failing in the optional right row index.
Sounds good. I did this as part of my investigation. Removing the optional right-row-index had no effect on the AST failure/success on my local machine.
Thanks both of you! I appreciate you taking point on getting this PR finished.
@davidwendt Could you review again? Or can I dismiss your request for changes? I think this will be ready to go once we rerun the Java CI (it failed, but there are other GPU jobs running so I'll retrigger CI later).
/merge
Thanks @miscco and reviewers!
We finally sorted these issues out in #17793, specifically in https://github.com/rapidsai/cudf/pull/17793/commits/886c40e8671f68ee1895ba0f5b0301facfe3bd3b. The issue was in the SFINAE for the IS_NULL operator.