cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Replace usages of `thrust::optional` with `std::optional`

Open miscco opened this issue 1 year ago • 6 comments

We want to get rid of thrust types in API boundaries so replace them by the better suited std types

miscco avatar Feb 20 '24 16:02 miscco

@miscco could you revert the changes to the three pyproject.toml files (I don't think you modified them yourself, maybe a pre-commit hook did?)? Perhaps after merging trunk. Those changes are currently blocking the check-style check.

wence- avatar Mar 11 '24 10:03 wence-

@wence- Done

miscco avatar Mar 11 '24 13:03 miscco

I discussed with @miscco and we decided:

  • We'll change host-only uses of thrust::optional to std::optional in this PR
  • Device uses of thrust::optional can be replaced with cuda::std::optional in a future PR, once we upgrade to CCCL 2.3 or newer

bdice avatar Mar 18 '24 15:03 bdice

Is there any reason that this didn't get merged? Should we get this updated and pushed through? Apologies if it simply fell through the cracks.

vyasr avatar May 21 '24 18:05 vyasr

I believe we decided to push this back until rapids moves to cccl 2.3

miscco avatar May 22 '24 14:05 miscco

Got it, thanks!

We'll revisit once #15327 is addressed then.

vyasr avatar May 23 '24 04:05 vyasr

@miscco we've updated to 12.5 now, could you please rebase this PR onto the latest 24.08?

vyasr avatar Jul 19 '24 16:07 vyasr

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jul 24 '24 08:07 copy-pr-bot[bot]

@vyasr I have force pushed the branch due to it being old and rebased the PR on top of 24.10

Shout if you want it to go into 24.08

miscco avatar Jul 24 '24 08:07 miscco

/ok to test

miscco avatar Jul 24 '24 14:07 miscco

/ok to test

bdice avatar Jul 24 '24 15:07 bdice

@miscco thanks! Nope, 24.10 is fine. I don't think this is urgent, but I was clearing out the backlog of open PRs and realized this one ought to be good to go now and figured better not to let it go more stale.

vyasr avatar Jul 24 '24 16:07 vyasr

/ok to test

bdice avatar Jul 25 '24 16:07 bdice

@miscco Looks like there are legitimate test failures:

[ RUN      ] TransformTest.IsNull
/opt/conda/conda-bld/work/cpp/tests/utilities/column_utilities.cu:262: Failure
Expected equality of these values:
  lhs.null_count()
    Which is: 0
  rhs.null_count()
    Which is: 4
Google Test trace:
/opt/conda/conda-bld/work/cpp/tests/ast/transform_tests.cpp:134:  <--  line of failure

/opt/conda/conda-bld/work/cpp/tests/utilities/column_utilities.cu:262: Failure
Expected equality of these values:
  lhs.null_count()
    Which is: 0
  rhs.null_count()
    Which is: 4
Google Test trace:
/opt/conda/conda-bld/work/cpp/tests/ast/transform_tests.cpp:140:  <--  line of failure

[  FAILED  ] TransformTest.IsNull (2 ms)

bdice avatar Jul 26 '24 15:07 bdice

OK, we have a few runs now here showing the same error. It's a single Python test that's failing in both wheel and conda CI in the same way, and it's happened after merging in the latest a few times, so I'm guessing it's not a fluke. It's a test of the DataFrame.eval method, which would be affected by changes to the evaluator. Plus, there are the earlier C++ failures that @bdice pointed out above. Those aren't failing anymore, so maybe this is a flaky test? It could be accessing garbage memory via an optional, perhaps, and therefore not reproduce consistently? It might be meaningful that it's consistently failing on CUDA 11.8, but that could be a red herring too.

I'll try one last merge of the latest 24.10 to get one more data point, just to be sure.

vyasr avatar Aug 16 '24 21:08 vyasr

Yeah sorry about that, I do not have capacity to investigate this currently

miscco avatar Aug 19 '24 08:08 miscco

No problem. I pushed on this a little since the original blocker was resolved and I was hoping to help you wrap it up, but since there is more actual work to be done we can put a pin in it for the moment. Not urgent.

vyasr avatar Aug 19 '24 15:08 vyasr

I was able to recreate this with CUDA 11.8. The cuda::std::optional change in cpp/include/cudf/ast/detail/operators.hpp seems to be the cause. I used the following to show that eval seems to be getting random garbage when using cuda::std::optional vs thrust::optional

>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3, None, 5]})
>>> df.eval("isnull(a)")
0    True
1    <NA>
2    <NA>
3    <NA>
4    <NA>
dtype: bool
>>> df.eval("isnull(a)")
0     True
1    False
2    False
3     <NA>
4    False
dtype: bool

The above shows there are very different results when calling eval twice on the same input. The correct output should be:

>>> df.eval("isnull(a)")
0    False
1    False
2    False
3     True
4    False
dtype: bool

So there is some subtle difference between thrust::optional and cuda::std::optional. Or perhaps there is an error in the expression evaluator that cudf::std::optional is able to manifest given that the AST_TEST was failing in earlier CI builds as well.

davidwendt avatar Aug 19 '24 18:08 davidwendt

Thanks for the analysis @davidwendt.

Should we consider merging all the changes except those in cpp/include/cudf/ast/detail/operators.hpp for now? Or do we think this error is a showstopper for adoption of cuda::std::optional more generally?

bdice avatar Aug 19 '24 18:08 bdice

Should we consider merging all the changes except those in cpp/include/cudf/ast/detail/operators.hpp for now? Or do we think this error is a showstopper for adoption of cuda::std::optional more generally?

I have a hard time believing there is an issue with cuda::std::optional so I'm inclined to undo the changes to operators.hpp and open an issue to figure out what is going on with AST in order to merge this PR sooner than later.

davidwendt avatar Aug 19 '24 18:08 davidwendt

Since @miscco said he doesn't have bandwidth to finish this PR right now, I am planning to push some changes (reverting changes in AST code paths) that will let us complete this. I will address @davidwendt's feedback to remove the use of an optional in AST code paths in a follow-up PR.

bdice avatar Aug 19 '24 18:08 bdice

@davidwendt Tests are passing for me locally as of 0418c91. I opened https://github.com/rapidsai/cudf/pull/16604 to explore the possibility of removing an optional from the AST code paths. I will also use that PR to investigate whether changing to cuda::std::optional in the AST operators shows this problem -- or if it's just failing in the optional right row index.

bdice avatar Aug 19 '24 19:08 bdice

@davidwendt Tests are passing for me locally as of 0418c91. I opened #16604 to explore the possibility of removing an optional from the AST code paths. I will also use that PR to investigate whether changing to cuda::std::optional in the AST operators shows this problem -- or if it's just failing in the optional right row index.

Sounds good. I did this as part of my investigation. Removing the optional right-row-index had no effect on the AST failure/success on my local machine.

davidwendt avatar Aug 19 '24 19:08 davidwendt

Thanks both of you! I appreciate you taking point on getting this PR finished.

vyasr avatar Aug 19 '24 20:08 vyasr

@davidwendt Could you review again? Or can I dismiss your request for changes? I think this will be ready to go once we rerun the Java CI (it failed, but there are other GPU jobs running so I'll retrigger CI later).

bdice avatar Aug 20 '24 00:08 bdice

/merge

bdice avatar Aug 20 '24 14:08 bdice

Thanks @miscco and reviewers!

bdice avatar Aug 20 '24 14:08 bdice

We finally sorted these issues out in #17793, specifically in https://github.com/rapidsai/cudf/pull/17793/commits/886c40e8671f68ee1895ba0f5b0301facfe3bd3b. The issue was in the SFINAE for the IS_NULL operator.

vyasr avatar Mar 06 '25 22:03 vyasr