
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Open Jacobsonradical opened this issue 10 months ago • 4 comments

Describe the bug

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/run_scoring.py", line 294, in _run_scorer_parallelizable
scoringResults = scorer.prescore(scoringArgs, preserveRatings=not runParallel)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/scorer.py", line 301, in prescore
noteScores, userScores, metaScores = self._prescore_notes_and_users(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/mf_base_scorer.py", line 554, in _prescore_notes_and_users
) = self._run_stable_matrix_factorization(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/mf_base_scorer.py", line 449, in _run_stable_matrix_factorization
return self._run_regular_matrix_factorization(ratingsForTraining)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/mf_base_scorer.py", line 424, in _run_regular_matrix_factorization
return self._mfRanker.run_mf(ratingsForTraining)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/matrix_factorization/matrix_factorization.py", line 560, in run_mf
self._lossModule = NormalizedLoss(
^^^^^^^^^^^^^^^
File "/root/community-note/communitynotes/sourcecode/scoring/matrix_factorization/normalized_loss.py", line 108, in __init__
assert all(ratings[labelCol].values == targets.numpy())
^^^^^^^^^^^^^^^
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
"""

To Reproduce

I ran the following in a shell:

python3 main.py \
  --enrollment /root/community-note/enrollment/2024-12-29_20-02.tsv \
  --notes /root/community-note/note/2024-12-29_20-02.tsv \
  --ratings /root/community-note/rating/ \
  --status /root/community-note/status/2024-12-29_20-02.tsv \
  --outdir /root/community-note/notescore \
  --parallel

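Expected behavior

I believe the error comes from normalized_loss.py, line 108:

assert all(ratings[labelCol].values == targets.numpy())

According to the error message, targets is on cuda:0 at this point, and Tensor.numpy() only works on tensors in host memory. I am not sure whether I should change the line to:

assert all(ratings[labelCol].values == targets.cpu().numpy())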

Environment

  1. Same virtual environment as in the requirements
  2. 2x NVIDIA H100 80GB HBM3
  3. CUDA 12.2
  4. python 3.11.9
  5. Intel(R) Xeon(R) Platinum 8462Y+
  6. 516GB RAM

Jacobsonradical avatar Jan 16 '25 19:01 Jacobsonradical

We ran into the same issue. @tuler you successfully ran the code a few days ago. Did you encounter the same issue?

avalanchesiqi avatar Jan 25 '25 02:01 avalanchesiqi

@avalanchesiqi
I think there are two ways to solve this.

  1. Install the CPU-only build of PyTorch. PyTorch then computes everything on the CPU, so no tensor needs to be transferred.
  2. Change the line to assert all(ratings[labelCol].values == targets.cpu().numpy())
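
For the first option, a lighter-weight variant (not the reinstall described above) is to hide the GPUs from the process so that an existing CUDA build of PyTorch falls back to the CPU. The sketch below assumes the code picks its device with torch.cuda.is_available(); whether the communitynotes scorer does so has not been checked here, and pick_device is a hypothetical helper used only for illustration.

```python
import os

# Hide all CUDA devices from this process. This has to be set before CUDA is
# initialized, so the safest place is before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch


def pick_device() -> torch.device:
    """Hypothetical device selection: use CUDA only if a device is visible."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


device = pick_device()
targets = torch.tensor([1.0, 0.0], device=device)

# With no visible GPU the tensor lives in host memory, so numpy() succeeds.
host = targets.numpy()
print(device, host)
```

The same variable can be set on the command line instead, by prefixing the python3 main.py invocation with CUDA_VISIBLE_DEVICES="". Worker processes started by --parallel inherit the environment, so they should see no GPU either.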

Jacobsonradical avatar Jan 25 '25 19:01 Jacobsonradical

> We ran into the same issue. @tuler you successfully ran the code a few days ago. Did you encounter the same issue?

No, I ran on CPU only.

tuler avatar Jan 25 '25 20:01 tuler

I think this could be a solution. Change this:

assert all(ratings[labelCol].values == targets.numpy())

to this:

assert all(ratings[labelCol].values == targets.detach().cpu().numpy())
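
As a rough illustration of why the detach().cpu().numpy() chain is device-agnostic, here is a minimal sketch. The function targets_match_labels and the sample data are hypothetical stand-ins for the check in normalized_loss.py, not code from the repository. It uses np.array_equal instead of all(a == b), which additionally requires the shapes to match.

```python
import numpy as np
import torch


def targets_match_labels(labels: np.ndarray, targets: torch.Tensor) -> bool:
    """Return True if labels and targets hold the same values.

    detach() drops autograd history (numpy() refuses tensors that require
    grad), and cpu() copies the data to host memory, returning the tensor
    unchanged if it is already there. After both calls, numpy() works
    wherever the tensor started.
    """
    host_targets = targets.detach().cpu().numpy()
    return bool(np.array_equal(labels, host_targets))


# Hypothetical data standing in for ratings[labelCol].values and targets.
device = "cuda" if torch.cuda.is_available() else "cpu"
labels = np.array([1.0, 0.0, 0.5], dtype=np.float32)
targets = torch.tensor(labels, device=device, requires_grad=True)

assert targets_match_labels(labels, targets)
```

The same call runs on a CPU-only machine and on a GPU machine, so the assertion no longer depends on where the targets were allocated.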

AntonioCoppe avatar Jun 12 '25 15:06 AntonioCoppe