
Adding PyTorch functionality to SIRF

Open Imraj-Singh opened this issue 10 months ago • 26 comments

Changes in this pull request

The functionality:

  • Functions for transferring data from torch to SIRF and vice versa.
  • torch.autograd.Function for the objective function, acquisition model forward, and acquisition model backward. These are not really meant to be used directly by a user.
  • torch.nn.Module for a module that interacts with the torch.autograd.Function. In the forward method we could change the number of dimensions in the array, swap in/out components of the forward operator, etc.

This functionality of torch.nn.Module can vary depending on the user's requirements, so perhaps we should just give a use case. Using it for LPD, for example, or PET reconstruction with torch optimisers?
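Roughly, the wrapping pattern is the following (a sketch only, with illustrative names and signatures rather than the exact classes added in this PR):

import torch

class _AcqModForward(torch.autograd.Function):
    """Illustrative bridge: forward applies a SIRF acquisition model,
    backward applies its adjoint to the incoming gradient."""

    @staticmethod
    def forward(ctx, x, image_template, acq_data_template, acq_model):
        ctx.acq_data_template = acq_data_template
        ctx.acq_model = acq_model
        sirf_image = image_template.clone()
        sirf_image.fill(x.detach().cpu().numpy())               # torch -> SIRF
        y = acq_model.forward(sirf_image)
        return torch.as_tensor(y.as_array(), device=x.device)   # SIRF -> torch

    @staticmethod
    def backward(ctx, grad_output):
        g = ctx.acq_data_template.clone()
        g.fill(grad_output.detach().cpu().numpy())               # torch -> SIRF
        x_grad = ctx.acq_model.backward(g)                        # adjoint/backprojection
        grad = torch.as_tensor(x_grad.as_array(), device=grad_output.device)
        return grad, None, None, None                             # no grads for templates/model

class AcquisitionModelModule(torch.nn.Module):
    """Thin nn.Module wrapper; reshaping, batching, swapping operator
    components etc. would live in its forward."""
    def __init__(self, image_template, acq_data_template, acq_model):
        super().__init__()
        self.image_template = image_template
        self.acq_data_template = acq_data_template
        self.acq_model = acq_model

    def forward(self, x):
        return _AcqModForward.apply(x, self.image_template,
                                    self.acq_data_template, self.acq_model)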

Testing performed

Gradchecks for all the new functions. Note that objective function evaluations are quite unstable.
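For reference, the checks are of this form (a sketch; gradcheck wants double-precision inputs and, since SIRF computes in float32, fairly loose tolerances, which is presumably where the instability comes from):

import torch

def check_wrapped_op(wrapped_op, image_shape):
    # Small random double-precision input; relaxed tolerances because the
    # underlying SIRF operator works in float32.
    x = torch.rand(image_shape, dtype=torch.float64, requires_grad=True)
    return torch.autograd.gradcheck(wrapped_op, (x,),
                                    eps=1e-3, atol=1e-4, rtol=1e-3)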

Related issues

Checklist before requesting a review

  • [x] I have performed a self-review of my code
  • [x] I have added docstrings/doxygen in line with the guidance in the developer guide
  • [x] I have implemented unit tests that cover any new or modified functionality
  • [x] The code builds and runs on my machine
  • [x] CHANGES.md has been updated with any functionality change

Contribution Notes

Please read and adhere to the contribution guidelines.

Please tick the following:

  • [x] The content of this Pull Request (the Contribution) is intentionally submitted for inclusion in SIRF (the Work) under the terms and conditions of the Apache-2.0 License.

Imraj-Singh avatar Feb 15 '25 13:02 Imraj-Singh

Added a README with some notes for a design discussion here.

Imraj-Singh avatar Feb 18 '25 20:02 Imraj-Singh

By the way, we should avoid running CI on this until it makes sense. Easiest is to put [ci skip] in the message of your last commit that you will push.

KrisThielemans avatar Feb 19 '25 08:02 KrisThielemans


For me, this description of the autograd in pytorch makes sense: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#optional-reading-vector-calculus-using-autograd

This explains why, in the backward pass of a vector-valued function y = f(x), we have to implement the Jacobian transpose applied to a vector (the gradient supplied to the layer during back-propagation).

If we write that out in components, we get the "chain rule".
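Spelling that out: for $y = f(x)$ feeding into a scalar loss $L$, backward receives $v$ with $v_j = \partial L/\partial y_j$ and must return $J^\top v$:

$$
\frac{\partial L}{\partial x_i}
  = \sum_j \frac{\partial y_j}{\partial x_i}\,\frac{\partial L}{\partial y_j}
  = \bigl(J^\top v\bigr)_i ,
\qquad
J_{ji} = \frac{\partial y_j}{\partial x_i}.
$$

For a linear operator $y = Ax$, such as an acquisition model, this is just $A^\top v$, i.e. the adjoint applied to the incoming gradient.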


gschramm avatar Mar 04 '25 07:03 gschramm

Thanks for the link @gschramm. We could maybe include it in the README. The main question for me is how much we should elaborate in our own doc, especially if it gets into detail on what common usage is vs what the code actually does. I've put some detailed comments, and @Imraj-Singh will address those he agrees with :-). I'd personally prefer not too much (potentially confusing) detail in the README, but to point to the torch docs and our use cases (which could then have a bit more doc) or exercises.

KrisThielemans avatar Mar 04 '25 11:03 KrisThielemans

Hi, I am happy with this. I think the README is now understandable for me, and the user should know what dimensionality to give the tensors.

danieldeidda avatar Mar 12 '25 11:03 danieldeidda

@KrisThielemans, @casperdcl am I correct in thinking it's just CI left here, where we have a torch build and run the torch tests, then we can merge? Perhaps I can look at doing this at the hackathon?

Also, there already seems to be some error with CI: https://github.com/SyneRBI/SIRF/actions/runs/14112476864/job/39534509274?pr=1305. It doesn't seem to be related to the changes here?

Imraj-Singh avatar Apr 06 '25 09:04 Imraj-Singh

Yes. As CI runs without CUDA, we need to have a default for device, as indicated in the comments.
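i.e. something along these lines (illustrative, not the exact signature in the PR):

import torch

def sirf_to_torch(sirf_data, device=None):
    # Fall back to the CPU when no device is given, so CI machines
    # without a GPU still work.
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.as_tensor(sirf_data.as_array(), device=device)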

KrisThielemans avatar Apr 06 '25 10:04 KrisThielemans

Is this PR behind master? @evgueni-ovtchinnikov are you aware of these failing tests?

KrisThielemans avatar Apr 06 '25 11:04 KrisThielemans

@KrisThielemans I noticed that it's good practice to add the once_differentiable decorator, so I have added that.
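For reference, it is applied to the backward of the custom Function and makes attempts at double backward raise an error instead of silently giving wrong gradients, roughly (the operator methods below are placeholders):

import torch
from torch.autograd.function import once_differentiable

class WrappedOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, op):
        ctx.op = op
        return op.forward_t(x)        # placeholder for the SIRF forward call

    @staticmethod
    @once_differentiable              # second-order gradients are not supported
    def backward(ctx, grad_output):
        return ctx.op.adjoint_t(grad_output), None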

Also, it seemed like the PR was behind main, so I have resynced.

Imraj-Singh avatar Apr 06 '25 12:04 Imraj-Singh

@Imraj-Singh @casperdcl @evgueni-ovtchinnikov Where are we with this? I see lots of errors

 TypeError: float() argument must be a string or a real number, not '_NoValueType'

which don't occur in https://github.com/SyneRBI/SIRF/actions/runs/14908966112 for instance. No idea where they come from.

KrisThielemans avatar May 09 '25 09:05 KrisThielemans

It'd be great if we could get this across the finishing line now, especially since #1314 is merged. I see there are some conflicts, but they are trivial. It'd be great to replace as_array() with asarray() now.
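(Roughly, assuming asarray() returns a NumPy view of the container's data while as_array() returns a copy, the torch conversion then becomes:)

import torch

def sirf_to_torch(container):
    # On CPU this can share memory with the SIRF container instead of
    # copying, assuming asarray() returns a writable view.
    return torch.from_numpy(container.asarray())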

Not sure about the above errors, but maybe they will have disappeared! 🤞

@Imraj-Singh want to give this merge a go? (shouldn't take more than 10 minutes)

KrisThielemans avatar Jun 26 '25 14:06 KrisThielemans

Removing

from sirf.SIRF import *

from cmake/sirf.__init__.py.in (and hence sirf/__init__.py after installation) makes the normal SIRF tests work fine. However, presumably it will break the torch.py additions.

@casperdcl @evgueni-ovtchinnikov and I don't understand why this line creates trouble with numpy. A bit more information follows.

Errors are like

>               if im.as_array().min()==im.as_array().max() or penalisation_factor == 0:

src/xSTIR/pSTIR/tests/tests_qp_lc_rdp.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

a = array([[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,...0., -0., ..., -0., -0., -0.],
        [-0., -0., -0., ..., -0., -0., -0.]]],
      shape=(15, 211, 211), dtype=float32)
axis = None, out = None, keepdims = False, initial = <no value>, where = True

    def _amin(a, axis=None, out=None, keepdims=False,
              initial=_NoValue, where=True):
>       return umr_minimum(a, axis, None, out, keepdims, initial, where)
E       TypeError: float() argument must be a string or a real number, not '_NoValueType'

../../../virtualenv/lib/python3.10/site-packages/numpy/_core/_methods.py:48: TypeError

In some cases, we see

  /home/runner/work/SIRF/SIRF/src/xSTIR/pSTIR/tests/test_ObjectiveFunction.py:20: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
    import os, shutil, numpy

but not in all tests that fail (and in some that don't fail).

Somewhat related is https://github.com/numpy/numpy/issues/27857, which says that there is some weird conflict with numpy python/C going on.

I've tested this on my local system, and adding the import line there didn't create any trouble at all.

Also worth noting that in our current CI, pytorch isn't even installed.
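So the torch tests will either need a dedicated CI job or a skip guard along these lines (sketch):

import importlib.util
import pytest

# Skip this whole test module when pytorch is not available (e.g. current CI).
pytestmark = pytest.mark.skipif(importlib.util.find_spec("torch") is None,
                                reason="pytorch not installed")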

KrisThielemans avatar Sep 11 '25 10:09 KrisThielemans

Plan:

  • keep the import in sirf.__init__.py disabled for now
  • use import sirf.SIRF as sirf in the torch files
  • try to run CI using something similar to
    add_test(NAME Torch_TESTS_PYTHON
      COMMAND ${Python_EXECUTABLE} -m pytest --cov=sirf.torch --cov-config=.coveragerc-Torch src/torch/tests
      WORKING_DIRECTORY ${CMAKE_SOURCE_DIR})
    

KrisThielemans avatar Oct 30 '25 10:10 KrisThielemans

@casperdcl are you merging master on here?

KrisThielemans avatar Oct 30 '25 10:10 KrisThielemans

One job got through! The other one is due to #1353; I'll fix that.

KrisThielemans avatar Oct 30 '25 11:10 KrisThielemans

we now have

    msg = sirf.STIR.MessageRedirector(info=None, warn=None, errr=None)
          ^^^^^^^^^
AttributeError: module 'sirf.SIRF' has no attribute 'STIR'

so I guess the import trick didn't work

KrisThielemans avatar Oct 30 '25 12:10 KrisThielemans

Easily fixed that, but...

$ python use_cases.py
PET Varnet

 *** Break *** segmentation violation

KrisThielemans avatar Oct 30 '25 12:10 KrisThielemans

Seems to be crashing in STIR, so could be uncovering a STIR bug... sigh

KrisThielemans avatar Oct 30 '25 12:10 KrisThielemans

The segfault happens when I install pytorch via conda. There are then two build problems, including linking errors with missing pthread symbols, as in https://github.com/scikit-tda/ripser.py/issues/140, and two occurrences of omp.h. I might have deleted the wrong one.

If I install pytorch via pip, everything is fine. @casperdcl @paskino this might impact the PETRIC2 image.

KrisThielemans avatar Oct 30 '25 20:10 KrisThielemans

Tests on my machine (no GPU), with current STIR and SIRF

pytest gradcheck.py

gives

13 passed, 2 skipped, 13 warnings

which seems fine.

Running

python use_cases.py

first gives

$ python use_cases.py
PET Varnet
Error encountered: ??? "'PoissonLL accumulate_sub_Hessian_times_input: forward projection of input contains negatives. The result would be incorrect, so we abort.\\nSee https://github.com/UCL/STIR/issues/1461' exception caught at line 1329 of /home/kris/devel/cilsirfbuildnotorch/sources/SIRF/src/xSTIR/cSTIR/cstir.cpp; the reconstruction engine output may provide more information"
PET Varnet failed
...

This is expected and needs to be resolved in STIR. The output then continues:

Learned Primal Dual
Iteration:  0 Loss:  18.655933380126953
Iteration:  1 Loss:  17.91448974609375
Iteration:  2 Loss:  16.89657974243164
...
Iteration:  46 Loss:  7.5694499015808105
Iteration:  47 Loss:  7.539045333862305
Iteration:  48 Loss:  7.500810146331787
Iteration:  49 Loss:  7.456661701202393
qt.qpa.plugin: Could not find the Qt platform plugin "wayland" in ""
Comparing Gradient Descents
Learning Rate: 0.1, Number of Iterations: 5
Gradient Descent with Acquisition Model
Iteration:  0 Loss:  -16909088.0
Iteration:  1 Loss:  -17123176.0
Iteration:  2 Loss:  -17307868.0
Iteration:  3 Loss:  -17467476.0
Iteration:  4 Loss:  -17605596.0
Gradient Descent with Objective Function
Iteration:  0 Loss:  -16909088.0
Iteration:  1 Loss:  -17123178.0
Iteration:  2 Loss:  -17307868.0
Iteration:  3 Loss:  -17467476.0
Iteration:  4 Loss:  -17605596.0

which seems fine, but it then displays the following rather uninspiring image. Closing that, I get

Sum of absolute differences between GD solns:  0.00039303303

which could be fine.

I then also find the following 2 png files:

  • learned_primal_dual

  • gd_comparison (a white image)

@Imraj-Singh is this all as expected?

KrisThielemans avatar Oct 30 '25 20:10 KrisThielemans

cf. https://github.com/SyneRBI/PETRIC-backend/blob/main/Dockerfile which also uses pip :)

casperdcl avatar Oct 30 '25 22:10 casperdcl

The LPD image seems okay, but the GD comparison doesn't seem correct. The code for the GD comparison is: https://github.com/Imraj-Singh/SIRF/blob/7c29a5c893b2ecfcce67b61e4ff0b79695652f64/src/torch/tests/use_cases.py#L295C1-L338C108.

Just as we see the image in LPD, we should see the GD reconstruction too (although we should remove the plt.show() prior to the saving for GD). I am not sure what is happening. One difference between the two cases is that in GD we instantiate torch.nn.Parameters so that they are parameters updatable via autograd. When we return the image (https://github.com/Imraj-Singh/SIRF/blob/7c29a5c893b2ecfcce67b61e4ff0b79695652f64/src/torch/tests/use_cases.py#L391) we have to use ".data" to get the tensor from the parameter object. I don't know why that would cause an issue.

There could be a more obvious bug (perhaps we need more GD steps to start seeing an image?), but from my reading of the code the only difference in how the values are extracted for display is that ".data".
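For reference, the difference in question, with an illustrative tensor (".detach()" is normally preferred over ".data" when only reading values out for display):

import torch

x = torch.nn.Parameter(torch.rand(1, 1, 64, 64))   # the GD image estimate (illustrative shape)
# ... optimiser steps update x via autograd ...

via_data   = x.data        # raw tensor, bypasses autograd bookkeeping entirely
via_detach = x.detach()    # same values, but the safer way to read them out

assert torch.equal(via_data, via_detach)
img = via_detach.squeeze().cpu().numpy()            # what would be plotted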

Imraj-Singh avatar Nov 05 '25 20:11 Imraj-Singh

Hi @KrisThielemans, I thought this would be a relatively quick job, but I've had a lot of errors building. I am building on top of a pytorch image, with the Dockerfile:

FROM pytorch/pytorch:2.6.0-cuda12.6-cudnn9-devel
RUN apt-get update

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    wget \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

I attempt to install via SIRF-SuperBuild using:

cmake -S ./SIRF-SuperBuild -B ./build \
      -DSIRF_SOURCE_DIR=/home/user/sirf_env/SIRF \
      -DDISABLE_GIT_CHECKOUT_SIRF=ON \
      -DBUILD_SIRF=ON \
      -DBUILD_STIR=ON \
      -DBUILD_Gadgetron=OFF \
      -DBUILD_NIFTYREG=ON \
      -DBUILD_CIL=OFF \
      -DBUILD_ASTRA=OFF \
      -DBUILD_pet_rd_tools=OFF \
      -DBUILD_siemens_to_ismrmrd=OFF \
      -DUSE_SYSTEM_pugixml=ON \
      -DSIRF_EXTRA_CMAKE_ARGS="-DUSE_Gadgetron=OFF;-DUSE_NIFTYREG=ON;-DBUILD_SIRF_Registration=ON"

It built after making some changes outlined here.

I am getting intermittent errors when using the wrapped objective function:

Gradient Descent with Objective Function
double free or corruption (out)
Aborted (core dumped)
PET Varnet
Segmentation fault (core dumped)
malloc(): memory corruption (fast)
Aborted (core dumped)

On the occasions it does not crash, if I tweak some of the use case parameters I get the two attached images.

All 2D PET grad tests passed; the 3D PET grad tests take a very long time, and I haven't got Gadgetron to work, so I haven't tested that.

The intermittent segfault is rather worrying... Any ideas? I think the pytorch functionality is almost good to go, but after battling the build today I will try to have another look tomorrow.

Imraj-Singh avatar Nov 18 '25 01:11 Imraj-Singh

Sigh. build problems.......

You could have a look at https://github.com/SyneRBI/SIRF-SuperBuild/pull/946 for how @casperdcl managed to get things to build.

The memory problems are obviously very worrying. Ideally you'd build SIRF and STIR with RelWithDebInfo and run it with valgrind. It'll throw up a lot of (false positives?) in python etc, but hopefully it will also tell us where it's happening. Sorry.

KrisThielemans avatar Nov 18 '25 12:11 KrisThielemans

"double free" sounds related to wrapping pointers without handling ownership properly :/

casperdcl avatar Nov 18 '25 14:11 casperdcl

  • Pre-rebase branch: https://github.com/SyneRBI/SIRF/tree/torch-prerebase
  • Diff: https://github.com/SyneRBI/SIRF/compare/torch-prerebase...Imraj-Singh:SIRF:torch

casperdcl avatar Nov 20 '25 10:11 casperdcl