Adding PyTorch functionality to SIRF
Changes in this pull request
The functionality:
- Functions for transferring data between torch and SIRF, and vice versa.
- `torch.autograd.Function` subclasses for the objective function, acquisition model forward, and acquisition model backward. These are not really meant to be used directly by a user.
- A `torch.nn.Module` that interacts with the `torch.autograd.Function`. In its forward method we could change the number of dimensions of the array, swap in/out components of the forward operator, etc.
The behaviour of the `torch.nn.Module` can vary depending on the user's requirements, so perhaps we should just give a use case: learned primal-dual (LPD), for example, or PET reconstruction with torch optimisers? A minimal sketch of the wrapping pattern is given below.
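To make the pattern concrete, here is a minimal sketch (not the code in this PR) of wrapping the acquisition model forward, assuming a SIRF `AcquisitionModel` with `forward`/`backward` methods and SIRF containers with `clone`/`fill`/`as_array`; the class and argument names are illustrative only:

```python
import torch

class _AcquisitionModelForward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, acq_model, image_template):
        # move the tensor into a SIRF image and apply the forward operator
        image = image_template.clone()
        image.fill(x.detach().cpu().numpy())
        y = acq_model.forward(image)
        ctx.acq_model, ctx.acq_template, ctx.device = acq_model, y, x.device
        return torch.as_tensor(y.as_array(), device=x.device)

    @staticmethod
    def backward(ctx, grad_output):
        # the Jacobian transpose of a linear operator is its adjoint,
        # which SIRF exposes as AcquisitionModel.backward
        meas = ctx.acq_template.clone()
        meas.fill(grad_output.detach().cpu().numpy())
        grad_x = ctx.acq_model.backward(meas)
        grad = torch.as_tensor(grad_x.as_array(), device=ctx.device)
        return grad, None, None  # no gradients for the non-tensor arguments
```

A `torch.nn.Module` would then just call `_AcquisitionModelForward.apply(x, acq_model, image_template)` in its forward, reshaping between torch's batch/channel layout and SIRF's array layout.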
Testing performed
Gradchecks for all the new functions. Note that objective function evaluations are quite unstable.
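As an illustration of the kind of check involved (a sketch, with `wrapped_op` standing in for one of the wrapped operators rather than a name from this PR):

```python
import torch

def test_wrapped_op_gradcheck(wrapped_op):
    # gradcheck compares analytic gradients against finite differences;
    # double precision and a small input keep it stable and fast
    x = torch.rand(8, 8, dtype=torch.float64, requires_grad=True)
    assert torch.autograd.gradcheck(wrapped_op, (x,), eps=1e-6, atol=1e-4)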
Related issues
Checklist before requesting a review
- [x] I have performed a self-review of my code
- [x] I have added docstrings/doxygen in line with the guidance in the developer guide
- [x] I have implemented unit tests that cover any new or modified functionality
- [x] The code builds and runs on my machine
- [x] CHANGES.md has been updated with any functionality change
Contribution Notes
Please read and adhere to the contribution guidelines.
Please tick the following:
- [x] The content of this Pull Request (the Contribution) is intentionally submitted for inclusion in SIRF (the Work) under the terms and conditions of the Apache-2.0 License.
Added a README with some notes for a design discussion here.
By the way, we should avoid running CI on this until it makes sense. The easiest way is to put [ci skip] in the message of the last commit that you push.
gschramm
For me, this description of the autograd in pytorch makes sense: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#optional-reading-vector-calculus-using-autograd
This explains why, in the backward pass of a vector-valued function y = f(x), we have to implement the Jacobian transpose applied to a vector (the vector supplied to the layer during backpropagation).
If we write that out in components, we get the "chain rule".
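Written out, for $y = f(x)$ with Jacobian $J_{ji} = \partial y_j / \partial x_i$, the backward pass receives $v$ with components $v_j = \partial L / \partial y_j$ and returns

$$
(J^\top v)_i = \sum_j \frac{\partial y_j}{\partial x_i}\,\frac{\partial L}{\partial y_j} = \frac{\partial L}{\partial x_i},
$$

i.e. exactly the chain rule in components.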
Thanks for the link @gschramm; we could maybe include it in the README. The main question for me is how much we should elaborate in our own doc, especially if it gets into detail on what is common usage vs what it actually does. I've put some detailed comments, and @Imraj-Singh will address those he agrees with :-). I'd personally prefer not too much (potentially confusing) detail in the README, but to point to the torch doc and our use cases (which could then have a bit more doc) or exercises.
Hi, I am happy with this. For me the README is now understandable, and the user should know what dimensionality to give the tensors; a shape sketch is given below.
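For illustration (an assumption about the convention, reusing the image shape that appears in the CI logs below, not text from the README itself):

```python
import numpy as np
import torch

# hypothetical stand-in for image.as_array(): SIRF yields plain (z, y, x)
x_np = np.zeros((15, 211, 211), dtype=np.float32)
# torch modules usually expect leading (batch, channel) dims
x_torch = torch.as_tensor(x_np)[None, None]      # shape (1, 1, 15, 211, 211)
x_back = x_torch.squeeze(0).squeeze(0).numpy()   # back to (15, 211, 211)
```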
@KrisThielemans, @casperdcl am I correct in thinking it's just CI that's left here, i.e. we need a torch build that runs the torch tests, and then we can merge? Perhaps I can look at doing this at the hackathon?
There also seems to be some error already with CI: https://github.com/SyneRBI/SIRF/actions/runs/14112476864/job/39534509274?pr=1305 — it doesn't seem to be related to the changes here?
Yes. As CI runs without CUDA, we need a default for the device, as indicated in the comments; see the sketch below.
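Something along these lines (a standard torch idiom, not code from this PR):

```python
import torch

# fall back to CPU when CUDA is unavailable, as on the CI runners
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```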
Is this PR behind master? @evgueni-ovtchinnikov are you aware of these failing tests?
@KrisThielemans I noticed that it's good practice to add the once_differentiable decorator, so I have added that (see the sketch after this comment).
Also it seemed like the PR was behind main, I have resynced.
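For reference, the decorator marks a backward that must not itself be differentiated; the names below are illustrative, not this PR's:

```python
import torch
from torch.autograd.function import once_differentiable

class _ObjectiveFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, obj_fun, image_template):
        ...  # evaluate the SIRF objective function; stash what backward needs

    # only first derivatives are available from SIRF, so attempting to
    # differentiate through this backward should raise rather than be wrong
    @staticmethod
    @once_differentiable
    def backward(ctx, grad_output):
        ...  # scale the SIRF gradient by grad_output; None for non-tensor args
```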
@Imraj-Singh @casperdcl @evgueni-ovtchinnikov Where are we with this? I see lots of errors
```
TypeError: float() argument must be a string or a real number, not '_NoValueType'
```
which don't occur in https://github.com/SyneRBI/SIRF/actions/runs/14908966112 for instance. No idea where they come from.
It'd be great if we could get this across the finishing line now, especially since #1314 is merged. I see there are some conflicts, but they are trivial. It would also be good to replace as_array() with asarray() now.
Not sure about the above errors, but maybe they will have disappeared! 🤞
@Imraj-Singh want to give this merge a go? (shouldn't take more than 10 minutes)
Removing `from sirf.SIRF import *` from `cmake/sirf.__init__.py.in` (and hence `sirf/__init__.py` after installation) makes the normal SIRF tests work fine. However, presumably it will break the torch.py additions.
@casperdcl @evgueni-ovtchinnikov and I don't understand why this line creates trouble with numpy. A bit more information below.
Errors are like
```
>       if im.as_array().min()==im.as_array().max() or penalisation_factor == 0:

src/xSTIR/pSTIR/tests/tests_qp_lc_rdp.py:81:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = array([[[ 0., 0., 0., ..., 0., 0., 0.],
       [ 0., 0., 0., ..., 0., 0., 0.],
       [ 0., 0., 0., ...,...0., -0., ..., -0., -0., -0.],
       [-0., -0., -0., ..., -0., -0., -0.]]],
      shape=(15, 211, 211), dtype=float32)
axis = None, out = None, keepdims = False, initial = <no value>, where = True

    def _amin(a, axis=None, out=None, keepdims=False,
              initial=_NoValue, where=True):
>       return umr_minimum(a, axis, None, out, keepdims, initial, where)
E       TypeError: float() argument must be a string or a real number, not '_NoValueType'

../../../virtualenv/lib/python3.10/site-packages/numpy/_core/_methods.py:48: TypeError
```
In some cases, we see
```
/home/runner/work/SIRF/SIRF/src/xSTIR/pSTIR/tests/test_ObjectiveFunction.py:20: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  import os, shutil, numpy
```
but not in all tests that fail (and in some that don't fail).
Somewhat related is https://github.com/numpy/numpy/issues/27857, which says that there is some weird conflict with numpy python/C going on.
I've tested this on my local system, and adding the import line there didn't create any trouble at all.
Also worth noting that in our current CI, pytorch isn't even installed.
Plan:
- keep the `import` in `sirf.__init__.py` disabled for now
- use `import sirf.SIRF as sirf` in the torch files
- try to run CI using something similar to
  ```
  add_test(NAME Torch_TESTS_PYTHON COMMAND ${Python_EXECUTABLE} -m pytest --cov=sirf.torch --cov-config=.coveragerc-Torch src/torch/tests WORKING_DIRECTORY ${CMAKE_SOURCE_DIR})
  ```
@casperdcl are you merging master on here?
One job got through! The other failure is due to #1353; I'll fix that.
we now have
```
    msg = sirf.STIR.MessageRedirector(info=None, warn=None, errr=None)
          ^^^^^^^^^
AttributeError: module 'sirf.SIRF' has no attribute 'STIR'
```
so I guess the import trick didn't work
Easily fixed that, but...
```
$ python use_cases.py
PET Varnet
 *** Break *** segmentation violation
```
Seems to be crashing in STIR, so could be uncovering a STIR bug... sigh
The segfault happens when I install pytorch via conda. There are then 2 build problems: linker errors with missing pthread symbols, as in https://github.com/scikit-tda/ripser.py/issues/140, and two occurrences of omp.h (I might have deleted the wrong one).
If I install pytorch via pip, everything is fine. @casperdcl @paskino this might impact the PETRIC2 image.
Tests on my machine (no GPU), with current STIR and SIRF:
```
pytest gradcheck.py
```
gives
```
13 passed, 2 skipped, 13 warnings
```
which seems fine.
Running `python use_cases.py` first gives
```
PET Varnet
Error encountered: ??? "'PoissonLL accumulate_sub_Hessian_times_input: forward projection of input contains negatives. The result would be incorrect, so we abort.\\nSee https://github.com/UCL/STIR/issues/1461' exception caught at line 1329 of /home/kris/devel/cilsirfbuildnotorch/sources/SIRF/src/xSTIR/cSTIR/cstir.cpp; the reconstruction engine output may provide more information"
PET Varnet failed
...
```
This is expected and needs to be resolved in STIR. The output then continues with
```
Learned Primal Dual
Iteration: 0 Loss: 18.655933380126953
Iteration: 1 Loss: 17.91448974609375
Iteration: 2 Loss: 16.89657974243164
...
Iteration: 46 Loss: 7.5694499015808105
Iteration: 47 Loss: 7.539045333862305
Iteration: 48 Loss: 7.500810146331787
Iteration: 49 Loss: 7.456661701202393
qt.qpa.plugin: Could not find the Qt platform plugin "wayland" in ""
Comparing Gradient Descents
Learning Rate: 0.1, Number of Iterations: 5
Gradient Descent with Acquisition Model
Iteration: 0 Loss: -16909088.0
Iteration: 1 Loss: -17123176.0
Iteration: 2 Loss: -17307868.0
Iteration: 3 Loss: -17467476.0
Iteration: 4 Loss: -17605596.0
Gradient Descent with Objective Function
Iteration: 0 Loss: -16909088.0
Iteration: 1 Loss: -17123178.0
Iteration: 2 Loss: -17307868.0
Iteration: 3 Loss: -17467476.0
Iteration: 4 Loss: -17605596.0
```
which seems fine, but then displays the following uninspiring image.
Closing that, I get
```
Sum of absolute differences between GD solns: 0.00039303303
```
which could be fine.
I then also find the following 2 png files:
- `learned_primal_dual`
- `gd_comparison` (a white image)
@Imraj-Singh is this all as expected?
See https://github.com/SyneRBI/PETRIC-backend/blob/main/Dockerfile, which also uses pip :)
The LPD image seems okay, but the GD comparison doesn't seem correct. The code for the GD comparison is: https://github.com/Imraj-Singh/SIRF/blob/7c29a5c893b2ecfcce67b61e4ff0b79695652f64/src/torch/tests/use_cases.py#L295C1-L338C108.
Just as we see the image in the LPD case, we should see the GD reconstruction too (although we should remove the plt.show() prior to the saving for GD). I am not sure what is happening. One difference between the cases is that in GD we instantiate torch.nn.Parameters, so that they are parameters that autograd can update. When we return the image (https://github.com/Imraj-Singh/SIRF/blob/7c29a5c893b2ecfcce67b61e4ff0b79695652f64/src/torch/tests/use_cases.py#L391), we have to use ".data" to get the tensor from the parameter object. I don't know why that would cause an issue?
There could be a more obvious bug (maybe we need more GD steps to start seeing an image?), but from my reading of the code the only difference in extracting the values for display is that ".data"; a small illustration of that access pattern follows.
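For reference (made-up shapes and names, not the use-case code):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1, 1, 15, 211, 211))
# ... optimiser steps would update `param` in place ...

# `.data` and `.detach()` yield the same values for display purposes;
# `.detach()` is the usual idiom as it keeps autograd bookkeeping intact
assert torch.equal(param.data, param.detach())
image_for_plotting = param.detach().cpu().numpy().squeeze()
```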
Hi @KrisThielemans, I thought this would be a relatively quick job, but I've had a lot of errors building. I am building on top of a pytorch image, with the dockerfile:
```dockerfile
FROM pytorch/pytorch:2.6.0-cuda12.6-cudnn9-devel
RUN apt-get update
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    wget \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
```
I attempt to install via SIRF-SuperBuild using:
```sh
cmake -S ./SIRF-SuperBuild -B ./build \
    -DSIRF_SOURCE_DIR=/home/user/sirf_env/SIRF \
    -DDISABLE_GIT_CHECKOUT_SIRF=ON \
    -DBUILD_SIRF=ON \
    -DBUILD_STIR=ON \
    -DBUILD_Gadgetron=OFF \
    -DBUILD_NIFTYREG=ON \
    -DBUILD_CIL=OFF \
    -DBUILD_ASTRA=OFF \
    -DBUILD_pet_rd_tools=OFF \
    -DBUILD_siemens_to_ismrmrd=OFF \
    -DUSE_SYSTEM_pugixml=ON \
    -DSIRF_EXTRA_CMAKE_ARGS="-DUSE_Gadgetron=OFF;-DUSE_NIFTYREG=ON;-DBUILD_SIRF_Registration=ON"
```
It built after making some changes outlined here.
I am getting intermittent errors when using the wrapped objective function:
```
Gradient Descent with Objective Function
double free or corruption (out)
Aborted (core dumped)

PET Varnet
Segmentation fault (core dumped)

malloc(): memory corruption (fast)
Aborted (core dumped)
```
On the occasions it does not crash, if I tweak some of the use-case parameters I get the two attached outputs.
All 2D PET grad tests passed; the 3D PET grad tests take a very long time; and I haven't got Gadgetron to work, so I haven't tested that path.
The intermittent segfault is rather worrying... any ideas? I think the pytorch functionality is almost good to go, but after battling the build today I will try to have another look tomorrow.
Sigh. build problems.......
You could have a look at https://github.com/SyneRBI/SIRF-SuperBuild/pull/946 for how @casperdcl managed to get things to build.
The memory problems are obviously very worrying. Ideally you'd build SIRF and STIR with RelWithDebInfo and run the tests under valgrind. It'll throw up a lot of noise (false positives?) in Python etc., but hopefully it will also tell us where it's happening. Sorry.
"double free" sounds related to wrapping pointers without handling ownership properly :/
- Pre-rebase branch: https://github.com/SyneRBI/SIRF/tree/torch-prerebase
- Diff: https://github.com/SyneRBI/SIRF/compare/torch-prerebase...Imraj-Singh:SIRF:torch