How can I use a model built with the TorchSparse library in C++?
Is there an existing issue for this?
- [X] I have searched the existing issues
Have you followed all the steps in the FAQ?
- [X] I have tried the steps in the FAQ.
Current Behavior
I want to build and run inference in C++ on a TorchSparse model trained in Python.
How do I do this?
I know that when downsampling and upsampling are mixed in Sparse Convolution, the graph cannot be traced using torch.jit.trace.
Error Line
No error lines.
Environment
- PyTorch: 1.12.1
- PyTorch CUDA: 11.7
Full Error Log
No response
This is not yet supported. Could you kindly let us know your use case? Thanks!
Hi, I had the same question and did some digging to see what's needed.
Use cases:
- Inference with libtorch (C++), where runtime inference performance is better (lower overhead), e.g. real-time inference, at least for R&D purposes.
- This also requires JIT support; once all layers are jittable, some performance improvements can be leveraged.
- With some cleanups and improvements, exporting the model to TensorRT also becomes easier.
Usage workflow:
- Export the model to JIT: torch.jit.script(model).save('output.pt')
- Build the C++ application and link against the torchsparse library
- Register the operations (done automatically if the TORCH_LIBRARY macro is used instead of the pybind11 macro)
- Load the graph and call forward()
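For reference, a minimal sketch of the Python-side export step, assuming the library changes discussed below have been made so that the model is actually scriptable (the layer choices here are purely illustrative):

```python
import torch
import torchsparse.nn as spnn

# Illustrative network; any TorchScript-compliant torchsparse model would do.
model = torch.nn.Sequential(
    spnn.Conv3d(4, 32, kernel_size=3),
    spnn.BatchNorm(32),
    spnn.ReLU(True),
).eval()

# Script (rather than trace) so that control flow is preserved, then save the
# graph for the C++ application to load with torch::jit::load.
torch.jit.script(model).save("output.pt")
```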
TODOs:
- Export the backend operations via the TORCH_LIBRARY macro (instead of the pybind11 macro)
  - This registers the ops for both C++ and Python.
  - This lets the JIT compiler see the operations (a JIT-scripted graph is required for C++ export).
  - Minor changes to the APIs: replace int with int64_t, etc.
- The bigger changes are in the Python classes and the functional API (see the sketch after this list):
  - Might need a rewrite, or a jittable()-like implementation as in PyG: https://pytorch-geometric.readthedocs.io/en/latest/advanced/jit.html
  - No dynamically sized tuples; think of C++ tuples, where the element count is fixed (Tuple[int, ...] is not allowed).
  - The kmap dicts need some changes to their key and value types: either fix the tuple sizes for the strides, or use a larger tuple and fill the redundant entries with -1 (e.g. (1, 1, 1, -1) for 3D while supporting up to 4D).
  - No global variables! Specifically the buffer used for the faster algorithm; this is not allowed. Perhaps change the API to pass in a buffer?
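To illustrate the kind of change these constraints imply, here is a small hypothetical sketch (the function and variable names are mine, not torchsparse API): variable-length data moves from Tuple[int, ...] to List[int], and the kmap dictionary gets a TorchScript-friendly key type.

```python
from typing import Dict, List

import torch

# Before (not scriptable): Tuple[int, ...] strides and tuple-keyed kmap dicts.
# After (scriptable): List[int] for variable-length data and a fixed key type.

@torch.jit.script
def make_kmap_key(stride: List[int]) -> str:
    # Serialize the stride into a string key, similar in spirit to the
    # f"{list(input.stride)}|..." approach mentioned later in this thread.
    key = ""
    for s in stride:
        key += str(s) + ","
    return key

kmaps: Dict[str, torch.Tensor] = {}
kmaps[make_kmap_key([2, 2, 2])] = torch.empty(0)
```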
So far I have managed the first part; it was quite straightforward: replace the int parameters of the functions with int64_t and float with double.
I'm trying out the second phase now; not sure how it will turn out. If anyone is interested, I can open PRs for each part.
Small update: I managed to port the current master branch for JIT compilation. It is still a bit hacky, and I'm planning to open a PR once the 2.1 sources are out (#236 & #237) to integrate those changes.
Timing:
With examples/example.py I got a 1.5-2x speedup for C++ inference (JIT-exported) vs. Python (with and without JIT). I expect even further improvements with the torchsparse 2.1 algorithms.
What had to be done:
- Use TORCH_LIBRARY macro instead of pybind11 for bindings
- Port the conv3d autograd function to C++ and export it with the above macro (https://github.com/pytorch/pytorch/issues/69741)
- Change/remove all Tuple[int, ...] to List[int]: dynamically sized tuples in Python are essentially std::vectors on the C++ side, whereas std::tuples must be fixed-size.
- Change the dict key type to str in SparseTensor: I just used f"{list(input.stride)}|..." and didn't notice a significant performance penalty; there are also other optimization opportunities.
Once all this was done, the shared object can be loaded into a C++ program with dlopen (or linked against), and then the exported JIT-compiled model can be loaded.
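For completeness, these two steps can also be smoke-tested from Python before moving to C++; this is only a sketch, and the shared-object path is a placeholder for whatever your build actually produces:

```python
import torch

# Load the shared object so the TORCH_LIBRARY-registered backend ops become
# visible under torch.ops (the path here is a placeholder).
torch.ops.load_library("build/libtorchsparse_backend.so")

# The exported graph can then be loaded without the Python torchsparse
# package, mirroring what the C++ side does with torch::jit::load.
model = torch.jit.load("output.pt").eval()
```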
Looking forward to your work!
Any update or work-in-progress branch? I am interested in this feature as well.
Is there any update?
Hi,
My apologies for the late reply. I managed to get the conversion to work by following the recipe I shared above.
However, I'm unable to share that implementation. Furthermore, I realised this needs some fundamental changes to the library implementation and would be better done in coordination with the original authors.
Quick start
The process requires the previously mentioned steps. Most crucially:
- Use the TORCH_LIBRARY macro instead of using pybind11 directly.
  - A secondary advantage I found is that the resulting binary has no Python dependencies, which also makes it smaller.
  - It can also be loaded directly into a C++ program, which can then call the torch ops.
  - https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html#using-the-torchscript-custom-operator-in-c
- Make the Python code TorchScript compliant.
  - This took most of the time! Mainly because this codebase relies heavily on duck typing, while TorchScript is 'statically typed'.
  - But the TorchScript compiler was a great help here!
  - I wrote a small unit test like the one below, targeting the torchsparse conv3d operator. One could also start with a smaller scope (e.g. SparseTensor) and fix things from there.

        import torch
        import torchsparse

        def test_torch_jit_conv():
            scripted_conv3d = torch.jit.script(torchsparse.nn.functional.conv3d)  # This is *just* an example

- Test-driven development is crucial.
  - In addition to the authors' tests, extra tests were written for the JIT conversion, comparing the JIT variant against the original to ensure nothing changed during scripting (a sketch of such a test follows below).
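A minimal sketch of such a comparison test; the helper and the toy op are placeholders, and in practice fn would be a torchsparse op such as torchsparse.nn.functional.conv3d called on a SparseTensor input:

```python
import torch

def compare_eager_vs_scripted(fn, *inputs, atol: float = 1e-6, rtol: float = 0.0):
    # Script the function and check that it produces the same output as the
    # eager version on the given inputs.
    scripted = torch.jit.script(fn)
    expected = fn(*inputs)
    actual = scripted(*inputs)
    torch.testing.assert_close(actual, expected, atol=atol, rtol=rtol)

def toy_op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a real torchsparse op.
    return torch.relu(x) * 2.0

compare_eager_vs_scripted(toy_op, torch.randn(4, 8))
```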
Results:
- Tested a real-world model and the examples in both Python and C++.
- While ImplicitGEMM is good for training and inference, the newly implemented Fetch-on-Demand approach is more memory-efficient and faster for inference-only use cases.
Retrospection
I'm not 100% sure TorchScript is the way to go for this: it required significant changes to the codebase and was rather intrusive.
An alternative I am considering is to convert to ONNX via the Torch Dynamo backend (which uses FX tracing, so a different set of constraints) and mark the torchsparse ops as custom ops. Then the runtime (e.g. TensorRT) can load them as a plugin.
Both approaches (TorchScript and ONNX/TensorRT) have their advantages and drawbacks, and both will need significant changes to the library itself. If there is more interest, we can start breaking this down into smaller parts and get it going :)
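For reference, a rough sketch of what the Dynamo-based ONNX path could look like on a recent PyTorch (2.1+). This uses a placeholder dense model, since the torchsparse kernels would additionally need to be exposed as ONNX custom ops and implemented as TensorRT plugins, which is exactly the open library-level work discussed here:

```python
import torch

# Placeholder dense model and input; a real torchsparse network would also
# need its sparse ops registered as custom ops for the exporter.
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU()).eval()
example_input = torch.randn(1, 8)

# Dynamo-based ONNX export (requires the onnxscript package).
onnx_program = torch.onnx.dynamo_export(model, example_input)
onnx_program.save("model.onnx")
```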