
About RegisterOperators in C++

Open zsef123 opened this issue 5 years ago • 26 comments

🐛 Bug

I'm testing jit in C++ (libtorch).

My test case:

#include <iostream>

#include <torch/script.h>

int main(int argc, const char* argv[]) {
    auto& ops = torch::jit::getAllOperators();
    std::cout << "torch jit operators\n";
    for (auto& op: ops) {
        auto& name = op->schema().name();
        if (name.find("torchvision") != std::string::npos)
            std::cout << "op : " << op->schema().name() << "\n";
    }
    std::cout << "\n";
    return 0;
}

>>> torch jit operators

Expected output (after I changed some code in vision):

>>> torch jit operators
op : torchvision::_cuda_version
op : torchvision::deform_conv2d
op : torchvision::ps_roi_pool
op : torchvision::ps_roi_align
op : torchvision::_new_empty_tensor_op
op : torchvision::roi_pool
op : torchvision::roi_align
op : torchvision::nms

In the first case, it seems no jit operators are registered in libtorch.

Does this only occur in my environment, or is some code wrong?

I ask because I only got the expected output after changing the sources.

Environment

Install command follows the README

torch 1.5.0 and torchvision 0.6.0

zsef123 avatar Apr 23 '20 10:04 zsef123

This issue is better discussed on the forum. I will close this issue, but please feel free to re-open with more information and details about what was changed so that we can reproduce.

Moreover, if you encounter an issue with jit, I encourage you to open an issue on pytorch instead.

vincentqb avatar Apr 23 '20 17:04 vincentqb

@vincentqb This is not a pytorch problem. It's about how the registration API is called when using C++ torchvision.

The problem is that the registration code is not run when a custom project links against torchvision.

When I moved the registration code into a header, it gets compiled directly into my project and the registration runs as expected.

Could you check the jit operators in C++?

zsef123 avatar Apr 24 '20 01:04 zsef123

If I understand correctly: you have one version that works, and another that doesn't? Can you provide a complete minimal example to run with how you compile so we can reproduce?

cc @fmassa

vincentqb avatar Apr 25 '20 00:04 vincentqb

@vincentqb Basically I just installed via cmake and ran hello_world. I'm now trying 0.6, but 0.5 has the same problem.

  1. I installed following the README.rst:
cmake -DCMAKE_PREFIX_PATH="/home/user/torch_build/libtorch150" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CUDA=on ..
make -j16
make install
  2. collect_env.py
user(t3) in ~/torch_build via env150 
➜ python3 collect_env.py                  
Collecting environment information...
PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
CMake version: version 3.17.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0+cu101
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect
  3. main.cpp
#include <iostream>

#include <torchvision/vision.h>

int main()
{
  auto& ops = torch::jit::getAllOperators();
  std::cout << "torch jit operators\n";
  for (auto& op: ops) {
      auto& name = op->schema().name();
      if (name.find("torchvision") != std::string::npos)
          std::cout << "op : " << op->schema().name() << "\n";
  }
  std::cout << "\n";
  return 0;
}

And I run the main function above.

I expected torch::jit::getAllOperators in main to find the operators related to torchvision, but in my environment it shows nothing.

So I moved the torch::RegisterOperators() parts into a header (.h) instead of a .cpp. If I do this, the compiler sees the registration code directly.

After that, the torchvision ops show up as I expected.

In my experience, I hit the same problem with a custom kernel of my own when using jit. Once I changed the code to call the registration directly, everything ran fine and I could find my kernel in getAllOperators.
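
For reference, the kind of direct registration I mean looks roughly like this; the op name and kernel below are just placeholders for illustration, not my real code:

#include <torch/script.h>

// Placeholder kernel, only so there is something to register.
torch::Tensor identity_op(torch::Tensor input) {
  return input;
}

// Because this static object lives in the consumer's own translation unit,
// the linker has no chance to discard the registration.
static auto registry =
    torch::RegisterOperators("myops::identity", &identity_op);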

zsef123 avatar Apr 25 '20 01:04 zsef123

Thanks for the report @zsef123 !

cc'ing @suo, as I believe this is currently an issue with the torchscript RegisterOperators (or our use of it), which makes the operators not readily available to users, who instead need to re-register the ops in their own code.

fmassa avatar Apr 27 '20 11:04 fmassa

@fmassa So the vision lib intends for each user to manually register the ops in their project, rather than having them registered automatically by including the torchvision header.

Is that right? I think I misunderstood that design.

zsef123 avatar Apr 27 '20 12:04 zsef123

That's currently what's needed, but I don't think this is a desirable setup. Maybe we just need to move the RegisterOperator calls from the C++ file into the headers in torchvision, so that they get registered automatically once the user includes it.

fmassa avatar Apr 27 '20 13:04 fmassa

Let's keep this issue open until we figure out a better way of handling this

fmassa avatar Apr 27 '20 13:04 fmassa

I think what's happening is that the linker is throwing out the static registration of the operators when linking in the torchvision lib, since the RegisterOperator object is never referenced from the main executable. Can you try linking in torchvision with -Wl,--whole-archive?

suo avatar Apr 27 '20 19:04 suo

@suo Thanks for the comment!

I tried it like the code below.

CMakeLists.txt

cmake_minimum_required(VERSION 3.10)
project(hello-world)

find_package(TorchVision REQUIRED)

add_executable(hello-world main.cpp)
target_link_libraries(hello-world
    -Wl,--whole-archive
    TorchVision::TorchVision
)

But these errors occur:

[ 50%] Linking CXX executable hello-world
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o): In function `__multi3':
(.text+0x0): multiple definition of `__multi3'
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_muldi3.o):(.text+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/8/libgcc.a(_negdi2.o): In function `__negti2':
(.text+0x0): multiple definition of `__negti2'

...

Is this a problem with my setup?

user(t3) in vision/examples/cpp/hello_world/build on  master [!?] via env150 took 2s 
➜ cmake --version                          
cmake version 3.17.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

user(t3) in vision/examples/cpp/hello_world/build on  master [!?] via env150 
➜ gcc --version                               
gcc (Ubuntu 8.3.0-16ubuntu3~16.04) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

zsef123 avatar Apr 29 '20 06:04 zsef123

whole-archive is not the right flag. The real problem is that, because your executable isn't using any symbols from torchvision, the linker is pruning it away. You can verify that this is the case by running ldd on the resulting executable; you will see that there is a dependency on libtorch.so but not on torchvision. If you specify

target_link_libraries(hello-world
    -Wl,--no-as-needed
    TorchVision::TorchVision
)

This will force the linker to keep the dependency.

@fmassa we should consider updating the TorchVision cmake target to automatically specify --no-as-needed; this would help prevent this common failure mode.

ezyang avatar Apr 30 '20 18:04 ezyang

@bmanga could you look into adding --no-as-needed in the torchvision CMakeLists?

fmassa avatar May 05 '20 17:05 fmassa

@fmassa May I suggest using an inline variable or a function to register the operators instead? I can give that a try first, so we can avoid hacks in the cmake scripts.

bmanga avatar May 05 '20 18:05 bmanga

@ezyang what do you think?

I think it might be preferable to avoid having to ask users to call register_torchvision_ops() in their C++ files.

fmassa avatar May 05 '20 18:05 fmassa

@fmassa Inline variables should work just like the current approach. Officially they are C++17, but we can get essentially identical functionality using __attribute__((weak)).

I personally don't mind a RegisterOps function. I think there might be users of torchvision that don't need them, and for those who do it's just a function call in their main.

Either approach is IMO better than using linker hacks, otherwise we end up with a mess like pytorch's cmake scripts. As a bonus, we can even remove the hardcoded requirement that the lib must be built as a shared library.

EDIT: I think __attribute__((constructor)) is exactly what we need here. I will try to create a PR tomorrow.

bmanga avatar May 05 '20 20:05 bmanga

@bmanga I don't think __attribute__((constructor)) will work? It will put things in the ctors section of your library, which is the same thing that static initializers do, and the linker will prune it away if you literally don't depend on any symbols.

There's nothing wrong with having a register operators function. It doesn't do any harm, and it will save you from having to do --no-as-needed. But, for better or worse, PyTorch as a library relies a LOT on static initializers, and I think it's better to fall in line with the prevailing convention as a downstream library (torchvision). So I am advocating that for the default behavior, we should pass in the right linker flags. It is not too much cmake.
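
To make the trade-off concrete, the explicit-call style would look roughly like the sketch below; the demo namespace, op, and register_demo_ops() function are made up here, standing in for whatever torchvision would export:

#include <iostream>

#include <torch/script.h>

namespace demo {

torch::Tensor identity(torch::Tensor t) {
  return t;
}

// Stand-in for a library-provided registration entry point. Calling it from
// main() also references a symbol from the library, so --as-needed would not
// drop the dependency.
void register_demo_ops() {
  static auto registry = torch::RegisterOperators("demo::identity", &identity);
}

} // namespace demo

int main() {
  demo::register_demo_ops();
  const auto& ops = torch::jit::getAllOperatorsFor(
      torch::jit::Symbol::fromQualString("demo::identity"));
  std::cout << "registered overloads: " << ops.size() << "\n";
  return 0;
}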

ezyang avatar May 06 '20 00:05 ezyang

cc'ing @malfet too

ezyang avatar May 06 '20 00:05 ezyang

@ezyang true, late night thoughts should be left for the morning 😅

I think it is a bit unreasonable to expect the operators to be registered just by linking the library without even including a single header. Is that really a common case?

What about the following?

  • Have the user include any header from the library, even just vision.h
  • Document in the README that if you are only planning to use torchvision ops, you must either include something or add set_target_properties(exe PROPERTIES LINK_WHAT_YOU_USE TRUE) to the torchvision consumer target.

If you think that is still not acceptable, I will copy the code from caffe2_interface_library macro.

bmanga avatar May 06 '20 10:05 bmanga

Hi @bmanga

Thanks for the discussion. After seeing the discussion here, I think it might be preferable to follow PyTorch's convention.

@bmanga can you clarify why we would need to copy the code from caffe2_interface_library, instead of just passing --no-as-needed?

fmassa avatar May 06 '20 17:05 fmassa

@fmassa because if we really want this behavior

  • it should be done for windows, mac and linux
  • We want to limit it to torchvision only, so we need to add "-Wl,--no-as-needed,libtorchvision.so,-Wl,--as-needed"
  • If we can, it would be good not to tie ourselves into requiring that torchvision be built as a shared library

caffe2_interface_library mostly solves these problems, although it is not completely correct either, because it assumes that the user target has --no-as-needed enabled by default, which may not be the case and could be a source of other issues.

bmanga avatar May 06 '20 19:05 bmanga

I agree that writing the relevant cmake code portably is not so easy. I am a bit curious how the #include trick would work, since I feel like you still have to use the symbols somehow; just externing them isn't enough. (Maybe if the header sets some weak symbol?)

ezyang avatar May 06 '20 21:05 ezyang

@ezyang yes, that's what I had in mind when I mentioned inline variables. In the vision.h header, we could have something like:

#if (defined __cpp_inline_variables) || __cplusplus >= 201703L
#define VISION_INLINE_VARIABLE inline
#else
#ifdef _MSC_VER
#define VISION_INLINE_VARIABLE __declspec(selectany)
#else
#define VISION_INLINE_VARIABLE __attribute__((weak))
#endif
#endif

namespace vision {
namespace impl {
int RegisterOps() noexcept;
VISION_INLINE_VARIABLE int dummy = RegisterOps();
}
}

And that should be enough to ensure that the library is loaded (RegisterOps is defined in a cpp file), and that the code is run only once.
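
For completeness, the .cpp side of that sketch could be something along these lines, assuming the registration itself still goes through torch::RegisterOperators (the op name and kernel are placeholders, not torchvision's real sources):

#include <torch/script.h>

namespace vision {
namespace impl {
namespace {

torch::Tensor placeholder_kernel(torch::Tensor t) {
  return t;
}

} // anonymous namespace

int RegisterOps() noexcept {
  // The function-local static guarantees registration runs exactly once,
  // no matter how many translation units instantiate the inline variable.
  static auto registry =
      torch::RegisterOperators("visiondemo::placeholder", &placeholder_kernel);
  return 0;
}

} // namespace impl
} // namespace vision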

bmanga avatar May 07 '20 07:05 bmanga

It feels a bit dirty doing this, but I'm willing to give it a try.

ezyang avatar May 08 '20 01:05 ezyang

Hi. I am a maintainer of the Homebrew package manager. We have been struggling to update to pytorch 2.1.0 and torchvision 0.16.0 for a few weeks now. Both build fine, but we have a small test that fails to run for torchvision.

We build the following binary as a small test:

#include <assert.h>
#include <torch/script.h>
#include <torch/torch.h>
#include <torchvision/vision.h>
int main() {
  auto& ops = torch::jit::getAllOperatorsFor(torch::jit::Symbol::fromQualString("torchvision::nms"));
  assert(ops.size() == 1);
}

We are then building / linking with:

usr/bin/g++-11 -std=c++17 test.cpp -o test -fopenmp -I/home/linuxbrew/.linuxbrew/opt/pytorch/include -I/home/linuxbrew/.linuxbrew/opt/pytorch/include/torch/csrc/api/include -I/home/linuxbrew/.linuxbrew/Cellar/torchvision/0.16.0/include -L/home/linuxbrew/.linuxbrew/opt/pytorch/lib -Wl,-rpath,/home/linuxbrew/.linuxbrew/opt/pytorch/lib -ltorch -ltorch_cpu -lc10 -Wl,--whole-archive -L/home/linuxbrew/.linuxbrew/Cellar/torchvision/0.16.0/lib -ltorchvision

Unluckily, the operators are not registered. I see that various pull requests have been merged and there was some back and forth, but it is unclear to me from all that whether there was a change in behaviour with the latest torchvision release. Is there anything special we need to do (register the operators manually? use -Wl,--whole-archive?)?

Any help would be appreciated so we can update the package in Homebrew. Let me know if it's ok to revive this old issue, or if I should open a separate issue?

iMichka avatar Nov 05 '23 21:11 iMichka

You need --no-as-needed (i.e. -Wl,--no-as-needed placed before -ltorchvision on the link line) rather than --whole-archive.

ezyang avatar Nov 06 '23 13:11 ezyang

Cool; thanks that worked! :)

iMichka avatar Nov 12 '23 10:11 iMichka