dlib icon indicating copy to clipboard operation
dlib copied to clipboard

Bug fix in FFT tests when building with MKL

Open pfeatherstone opened this issue 1 year ago • 11 comments

pfeatherstone avatar Jan 25 '24 12:01 pfeatherstone

Maybe we should add apt install libmkl-dev to one of the runners

pfeatherstone avatar Jan 25 '24 12:01 pfeatherstone

~~for some reason, on my machine, the FFT tests break when both MKL and FFMPEG are used. Nothing to do with what this PR fixes. The break is in test_random_ffts() when using double and nr = 64, nc = 64. I'll post an issue once I'm sure it's not something with my setup. I've checked, the ffmpeg libraries don't use either fftw or mkl. They might use blas though, which will be pointing to mkl, and they might do some static initialization. I don't know. this is another good reason to enable mkl in CI/CD.~~

I think we definitely need to add MKL installation to CI/CD and check everything works. I can't figure out what's wrong with my machine, but there is something fishy going on with MKL (FFT and BLAS), OMP and ffmpeg. I'm definitely starting to see the benefit of something like Conan and/or docker for reproducible builds.

pfeatherstone avatar Jan 25 '24 13:01 pfeatherstone

There you go. I've added a runner that demonstrates this. My gut feeling is due to two different version of OMP being used. I'm not sure but I think libmkl_rt.so will dlopen intel's omp library at runtime, but libavformat and co use libgomp.so. Though dlib uses:

DftiSetValue(h, DFTI_THREAD_LIMIT, 1);

to make FFTs single threaded, i'm willing to bet it does the wrong thing when different versions of omp are being used at runtime. I'm not sure but i feel like this is it. @davisking can you check?

https://github.com/pytorch/pytorch/issues/19764 looks related

If you build with -DDLIB_USE_MKL_SEQUENTIAL=ON then everything works. By default, i think MKL will use omp, whether that's intel's omp or its own. I think the issue here is libmkl_rt.so is trying to use GNU omp, not its own, and failing miserably.

pfeatherstone avatar Jan 26 '24 09:01 pfeatherstone

I'm 100% convinced the problem is with OMP. When linking with ffmpeg libraries, it's inevitable that some library somewhere is linking to GNU OMP. If libmkl tries to use a bad omp library, then bad things happen. Not sure what the fix is.

EDIT: If i add mkl_set_num_threads(1); somewhere in the program, everything works. So it looks like DftiSetValue(h, DFTI_THREAD_LIMIT, 1); is not always enough. So maybe we need to do some static initialization somewhere. If you're doing a lot of BLAS stuff and you call this, you may substantially slow down your program. So I don't know. This is a mess.

pfeatherstone avatar Jan 26 '24 09:01 pfeatherstone

Possibly need additional runners with -DDLIB_USE_MKL_TBB and default behaviour.

pfeatherstone avatar Jan 26 '24 13:01 pfeatherstone

Yeah don't put mkl_set_num_threads(1) anywhere, that's going to surprise anyone using the MLK for other stuff in a bad way. I don't know what to do other than to say people have to link to groups of libraries that don't step on each other. It's kinda out of our hands if someone is doing that. But yeah sucks about ffmpeg pulling it in. That all said, if there is some cmake stuff we can add that somehow makes this easier for people then that would be lovely. You figure that end out yet?

davisking avatar Jan 26 '24 13:01 davisking

Well using MKL SEQUENTIAL definitely helps. When using apt installed MKL and ffmpeg, then everything is fine provided you use DLIB_USE_MKL_SEQUENTIAL. However, i'm still getting an issue on my machine when using a custom build of ffmpeg and using MKL, even with DLIB_USE_MKL_SEQUENTIAL. In this case, and on my machine, the FFT tests pass and the FFMPEG ones seg fault badly. It's possible dlib's find_blas.cmake script is still pulling in an MKL OMP library when it shouldn't. I'm not sure. Or it could be that the ffmpeg wrappers need to disable omp somehow if used. I thought libavformat and co just used pthread. But it's possible one of the libraries that IT depends on still uses omp. I'm not sure. This is tricky. What we don't want is for dlib to become a dependency manager tool. We have Conan and other tools for that. However, we want to make sure that a default build, which is likely to contain both MKL and FFMPEG on Intel machine will work reliably.

I think dlib's find_blas.cmake file needs tidying up. We need to clearly separate what is needed for BLAS, FFT and OMP. I would say we should make it impossible to pull in Intel's OMP library. Most libraries won't use that and won't be tested with it. It's possible that Intel's OMP and GNU's aren't ABI compatible.

pfeatherstone avatar Jan 26 '24 13:01 pfeatherstone

it looks like Pytorch has had similar issues in the past. https://github.com/pytorch/pytorch/issues/12535 The internet claims that INTEL and GNU OMP are ABI compatible though so i don't know anymore. What is clear is that there is an issue...

pfeatherstone avatar Jan 26 '24 13:01 pfeatherstone

Right i've done some tests on my home PC, which is ubuntu 18, so not quite the same as the runner, but anyways.

I've tested with:

  • MKL single dynamic library (mkl_rt)
  • MKL sequential
  • MKL tbb

And:

  • ffmpeg 4.4.2 (default on ubuntu 18)
  • ffmpeg 5.1.3 (built from source with default options)

On this setup, the only one that breaks is mkl_rt + ffmpeg 4.4.2. On my ubuntu 22 machine earlier today, my main issue again was with mkl_rt. So i think that's the common denominator when there is a dodgy OMP setup.

If someone else can test, that would be great.

pfeatherstone avatar Jan 26 '24 21:01 pfeatherstone

Also, on https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Using-MKL-Single-Dynamic-Library-mkl-rt/m-p/1500311 they mention the functions:

  • mkl_set_interface_layer()
  • mkl_set_threading_layer() when using mkl_rt. Do we need to call these explicitly?

pfeatherstone avatar Jan 26 '24 21:01 pfeatherstone

I bet they totally aren't ABI compatible.

What we don't want is for dlib to become a dependency manager tool.

Right, users constantly want dlib to do this for them. But I really really do not want to be in that game. I already get a ton of questions that are not about dlib but really "how do I install cuda?" or some other non-dlib thing.

It's possible dlib's find_blas.cmake script is still pulling in an MKL OMP library when it shouldn't

Doing make VERBOSE=1 will show you exactly what cmake is running. You should be able to see unambiguously what's being pulled in.

davisking avatar Jan 27 '24 15:01 davisking

I've just updated my home pc to ubuntu 22. I'll do some tests again. But my gut feeling is that this problem is avoidable if you explicitly link to either MKL sequential, tbb or gnu (I didn't realise there was a libmkl_gnu_thread until yesterday). If you link to the single dynamic library (libmkl_rt.so), then you can in theory get problems with OMP. If you use sequential or tbb, then in theory it won't touch omp.

So maybe the cmake script could link to libmkl_rt.so as a last resort, but then print a warning message saying "this could behave badly if your application also links to libgomp". Something like that.

As you said, dlib shouldn't manage dependencies, it should do what users tell it to do, and it's then their responsibility to manage their deps. We simply give them the option to link to BLAS, MKL, and if the latter, link to sequential, tbb or other. Up to them. If it breaks, it's their problem.

We could however use the MKL API to read the currently used interface using mkl_get_interface_layer() and mkl_get_threading_layer() on first use, and check if omp is being used (either at runtime or build time), then warn appropriately.

pfeatherstone avatar Jan 27 '24 17:01 pfeatherstone

Any kind of read on first use stuff requires global variables though and singletons. Which is fraught with problems though. I know it can be done, but the odds of that kind of thing interfering with someone else's use of these tools is pretty high. So I would definitely want to just have things link together right. Or if it really is impossible to use the IntelMKL with ffmpeg then we can just make cmake refuse to use the MKL if someone is asking it to link to ffmeg and print a message about it and why. It's really not on dlib to resolve the dependency clash those libraries are making.

davisking avatar Jan 28 '24 15:01 davisking

This PR is consumed in https://github.com/davisking/dlib/pull/2912

pfeatherstone avatar Feb 25 '24 09:02 pfeatherstone