Fatal crash: boost "Invalid cross-device link"

Open PhilRW opened this issue 1 year ago • 9 comments

  • Linux <hostname-redacted> 5.10.0-25-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
  • Docker version 24.0.5, build ced0996
  • Using edge docker image from docker hub
  • /data is mounted volume

Program crashes with the following:

boost::filesystem::copy_file: Invalid cross-device link: "/dev/shm/codtrs5/9580-1693338509_852987500.wav", "/data/codtrs5/2023/8/29/9580-1693338509_852987500.wav"
0x7f443a0325d9: (gr::tagged_stream_block::check_topology(int, int)+0x2e49)
0x7f4439c2f24c: (std::rethrow_exception(std::__exception_ptr::exception_ptr)+0x7c)
0x7f4439c2f2b7: (std::terminate()+0x17)
0x7f4439c2f23e: (std::rethrow_exception(std::__exception_ptr::exception_ptr)+0x6e)
0x5595a221e941: (Call_Concluder::manage_call_data_workers()+0xeb1)
0x5595a2140604: (monitor_messages()+0x394)
0x5595a2134210: (main+0x740)
0x7f443987bd90: (__libc_init_first+0x90)
0x7f443987be40: (__libc_start_main+0x80)
0x5595a2137ab5: (_start+0x25)

PhilRW avatar Aug 29 '23 19:08 PhilRW

Problem seems to be mitigated by setting transmissionArchive to false.

PhilRW avatar Aug 29 '23 19:08 PhilRW

If you still want to keep transmission archives, the other option is to set tempDir to the same directory (or at least the same drive) as captureDir in the config file. Keeping both of those on the same device should avoid the issue, but you'll lose any benefit of recording the individual transmissions to a tmpfs instead of storage media.
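To sanity-check whether two directories really are on the same device, something like the following could be used. It's just a minimal sketch using POSIX stat(); the helper name same_device is made up for illustration and is not part of trunk-recorder.

```cpp
#include <sys/stat.h>
#include <string>

// Hypothetical helper (not from trunk-recorder): returns true when both
// paths report the same device id (st_dev), i.e. a copy between them does
// not cross a filesystem boundary. Returns false if either path cannot be
// stat'ed.
bool same_device(const std::string& a, const std::string& b) {
    struct stat sa{};
    struct stat sb{};
    if (stat(a.c_str(), &sa) != 0 || stat(b.c_str(), &sb) != 0) {
        return false;
    }
    return sa.st_dev == sb.st_dev;
}
```

Calling it with the tempDir and captureDir paths from the config would tell you whether the copy crosses devices. With the paths from the crash above (/dev/shm and /data) it would be expected to return false.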

There are a handful of boost library/kernel combos that can cause this, but it's ultimately related to a kernel issue that existed between Linux 5.3 and 5.18. Boost added a workaround at some point, and it was fixed in the 6.x kernel, but some distros like Debian 11 might still run into the "cross-device link" error.

Since this only really happens under a certain set of circumstances, it might even be best that transmissionArchive: true disables the use of a temp space. If you're keeping all those wavs, it's not like the tempDir is saving any drive wear; it's just adding complexity.

taclane avatar Aug 29 '23 20:08 taclane

Just for posterity's sake I'd like to confirm taclane's findings. My main recorder ran the TR official docker image on a Debian 11 box with a backported 6.x kernel and still ran into this error. It was configured to archive transmissions and tempDir wasn't set - I configured it to use a directory on the same volume as the existing audio storage and I can now run newer code without problem.

sally-yachts avatar Oct 02 '23 17:10 sally-yachts

For more context, this still happens with the latest edge code on a fresh Debian 12 (bookworm) install with kernel 6.1.0-13. Would love any input on known-working boost/kernel versions to address this, as using something like shm for temp data keeps latency-sensitive IO off of storage altogether, which enables a lot more flexibility in deployment.

This workaround also unfortunately triggered a corner case in concert with bad firmware from Samsung and caused two brand-new SSDs to burn through their usable life in a couple of months, necessitating RMA.

sally-yachts avatar Dec 04 '23 16:12 sally-yachts

It was a little convoluted to map out, but for those using transmissionArchive, the problem seems to be along the lines of:

The current boost::filesystem::copy_file will error if BOTH:

  • boost < 1.76
  • linux kernel 5.3 or greater (6.x included)

But std::filesystem::copy_file will only error if:

  • linux kernel 5.3 or greater (6.x NOT included)
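Written out as code, the two rules above look roughly like this. It's only a sketch of the conditions as stated in this comment, not a verified compatibility matrix, and the names (KernelVersion, boost_copy_may_fail, std_copy_may_fail) are made up for illustration. The only outside fact used is that BOOST_VERSION encodes major * 100000 + minor * 100 + patch, so 1.76.0 is 107600.

```cpp
// Boost 1.76.0 expressed in BOOST_VERSION form.
constexpr int kBoostWithWorkaround = 107600;

// Hypothetical struct for illustration only.
struct KernelVersion {
    int major_no;
    int minor_no;
};

// True for kernel 5.3 and anything newer (including 6.x).
constexpr bool at_least_5_3(KernelVersion k) {
    return k.major_no > 5 || (k.major_no == 5 && k.minor_no >= 3);
}

// Rule 1: boost::filesystem::copy_file may error if BOTH boost < 1.76
// and the kernel is 5.3 or greater (6.x included).
constexpr bool boost_copy_may_fail(int boost_version, KernelVersion k) {
    return boost_version < kBoostWithWorkaround && at_least_5_3(k);
}

// Rule 2: std::filesystem::copy_file may error only on kernel 5.3 or
// greater, 6.x NOT included.
constexpr bool std_copy_may_fail(KernelVersion k) {
    return at_least_5_3(k) && k.major_no < 6;
}
```

For example, kernel 6.5 with libboost 1.74 (107400) gives boost_copy_may_fail == true but std_copy_may_fail == false, which is the case where switching to std::filesystem helps.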

#886 should address this by checking the boost version and attempting a std::filesystem::copy_file if it detects that the boost library hasn't been updated yet. If the installed boost lib is new enough, it will use that instead, which should be a better workaround for anyone using kernel 5.3-5.18.
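For anyone curious what a more defensive copy could look like, here is a rough sketch of a different approach from the version check in #886: try std::filesystem::copy_file first, and if it reports a cross-device link error, fall back to a plain read/write copy. The function names are hypothetical and this is not the code in the PR.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Plain buffered read/write copy. Avoids kernel copy offload entirely,
// so it does not depend on source and destination sharing a device.
bool stream_copy(const fs::path& from, const fs::path& to) {
    std::ifstream in(from, std::ios::binary);
    if (!in) return false;
    std::ofstream out(to, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    char buf[64 * 1024];
    while (in) {
        in.read(buf, sizeof buf);
        out.write(buf, in.gcount());
    }
    out.flush();
    // Success means we stopped because of end-of-file, not a read error,
    // and every write went through.
    return in.eof() && out.good();
}

// Hypothetical wrapper: use copy_file, and only fall back to the manual
// copy when the failure is specifically "Invalid cross-device link".
bool copy_with_fallback(const fs::path& from, const fs::path& to) {
    std::error_code ec;
    if (fs::copy_file(from, to, ec)) return true;
    if (ec == std::errc::cross_device_link) return stream_copy(from, to);
    return false;
}
```

The trade-off is that the fallback path gives up whatever fast path the library would normally use, but it only kicks in when the regular copy has already failed.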

I just tried this with kernel 6.5 / libboost 1.74, and it prevented a previous error from occurring as the transmission wavs were copied out of the /dev/shm tmpfs to disk.

taclane avatar Dec 06 '23 00:12 taclane

Pulled the latest docker image (edge tag that includes #886), let tempDir default back to shm, and it ran all night without an issue. Good stuff!

sally-yachts avatar Dec 08 '23 16:12 sally-yachts

Cool!
All that's left is to pull in #887 to fix a typo for boost compatibility going forward (>1.76), and that should hopefully be the end of this issue.

taclane avatar Dec 08 '23 16:12 taclane

MERGED!! 3 Cheers to @taclane for squashing this bug 🎉

robotastic avatar Dec 08 '23 16:12 robotastic

Looks like there still might be a race condition hiding in the workaround somewhere; I get crashes about every 24-36hrs that seem to reference copying a transmission from temp to archive but the file already exists. The docker image still has boost 1.74 so I expect if we can bump that up to something newer than 1.76 then it'll probably defuse the landmine for good.

sally-yachts avatar Jan 06 '24 18:01 sally-yachts