Random test failure in remake 4.3: output-sync deadlocks sometimes

Open dkogan opened this issue 5 months ago • 7 comments

Hi. I'm the Debian maintainer for remake. This bug report just came in to the Debian BTS:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108812

This is on a clean Debian GNU/Linux machine on amd64. The reporter observes that the output-sync test fails intermittently. I poked around, and I see this as well on his machine. What is supposed to happen (from "tests/scripts/features/output-sync") is:

#   foo/Makefile - has a "foo" target that waits for the bar target
#   bar/Makefile - has a "bar" target that runs immediately
#                - has a "baz" target that waits for the foo target

When the test fails, the "bar" target builds its prerequisite, but the recipe for "bar" itself is never run. Because "bar" never runs, the "foo" and "baz" targets never complete. I looked at various diagnostics, but cannot see why this happens. We run "remake -j", so all the jobs should be queued. The logs (-x and -d) don't show any errors; the expected job just doesn't materialize.

I cannot reproduce this on my machine, and I cannot see this with make 4.3 either.

I'm attaching logs of "remake -x -d -j -Orecurse" for good and bad runs. The good run eventually does

Must remake target 'bar'

while the bad run does not. Any debugging suggestions? The reporter might be willing to give you access to the failing box, if that is helpful. Thanks
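A quick way to tell the two kinds of run apart is to search each log for that line. The following is only a minimal sketch: the helper name `must_remake_seen` is made up for illustration, and it assumes the logs contain the message exactly as quoted above.

```shell
# Hypothetical helper: succeed if the debug log shows make deciding to
# rebuild the given target.
# $1: log file; $2: target name (defaults to "bar").
must_remake_seen() {
    log=$1
    target=${2:-bar}
    # -F matches a fixed string, -q suppresses output (exit status only)
    grep -Fq "Must remake target '$target'" "$log"
}

# Example usage with the attached logs:
# for f in log.good.txt log.bad.server.txt; do
#     if must_remake_seen "$f" bar; then
#         echo "$f: bar was scheduled"
#     else
#         echo "$f: bar was never scheduled"
#     fi
# done
```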

log.bad.server.txt

log.good.txt

dkogan avatar Jul 08 '25 07:07 dkogan

Hello. Original reporter here.

I'm building all Debian packages from source to ensure that the end user will be able to rebuild them. Dima has asked me what's special about the machine where this test fails so often, and I really don't know. What I have observed is that on machines of type c7a.large, m7a.large or r7a.large from AWS, which incidentally have 2 vCPUs, the failure rate for the package is around 58%.
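For anyone who wants to measure a failure rate like this on their own machine, here is a rough sketch. The function name `count_failures` and the time limit in the usage example are assumptions, not part of the package build; it relies on `timeout` from GNU coreutils and counts a run that hits the limit (such as a deadlock) as a failure.

```shell
# Hypothetical helper: run a command repeatedly under a time limit and
# print "failures/runs".
# $1: number of runs; $2: per-run limit in seconds; rest: the command.
count_failures() {
    runs=$1
    limit=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # timeout(1) exits non-zero (124) when the limit is hit, so a
        # hung run is counted the same way as an ordinary failure.
        if ! timeout "$limit" "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs"
}

# Example usage from a built source tree (the 300-second limit is a guess):
# count_failures 20 300 make check
```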

We would be particularly interested to know if this is a sign that the program is not behaving correctly, or just that the test is buggy and those failures should be ignored, in which case it would be quite easy for us (Debian) to disable the offending test in the Debian package.

Thanks.

sanvila avatar Jul 08 '25 11:07 sanvila

Thanks for the detailed information and packaging on Debian. I will try to look at this over the weekend.

Is this also happening on remake-4.4, the fork of GNU Make 4.4? Or on GNU Make 4.3?

Right now, I can't think of a likely cause.

rocky avatar Jul 08 '25 11:07 rocky

Hello. I was not able to reproduce this with GNU Make 4.3, although remake 4.3 is affected. remake 4.4 isn't out yet, so I have not tried it. But if that's ready to test, I will try it shortly. Thanks.

dkogan avatar Jul 09 '25 03:07 dkogan

I just ran some tests using the undebianized source, directly from upstream. I see:

  • remake 4.3 fails. I built the remake-4.3+dbg-1.6 tag like this: ./autogen.sh && ./configure --disable-nls --enable-maintainer-mode && make && make check. It deadlocks every time on the features/output-sync test.
  • remake 4.4 fails too. I checked out the remake-4-4 branch, built it the same way, and I see the same deadlock. Every time.
  • GNU Make 4.4.1 does NOT fail. I downloaded https://ftp.gnu.org/gnu/make/make-4.4.1.tar.gz, and built it like this: ./configure --disable-nls && make && make check. It passes the features/output-sync test each time, but I've seen it fail the targets/SECONDARY test. That may or may not be related, and I've only seen it happen once, after running the test suite many times.

So I guess the good news is that on some machines the remake failure is VERY reproducible. Any other test to run before you get a chance to take a look?

Thanks

dkogan avatar Jul 09 '25 03:07 dkogan

Thanks - this is helpful. I will look at this this weekend and get back to you.

rocky avatar Jul 09 '25 10:07 rocky

I've started looking at this. Many tests in the 4.4 branch are broken.

It will probably take me a bit of time, but I'll slowly start to get these working again.

As for that particular features/output-sync test: in 4.3 the test was probably a little flaky, since it was changed in 4.4 to add more sleeps. Even so, it looks like there were improvements to parts of GNU Make 4.4's output.c file that I haven't tracked. I have started tracking them in the branch remake-4-4-test-fixup, but it will take me a while to fix up all of the tests.

rocky avatar Jul 11 '25 01:07 rocky

OK. I will disable this test in the Debian builds. Thanks for looking.

dkogan avatar Jul 11 '25 05:07 dkogan