bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Random bcftools concat CI failures on Windows

Open jkbonfield opened this issue 2 years ago • 4 comments

Even with the same commit hash, sometimes make check works, and sometimes not. For example:

https://ci.appveyor.com/project/jkbonfield/bcftools/builds/46710244#L6077 https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6077

This test does both bcf and vcf. In both cases the vcf worked, although oddly there's a rogue newline in the command line reported for the second URL above in the vcf command line. This may mean some bizarre line-wrapping buglet going on? I'm really not sure.

However one other difference in the first test_naive_concat bcf test is the use of --naive-force over --naive. The documentation claims to use this with caution as no compatibility check is done. Given this is a random error and the tests take 10 minutes to run, we can only run one at a time, and it takes lots of tests to get one failing, this isn't an ideal thing to experiment with.

The BCF headers contain things like dates and times. We rapidly create 10 test files in a row:

    for my $file (@files)
    {
        cmd("$$opts{bin}/bcftools view -Ob -o $file.bcf $file.vcf");
        cmd("$$opts{bin}/bcftools view -Oz -o $file.vcf.gz $file.vcf");
    }

What happens if the time changes between two of those files? It may give different compressed sizes of the headers, and then give different byte offsets for downstream data? I'm clutching at straws because I've no idea what --naive-force is actually warning about. Is it literally a rapid byte copying function, or is it doing decode and reencode jobs and it's irrelevant if the headers differ slightly, as long as they have the same contig names and lengths?

Diagnosing this remotely like this is almost impossible. Locally I can't get it to fail at all, but locally it also runs much quicker, so that's also that got me wondering if it's timestamps that are causing it to break. I see some code has --no-version options. So maybe that would cure it.

Any clues welcomed as I'm just making wild stabs in the dark at the moment.

jkbonfield avatar Apr 05 '23 16:04 jkbonfield

Argh, this is another impossible to reproduce random bug!

On my local windows box I tried:

$ for i in `seq 1 100`;do TEST_OPTS="-t tests.out -f test_naive_concat" make test-plugins 2>&1 | grep failed;done
    failed  .. 0
[repeated 100x over]

Yet we know this test does fail sometimes on AppVeyor. Eg https://ci.appveyor.com/project/samtools/bcftools/builds/46694487#L6079

I'll try it again after modifying the perl script to force the seed to 3434840604. (So far it's working although not many iterations done, but even so it's not turning it into an always-fail bug.)

Edit: no errors there either.

So for faster iteration I ran through the actual command it's executing 1000 times and that also works:

$ for i in `seq 1 1000`;do ./bcftools.exe concat --naive-force tests.out/naive_concat.*.bcf 2>/dev/null|wc -c;done | uniq
377537

Yet clearly the 7th file on appveyor had a bgzf error:

	Non-zero status 127
		[W::bcf_sr_add_reader] No BGZF EOF marker; file '-' may be truncated
		[E::bgzf_read_block] Invalid BGZF header at offset 0
		[E::bgzf_read] Read block operation failed with error 2 after 0 of 5 bytes
		[E::bcf_hdr_read] Failed to read the header (reading BCF in text mode?)
		Failed to read from standard input: could not parse header
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.0.bcf	0.003108 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.1.bcf	0.002063 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.2.bcf	0.001913 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.3.bcf	0.000318 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.4.bcf	0.001813 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.5.bcf	0.003632 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.6.bcf	0.002178 seconds
		Concatenating C:/msys64/tmp/TOCMppXCxh/naive_concat.7.bcf
.. failed ...

Rob was reporting errors with his fixes-1.17 branch too, in the exact same test. Although his time it failed on the 4th bcf file. Yet other times this same code works on AppVeyor.

So clearly this is a genuine intermittent bug as the fail twice on the same test in the same place is vanishingly small for it to just be a random system glitch. Sadly I can't work out how to reproduce it locally though.

My exhaustive testing has been on the concat command, but that is very robust. So I'm now assuming it's more likely to be the generation of the data that the concat command is reading that has silently failed. Any clues welcome.

jkbonfield avatar Apr 06 '23 08:04 jkbonfield

I should have thought about those errors more carefully. Specifically:

[W::bcf_sr_add_reader] No BGZF EOF marker; file '-' may be truncated

That coming from blah | bcftools view -H. The question that needed asking is, why is it validating if the file has EOF when reading from a pipe?

The answer to this, surprisingly, is that on newer mingw releases lseek on a pipe no longer returns -1! Eg:

#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

int main(int argc, char **argv) {
  printf("Seek END returned %ld\n", (long)lseek(0, 0, SEEK_END));
  return 0;
}

Now test it on a small file, with "< file" (fd 0 is a real file on disk) vs "cat file | ..." (fd 0 is a pipe):

jkbon@nagafen MINGW64 ~/bcftools
$ ./a.exe < _
Seek END returned 226472
# the actual file length

jkbon@nagafen MINGW64 ~/bcftools
$ cat _ | ./a.exe
Seek END returned 131072
# the size of a pipe buffer?

This appears to have changed behaviour somewhere between mingw gcc 11.2 and gcc 12.2, but it's likely not the compiler per se and rather somewhere in the standard C library. I'm struggling to find another route for validating this. We could potentially modify the hfile struct somehow so stdin sets a flag which is then explicitly checked later, but given this is just validation of input my inclination is to fix this with a #ifdef __MINGW32__ guard and simply avoid the problematic behaviour. I filed this as a bug with mingw.

jkbonfield avatar Apr 11 '23 10:04 jkbonfield

As to why this is randomly failing, I think it's because most of the files being concatenated are < 128Kb long so the SEEK_END -28 actually works, but occasionally it gets a long one (they're randomly generated with a different seed every time), which then seeks to the wrong location.

jkbonfield avatar Apr 11 '23 10:04 jkbonfield

I just received the same error, apparently this can still happen https://cirrus-ci.com/task/6336215039868928?logs=test#L6624

test_naive_concat:
	/tmp/cirrus-ci-build/bcftools concat --naive /tmp/jSUwzfxxPq/naive_concat.0.vcf.gz /tmp/jSUwzfxxPq/naive_concat.1.vcf.gz /tmp/jSUwzfxxPq/naive_concat.2.vcf.gz /tmp/jSUwzfxxPq/naive_concat.3.vcf.gz /tmp/jSUwzfxxPq/naive_concat.4.vcf.gz /tmp/jSUwzfxxPq/naive_concat.5.vcf.gz /tmp/jSUwzfxxPq/naive_concat.6.vcf.gz /tmp/jSUwzfxxPq/naive_concat.7.vcf.gz /tmp/jSUwzfxxPq/naive_concat.8.vcf.gz /tmp/jSUwzfxxPq/naive_concat.9.vcf.gz | /tmp/cirrus-ci-build/bcftools view -H
	Non-zero status 9
		Checking the headers of 10 files.
		Done, the headers are compatible.
		Concatenating /tmp/jSUwzfxxPq/naive_concat.0.vcf.gz	0.019761 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.1.vcf.gz	0.005515 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.2.vcf.gz	0.007389 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.3.vcf.gz	0.003913 seconds
		Concatenating /tmp/jSUwzfxxPq/naive_concat.4.vcf.gz[E::bgzf_read_block] Failed to read BGZF block data at offset 182405 expected 12805 bytes; hread returned 12763
		Error: BCF read error
		
.. failed ...

pd3 avatar Jun 25 '24 16:06 pd3