galaxy refactoring of iter_headers and test for guess

trafficstars

This PR implements some changes that were originally in https://github.com/galaxyproject/galaxy/pull/12280

refactoring of iter_headers: instead of using compression_utils.get_fileobj of the given data is not a FilePrefix it might be better

to just construct a FilePrefix from it and use
its line_iterator method (which also uses compression_utils.get_fileobj)

implement a test running sniffers on all test data

instead of a doc test its now a unit test
the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)
some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))

Point 2 uncovered a few bugs:

add missing sniffer for FastqSolexa in datatypes.conf.sample
bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files
add missing (?) @build_sniff_from_prefix for Bref3

How to test the changes?

(Select all options that apply)

[ ] I've included appropriate automated tests.
[x] This is a refactoring of components with existing test coverage.
[ ] Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

[x] I agree to license these contributions under Galaxy's current license.
[x] I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

Sep 08 '21 16:09 bernt-matthias

implement a test running sniffers on all test data

instead of a doc test its now a unit test

the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)

some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))

Point 2 uncovered a few bugs:

add missing sniffer for FastqSolexa in datatypes.conf.sample

bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files

add missing (?) @build_sniff_from_prefix for Bref3

That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.

I don't think we want to sniff Solexa.

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

Sep 13 '21 10:09 mvdbeek

That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.

I see .. and I'm fine with reverting.

Why is guess_ext using a different sniff order?

Also it might still be interesting to investigate: Most of the failures are because for files containing "false" the assertion is inverted and files not containing a "." are ignored .. which is both fine and leaves 2 cases.

for 2.fastqsolexa the difference is because in the test PARENT_SNIFFER_MAP = {'fastqsolexa': 'fastq'} is defined. might be easier to drop this and to rename the file if we decide to keep not sniffing fastqsolexa.
for 0_nonewline.txt there seems to be a md5 mismatch which seems odd.

I don't think we want to sniff Solexa.

Why not?

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

Should I fix the test data or make the sniffer more strict?

Sep 13 '21 10:09 bernt-matthias

Also odd that the the integration test does not stumble over the renamed vcf.gz file...

Sep 13 '21 11:09 bernt-matthias

Why not?

It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.

Sep 16 '21 09:09 mvdbeek

Why not?

It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.

Accepted. Then I will remove it from the datatypes xml file as well as the sniffer function as well

Sep 16 '21 09:09 bernt-matthias

Then I will remove it from the datatypes xml file as well as the sniffer function as well

Keep it in the datatypes_conf.xml please, we just don't want a sniffer, I think. If you know you have a solexa encoded file that's fine and you should be able to select it.

Sep 16 '21 09:09 mvdbeek

@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary

Sep 20 '21 13:09 mvdbeek

@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary

Sure

Sep 20 '21 14:09 bernt-matthias

That seems redundant with the upload integration tests,

Just a few thoughts on this: With the new test I discovered for instance the Bref3 class misses the @build_sniff_from_prefix somehow this seems not covered by the upload test. Btw, can you point me to the upload integration tests? Is this actually testing for correct sniffing?

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

The point here is that in this case our test data is wrong, since it contains trailing space spaces, So I could change it to rstrip or fix the test data.

Jan 14 '22 16:01 bernt-matthias

galaxy
galaxy copied to clipboard

refactoring of iter_headers and test for guess_ext

How to test the changes?

License

galaxy galaxy copied to clipboard

refactoring of iter_headers and test for guess_ext

How to test the changes?

License

galaxy
galaxy copied to clipboard