galaxy
galaxy copied to clipboard
refactoring of iter_headers and test for guess_ext
This PR implements some changes that were originally in https://github.com/galaxyproject/galaxy/pull/12280
- refactoring of iter_headers: instead of using
compression_utils.get_fileobjof the given data is not a FilePrefix it might be better
- to just construct a FilePrefix from it and use
- its
line_iteratormethod (which also usescompression_utils.get_fileobj)
- implement a test running sniffers on all test data
- instead of a doc test its now a unit test
- the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)
- some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))
Point 2 uncovered a few bugs:
- add missing sniffer for
FastqSolexaindatatypes.conf.sample - bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files
- add missing (?)
@build_sniff_from_prefixfor Bref3
How to test the changes?
(Select all options that apply)
- [ ] I've included appropriate automated tests.
- [x] This is a refactoring of components with existing test coverage.
- [ ] Instructions for manual testing are as follows:
- [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]
License
- [x] I agree to license these contributions under Galaxy's current license.
- [x] I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].
- implement a test running sniffers on all test data
- instead of a doc test its now a unit test
- the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)
- some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))
Point 2 uncovered a few bugs:
- add missing sniffer for
FastqSolexaindatatypes.conf.sample- bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files
- add missing (?)
@build_sniff_from_prefixfor Bref3
That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.
I don't think we want to sniff Solexa.
I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?
That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.
I see .. and I'm fine with reverting.
Why is guess_ext using a different sniff order?
Also it might still be interesting to investigate: Most of the failures are because for files containing "false" the assertion is inverted and files not containing a "." are ignored .. which is both fine and leaves 2 cases.
- for
2.fastqsolexathe difference is because in the testPARENT_SNIFFER_MAP = {'fastqsolexa': 'fastq'}is defined. might be easier to drop this and to rename the file if we decide to keep not sniffing fastqsolexa. - for
0_nonewline.txtthere seems to be amd5mismatch which seems odd.
I don't think we want to sniff Solexa.
Why not?
I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?
Should I fix the test data or make the sniffer more strict?
Also odd that the the integration test does not stumble over the renamed vcf.gz file...
Why not?
It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.
Why not?
It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.
Accepted. Then I will remove it from the datatypes xml file as well as the sniffer function as well
Then I will remove it from the datatypes xml file as well as the sniffer function as well
Keep it in the datatypes_conf.xml please, we just don't want a sniffer, I think. If you know you have a solexa encoded file that's fine and you should be able to select it.
@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary
@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary
Sure
That seems redundant with the upload integration tests,
Just a few thoughts on this: With the new test I discovered for instance the Bref3 class misses the @build_sniff_from_prefix somehow this seems not covered by the upload test. Btw, can you point me to the upload integration tests? Is this actually testing for correct sniffing?
I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?
The point here is that in this case our test data is wrong, since it contains trailing space spaces, So I could change it to rstrip or fix the test data.