galaxy icon indicating copy to clipboard operation
galaxy copied to clipboard

refactoring of iter_headers and test for guess_ext

Open bernt-matthias opened this issue 4 years ago • 9 comments
trafficstars

This PR implements some changes that were originally in https://github.com/galaxyproject/galaxy/pull/12280

  1. refactoring of iter_headers: instead of using compression_utils.get_fileobj of the given data is not a FilePrefix it might be better
  • to just construct a FilePrefix from it and use
  • its line_iterator method (which also uses compression_utils.get_fileobj)
  1. implement a test running sniffers on all test data
  • instead of a doc test its now a unit test
  • the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)
  • some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))

Point 2 uncovered a few bugs:

  • add missing sniffer for FastqSolexa in datatypes.conf.sample
  • bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files
  • add missing (?) @build_sniff_from_prefix for Bref3

How to test the changes?

(Select all options that apply)

  • [ ] I've included appropriate automated tests.
  • [x] This is a refactoring of components with existing test coverage.
  • [ ] Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • [x] I agree to license these contributions under Galaxy's current license.
  • [x] I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

bernt-matthias avatar Sep 08 '21 16:09 bernt-matthias

  1. implement a test running sniffers on all test data
  • instead of a doc test its now a unit test
  • the test runs guess_ext on all test data (and not all sniffers on all test data .. which would be much stricter .. probably to strict)
  • some files in the test data dir needed to be renamed such that the file extension matches the guessed extension which is the assumption of the test (would be nice of someone would go through them to check if all is correct ... some were weired, e.g. vcf.gz is detected as tabular (and not tabular.gz))

Point 2 uncovered a few bugs:

  • add missing sniffer for FastqSolexa in datatypes.conf.sample
  • bugfix (?) in fastq sniffers: strip spaces from quality values line which were present in some of the test data files, question is if something like this is allowed in real world fastq files
  • add missing (?) @build_sniff_from_prefix for Bref3

That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.

I don't think we want to sniff Solexa.

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

mvdbeek avatar Sep 13 '21 10:09 mvdbeek

That seems redundant with the upload integration tests, which are breaking now (they use the extension to map the dataset to the expected datatype). I think there's more value in the the integration tests since they also run in the order we use in practice.

I see .. and I'm fine with reverting.

Why is guess_ext using a different sniff order?

Also it might still be interesting to investigate: Most of the failures are because for files containing "false" the assertion is inverted and files not containing a "." are ignored .. which is both fine and leaves 2 cases.

  • for 2.fastqsolexa the difference is because in the test PARENT_SNIFFER_MAP = {'fastqsolexa': 'fastq'} is defined. might be easier to drop this and to rename the file if we decide to keep not sniffing fastqsolexa.
  • for 0_nonewline.txt there seems to be a md5 mismatch which seems odd.

I don't think we want to sniff Solexa.

Why not?

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

Should I fix the test data or make the sniffer more strict?

bernt-matthias avatar Sep 13 '21 10:09 bernt-matthias

Also odd that the the integration test does not stumble over the renamed vcf.gz file...

bernt-matthias avatar Sep 13 '21 11:09 bernt-matthias

Why not?

It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.

mvdbeek avatar Sep 16 '21 09:09 mvdbeek

Why not?

It's overlapping in quality scores with the other fastq encodings, so sniffing is difficult, and so old noone is gonna use these in practice.

Accepted. Then I will remove it from the datatypes xml file as well as the sniffer function as well

bernt-matthias avatar Sep 16 '21 09:09 bernt-matthias

Then I will remove it from the datatypes xml file as well as the sniffer function as well

Keep it in the datatypes_conf.xml please, we just don't want a sniffer, I think. If you know you have a solexa encoded file that's fine and you should be able to select it.

mvdbeek avatar Sep 16 '21 09:09 mvdbeek

@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary

mvdbeek avatar Sep 20 '21 13:09 mvdbeek

@bernt-matthias Can we push this to 22.01 ? There's a few conflicts and failing tests here. We can always apply bugfixes to existing releases if necessary

Sure

bernt-matthias avatar Sep 20 '21 14:09 bernt-matthias

That seems redundant with the upload integration tests,

Just a few thoughts on this: With the new test I discovered for instance the Bref3 class misses the @build_sniff_from_prefix somehow this seems not covered by the upload test. Btw, can you point me to the upload integration tests? Is this actually testing for correct sniffing?

I don't think additional whitespaces are valid, as they would throw off the sequence to quality score correspondence ?

The point here is that in this case our test data is wrong, since it contains trailing space spaces, So I could change it to rstrip or fix the test data.

bernt-matthias avatar Jan 14 '22 16:01 bernt-matthias