htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

new class: VCFReaderFactory a factory for VCFReader

Open lindenb opened this issue 3 years ago • 2 comments

Description

This PR adds a file VCFReaderFactory that open a VCFReader using a few methods.

VCFReader open(final File vcfFile,boolean requiredIndex);
VCFReader open(final Path vcfFile,boolean requiredIndex);
VCFReader open(final String vcfFile,boolean requiredIndex);

the aim is to provide a generic factory opening instance of VCFReaders instead of opening directltly a VCFReader. This gives a chance to decode variants files that are not VCF (database, multiple VCF, etc...). The default implementation of VCFReaderFactory returns an instance of VCFFileReader.

A new property CUSTOM_VCF_READER_FACTORY in Defaults is used to define the class of the default implementation.

I added a few tests in VCFReaderFactoryTest that are mostly a copy of VCFReaderTest

there are some problem with travis but I'm not sure it comes from my code

Things to think about before submitting:

  • [ ] Make sure your changes compile and new tests pass locally.
  • [X] Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • [X] Extended the README / documentation, if necessary
  • [X] Check your code style.
  • [X] Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

lindenb avatar Apr 23 '21 11:04 lindenb

@lindenb This is a very reasonable thing to want to do. I think it has pretty strong overlap with what @cmnbroad is working on in the CZI grant work right now to enable codec discovery and support multiple input sources fo things like vcf.

I'm not really a fan of injecting a single custom factory class through an environment variable. I know we've done it before for Reads but it was kind of gross then as well. It's just not very flexible if you for instance want to load from something like mixed sources. Registering dynamically loadable codecs in the manifest and then picking the right one automatically is something we're actively working on right now.

I assume you have a particular alternate VariantContext record sources you want to use? Is there any way to identify what type of source it is by the URI you access it at? We're currently building a system to figure this out for dynamically loaded codecs and a real world vcf example could be really useful.

Maybe you could be a beta tester for the new codec api?

lbergelson avatar May 11 '21 19:05 lbergelson

@cmnbroad , thanks for reviewing

I assume you have a particular alternate VariantContext record sources you want to use?

yes for now I've got a wrapper over bcftools (reading BCF) and I'll write another one reading plink files.

Is there any way to identify what type of source it is by the URI you access it at?

yes, I could use bcftools:///path/to/my/bcf for example

We're currently building a system to figure this out for dynamically loaded codecs and a real world vcf example could be really useful.

so I can wait for your implementation, no problem.

lindenb avatar May 14 '21 20:05 lindenb