
Consolidate code to resolve SNP issues

Open · apriha opened this issue 4 years ago · 6 comments

Create a new `Resolver` class to resolve SNP issues. Refactor `SNPs._assign_par_snps` to this class. Also, this class could be used to implement solutions to #13 and #19.

apriha · Nov 28 '19

Central to resolving SNP issues (e.g., assigning SNPs on chrom 0 (#13), populating missing RSIDs (#19), and assigning PAR SNPs) is having a resource that maintains a list of RSIDs and, at a minimum, the chromosome and position for each RSID (for at least GRCh37; remapping to other builds could be performed dynamically).

Initially I was thinking that a local resource of RSIDs could be built up by querying an API (e.g., as demonstrated in #19). However, in the absence of a database, that could result in concurrency issues if multiple threads are trying to build the resource.

Additionally, a large amount of data could be transferred for these requests. For processing more than 100K SNPs, NCBI Variation Services recommends downloading data from the FTP site.

Therefore, I think the best solution here is to download dbSNP as a resource. Specifically, ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz would provide a list of all SNPs for GRCh37, with the RSID, reference allele, alternate alleles, and a variety of flags indicating properties of the SNP (e.g. clinical significance).

Having this reference would also make saving SNPs as a VCF more robust, providing the ALT alleles, plus the additional flags.

The only problem is that the compressed VCF file is ~14GB (~100GB uncompressed) - too large to load into memory with pandas. A tabix file that indexes the VCF is provided, but a pure Python solution for loading this resource would be preferred.

Perhaps it'd be possible to parse the tabix index and load portions of the gzip, similar to how large VCFs are decompressed and loaded dynamically with snps?
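
Roughly, something like this could work for region-based access. This sketch uses pysam for the tabix lookup (a C extension rather than pure Python), and the file path and RefSeq contig name are just placeholders:

```python
# Sketch: query the tabix-indexed dbSNP VCF for a genomic region.
# Assumes the .tbi index sits alongside the VCF; path and contig name are
# placeholders, not values used by snps.
import pysam

DBSNP_VCF = "GCF_000001405.25.gz"

def fetch_rsids_in_region(contig, start, end):
    """Yield (rsid, pos, ref, alts) for dbSNP records overlapping a region."""
    tbx = pysam.TabixFile(DBSNP_VCF)
    try:
        for line in tbx.fetch(contig, start, end):
            chrom, pos, rsid, ref, alt = line.split("\t")[:5]
            yield rsid, int(pos), ref, alt.split(",")
    finally:
        tbx.close()

# dbSNP names contigs with RefSeq accessions (chromosome 1 on GRCh37 is
# NC_000001.10), so chromosome names would need to be mapped first.
for rsid, pos, ref, alts in fetch_rsids_in_region("NC_000001.10", 1_000_000, 1_001_000):
    print(rsid, pos, ref, alts)
```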

apriha · Jan 08 '20

Alternatively, I think it could be possible to read the gzip and post-process it into compressed pickles, one for each chromosome (perhaps more chunks if these are still too large). Then the Resolver could fix issues for SNPs, loading the pickles chromosome by chromosome.
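
A rough sketch of that conversion, assuming the VCF is sorted by chromosome so only one chromosome's records need to be held in memory at a time (column selection, chunk size, and output naming are illustrative only):

```python
# Sketch: stream the gzipped dbSNP VCF with pandas and write one compressed
# pickle per chromosome.
import pandas as pd

VCF_COLS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def vcf_to_chrom_pickles(vcf_path, out_dir, chunksize=1_000_000):
    reader = pd.read_csv(
        vcf_path,
        sep="\t",
        comment="#",  # skip the VCF header lines
        names=VCF_COLS,
        usecols=["CHROM", "POS", "ID", "REF", "ALT"],
        dtype=str,
        chunksize=chunksize,
        compression="gzip",
    )
    current_chrom, parts = None, []

    def flush():
        # write out the chromosome accumulated so far
        if current_chrom is not None and parts:
            df = pd.concat(parts, ignore_index=True)
            df.to_pickle(f"{out_dir}/dbsnp_{current_chrom}.pkl.gz", compression="gzip")

    for chunk in reader:
        for chrom, group in chunk.groupby("CHROM", sort=False):
            if chrom != current_chrom:
                flush()  # relies on the VCF being sorted by chromosome
                current_chrom, parts = chrom, []
            parts.append(group)
    flush()
```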

apriha · Jan 08 '20

So this is an interesting proposition - my one concern is that the 14GB would need to be downloaded for this functionality to work. In our use case, for example, we actually use snps within a Lambda function, which has no persistent memory. Therefore this data would need to be re-downloaded with each request. I'm not sure how common my particular use case is, however!

If we are able to query 100k SNPs at a time via NCBI Variation Services, then we could call this endpoint sequentially, 100k SNPs per request, until all of the SNPs have been retrieved.
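
A minimal sketch of the sequential batching (the batch size and placeholder data are illustrative only; the actual request to NCBI Variation Services would go where the comment is):

```python
# Split the full list of SNPs into fixed-size batches and submit them one at a time.
def batches(items, size=100_000):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start : start + size]

all_snps = list(range(250_000))  # placeholder for the SNPs to be resolved
for batch in batches(all_snps):
    # one request to NCBI Variation Services per batch would go here
    print(f"would submit {len(batch)} SNPs")
```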

We could also provide both of these pieces of functionality and let the package user decide which method suits them best.

willgdjones · Jan 09 '20

So with the NCBI Variation Services API, requests are limited to one per second. I think the following two endpoints could be used to resolve issues:

  1. `/vcf/file/set_rsids` - This could be used to solve #19. However, it's limited to 50K SNPs per request. Additionally, data for this endpoint is POSTed to the API; for each SNP, it requires the CHROM, POS, REF allele, and ALT allele. To anonymize the data, generic ALT alleles could be POSTed.
  2. `/refsnp/{rsid}` - This is currently used to assign PAR SNPs, and in a similar manner, it could be used to solve #13. However, this endpoint returns details for one SNP at a time, and since the API limit is one request per second, the time required increases linearly with the number of SNPs to process. A rough sketch of both calls follows this list.
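
For illustration, here's a rough sketch of how both endpoints could be called. The `/refsnp/{rsid}` call is the one already used for PAR SNPs; the exact request format for `/vcf/file/set_rsids` (base URL, query parameters, content type, and use of a generic ALT) is an assumption here and would need to be checked against the NCBI Variation Services docs:

```python
# Sketch of calling the two NCBI Variation Services endpoints, respecting the
# one request/second limit. Request details for set_rsids are assumptions.
import time

import requests

API = "https://api.ncbi.nlm.nih.gov/variation/v0"

def get_refsnp(rsid):
    """GET details for a single RefSNP (one SNP per request)."""
    resp = requests.get(f"{API}/refsnp/{rsid}")
    resp.raise_for_status()
    time.sleep(1)  # respect the 1 request/second limit
    return resp.json()

def set_rsids(vcf_lines, assembly="GCF_000001405.25"):
    """POST up to 50K variants as VCF text; the response has IDs populated.

    Each line needs CHROM, POS, ID, REF, ALT; a generic ALT could be sent to
    anonymize the data (assumption, not verified against the API docs).
    """
    resp = requests.post(
        f"{API}/vcf/file/set_rsids",
        params={"assembly": assembly},
        data="\n".join(vcf_lines),
        headers={"Content-Type": "text/plain"},
    )
    resp.raise_for_status()
    time.sleep(1)
    return resp.text
```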

I agree that both types of processing should be provided, especially since the dbSNP VCF is 14GB, and users may only need to assign PAR SNPs. But for bulk processing of files, I think that the dbSNP VCF would be much more efficient.

Inspecting the dbSNP VCF, it has all of the information required to resolve these issues - and more, e.g., it eliminates the need to download the FASTA sequences to write VCF output, it could replace the "ID" genotype with the actual insertions / deletions, and it could populate missing pos in some files.

apriha · Jan 10 '20

> In our use case, for example, we actually use snps within a Lambda function which has no persistent memory.

Would there be any way to build the resources into the container / image filesystem where snps is being executed? Then, snps could point to that resource directory and load local resources.

apriha · Jan 10 '20

> Would there be any way to build the resources into the container / image filesystem where snps is being executed? Then, snps could point to that resource directory and load local resources.

On Lambda, this functionality is unfortunately not available.

willgdjones · Jan 10 '20