
Question: Can Load Function add VCF to existing database?

Open dgaston opened this issue 12 years ago • 9 comments

Not sure if the load function currently handles this or if it would need to be added. This way it would be possible to use gemini to maintain a database of locally sequenced exomes/genomes. Useful for screening variants for disease sequencing projects based on local population controls.

dgaston avatar Mar 25 '13 14:03 dgaston

Currently, the answer is no. There is no mechanism to update databases. One alternative would be to use vcftools' vcf-merge operation to combine old and new VCFs and then load the result into gemini.
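A minimal sketch of that workaround, written as a Python helper that only builds the two shell commands rather than running them. The file names are placeholders, and the sketch assumes both input VCFs are already bgzip-compressed and tabix-indexed (which vcf-merge expects) and that `vcf-merge`, `bgzip`, and `gemini` are on the PATH. Whether your gemini version loads the compressed output directly is worth checking first.

```python
import shlex


def merge_and_load_commands(old_vcf, new_vcf, merged_vcf, db_path):
    """Build the shell commands for the merge-then-reload workaround.

    All four paths are caller-supplied placeholders; nothing is executed here.
    """
    # vcf-merge writes the merged VCF to stdout, so compress it on the way out.
    merge = "vcf-merge {} {} | bgzip -c > {}".format(
        shlex.quote(old_vcf), shlex.quote(new_vcf), shlex.quote(merged_vcf)
    )
    # Rebuild the gemini database from scratch using the merged VCF.
    load = "gemini load -v {} {}".format(
        shlex.quote(merged_vcf), shlex.quote(db_path)
    )
    return [merge, load]


if __name__ == "__main__":
    for command in merge_and_load_commands(
        "old.vcf.gz", "new.vcf.gz", "merged.vcf.gz", "cohort.db"
    ):
        print(command)
```

Note that this rebuilds the whole database on every run, so the full load time is paid each time new samples arrive.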

In the future I'd like to support this, but it will be some time before we address this, as there are, in my opinion, useful workarounds.

arq5x avatar Mar 25 '13 15:03 arq5x

Sorry, from this answer I am assuming that gemini creates a single database for each VCF loaded?

That is unfortunate... I was just thinking this would be a powerful way to add your sequenced samples to a database in order to query them. But not all samples arrive at the same time, and according to your own README, loading is (as expected) a slow process, so it would be extremely useful if you added a feature to append an annotated VCF to an existing database...

The workaround means having to create a new database from a merged VCF every time you have additional samples.

duartemolha avatar Apr 03 '13 13:04 duartemolha

I've made a fork, if I have the time over the next little while I'll see if I can cook up some sort of add function.

dgaston avatar Apr 03 '13 13:04 dgaston

fantastic @dgaston :)

duartemolha avatar Apr 03 '13 13:04 duartemolha

@duartemolha yes, you are correct. It is certainly possible to update a database, but we have not yet implemented it, mainly because there are alternatives (above). Some complexities exist, e.g. handling newly observed alleles at the same locus (INDELs especially) and updating all of the pre-computed metrics that are based on genotypes.
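To illustrate why the genotype-based metrics need refreshing, here is a small sketch of recomputing an alternate allele frequency after new samples are added. The function names and the count triple are hypothetical and are not gemini's actual code. It assumes diploid genotypes and that per-variant genotype counts are available for both the existing and the new samples.

```python
def alt_allele_freq(num_hom_ref, num_het, num_hom_alt):
    """Alternate allele frequency from diploid genotype counts.

    Returns None when no genotypes were called.
    """
    called = num_hom_ref + num_het + num_hom_alt
    if called == 0:
        return None
    # Each het carries one alt allele, each hom-alt carries two.
    return (num_het + 2 * num_hom_alt) / (2 * called)


def add_samples(existing_counts, new_counts):
    """Combine (hom_ref, het, hom_alt) counts and recompute the frequency."""
    combined = tuple(a + b for a, b in zip(existing_counts, new_counts))
    return combined, alt_allele_freq(*combined)


if __name__ == "__main__":
    existing = (8, 2, 0)   # ten samples already in the database
    incoming = (0, 0, 10)  # ten newly sequenced samples
    print(alt_allele_freq(*existing))
    print(add_samples(existing, incoming))
```

Every stored metric derived from genotypes would need the same treatment for each variant touched by the new samples. Variants absent from the new VCF would also need a decision on whether the new samples count as homozygous reference or as uncalled.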

That said, it is certainly doable and is a goal for the next few months.

@dgaston we would certainly be interested in a well tested pull request for this!

arq5x avatar Apr 03 '13 14:04 arq5x

I'll take a stab at it. Even though I've been programming as a bioinformatician for years I don't claim to be the best software designer :)

One thought, although it requires many more changes, would be to change the existing SQLite database structure. I am assuming (based on the above statement) that it is indexed on location alone, which is why handling new alleles at a locus is problematic?

When I was looking at creating a local sqlite3 database to store all of our existing variant data from our exome projects I was planning on indexing on chrom, start, end, ref, alt columns so that the database was unique for specific variants and not genomic position. But that may make it difficult to use some of the other tools for rapidly indexing and parsing through based on genomic locations.
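A minimal sketch of that variant-keyed layout using Python's built-in sqlite3 module. The table and function names are hypothetical and this is not gemini's actual schema. It assumes half-open coordinates (start inclusive, end exclusive) for the overlap query.

```python
import sqlite3


def create_variant_table(conn):
    """Create a table whose rows are unique per variant, not per position."""
    conn.execute(
        '''CREATE TABLE IF NOT EXISTS variants (
               variant_id INTEGER PRIMARY KEY,
               chrom TEXT NOT NULL,
               "start" INTEGER NOT NULL,
               "end" INTEGER NOT NULL,
               ref TEXT NOT NULL,
               alt TEXT NOT NULL,
               UNIQUE (chrom, "start", "end", ref, alt)
           )'''
    )


def add_variant(conn, chrom, start, end, ref, alt):
    """Insert a variant; an exact duplicate is silently skipped."""
    conn.execute(
        'INSERT OR IGNORE INTO variants (chrom, "start", "end", ref, alt) '
        "VALUES (?, ?, ?, ?, ?)",
        (chrom, start, end, ref, alt),
    )


def variants_in_region(conn, chrom, start, end):
    """Return all variants overlapping the half-open region [start, end)."""
    rows = conn.execute(
        'SELECT chrom, "start", "end", ref, alt FROM variants '
        'WHERE chrom = ? AND "start" < ? AND "end" > ? '
        'ORDER BY "start", ref, alt',
        (chrom, end, start),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_variant_table(conn)
    add_variant(conn, "chr1", 100, 101, "A", "G")
    add_variant(conn, "chr1", 100, 101, "A", "T")  # second allele, same locus
    add_variant(conn, "chr1", 100, 101, "A", "G")  # duplicate, ignored
    print(variants_in_region(conn, "chr1", 90, 110))
```

Because the UNIQUE constraint's implicit index leads with chrom and start, SQLite can also use it for positional range lookups, so keying on the full variant need not rule out location-based queries.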

dgaston avatar Apr 03 '13 14:04 dgaston

I would love to help but my Python is not up to the task... I program mainly in Perl... I know, I know... time for a refresher course! :)

duartemolha avatar Apr 03 '13 14:04 duartemolha

The load function still does not add a VCF to an existing database!

kirannarta avatar Feb 22 '15 04:02 kirannarta

I wonder if there is any update on this feature?

sa9 avatar Nov 05 '16 16:11 sa9