pytabix
pytabix + multiprocessing?
I'm having some issues using pytabix with the Python multiprocessing module: it seems to cause memory corruption of some sort, along with messages like
[get_intv] the following line cannot be parsed and skipped: S=2546446;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000080001000014000100;WGT=1;VC=SNV;INT;KGPhase1;KGPROD;CAF=[0.9991,0.0009183];COMMON=0
The same file causes no problems when I'm not using multiprocessing.
Do you know anything about this? I looked for a `close` method on tabix file objects but couldn't find one. Could there be an issue with too many file descriptors open on the same file, or something like that?
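(Editorial aside, not from the thread: the symptom is consistent with a descriptor being shared across forked workers. On Unix, a file opened *before* the pool is created is inherited by every forked process, and all of them share a single seek offset. A minimal sketch with a plain file, nothing pytabix-specific; the scratch file and helper names are made up for illustration:)

```python
import multiprocessing
import os
import tempfile

# Write 100 distinguishable bytes to a scratch file.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(100)))
os.close(fd)

# Opened BEFORE forking, unbuffered: forked workers inherit the
# same open file description, so all of them share one seek offset.
shared = open(path, "rb", buffering=0)

def read_chunk(_):
    # Every call, in every worker process, advances the shared offset.
    return shared.read(10)

# Use the fork start method explicitly so the descriptor really is
# inherited (the spawn method would re-import and re-open instead).
ctx = multiprocessing.get_context("fork")
with ctx.Pool(4) as pool:
    chunks = pool.map(read_chunk, range(10))

shared.close()
os.remove(path)
```

Because the offset is shared, the ten reads land on ten *disjoint* regions of the file; no worker ever sees the file from the start twice. A C extension that keeps richer per-handle state (buffers, decompression state, as tabix does) would plausibly get corrupted rather than merely misplaced.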
I haven't tried to use pytabix with multiprocessing and I don't know why this problem arises.
Could you please share a code snippet that reproduces the error?
I'm having a bit of difficulty capturing the full error (with the segfault/memory corruption), so maybe that aspect isn't even tabix's fault. However, the following code:
```python
import multiprocessing
import tabix

dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('13', 24008000, 24009000)))
    return z

use_processes = True  # adjustable to test for error

if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
```
prints nothing (as expected) if `use_processes` is `False`, but if you set it to `True` you get:
[get_intv] the following line cannot be parsed and skipped: S=24020770;dbSNPBuildID=135;SSR=0;SAO=0;VP=0x050000000001100014000100;WGT=1;VC=SNV;KGPhase1;KGPROD;CAF=[0.9853,0.01469];COMMON=1
repeated some number of times (the number varies).
This is just a VCF downloaded from the dbSNP FTP server.
I can't reproduce the error when I run the code below, using the GTF that I provide with pytabix.
This leads me to believe that your VCF file might be the problem...
```python
import multiprocessing
import tabix

dbsnp = tabix.open("test/example.gtf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr2', 20000, 30000)))
    return z

use_processes = True  # adjustable to test for error

if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
```
I do not think it is the contents of the VCF, since it works fine without multiprocessing. However, I imagine it might be the size: the VCF is something like 1.2 GB. A query on `example.gtf.gz` takes no time at all, while each query call on my VCF takes on the order of 2-3 seconds. Maybe if the query is very fast, the processes don't have time to interfere, but if it takes longer, they occasionally can?
Anyway, I have no idea what's going on here :(
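(Editorial aside: one way to probe the interference hypothesis is to keep the single parent-opened handle but serialize access to it with a `multiprocessing.Lock`; if the errors disappear, the workers really were stepping on shared state. A self-contained sketch where a plain file read stands in for a tabix query; the file contents and names are invented for illustration:)

```python
import multiprocessing
import os
import tempfile

# Scratch file standing in for the indexed VCF.
fd, path = tempfile.mkstemp()
os.write(fd, b"13\t24008000\trs0\n")
os.close(fd)

# One handle opened in the parent, inherited by every forked worker.
shared = open(path, "rb", buffering=0)
# A lock inherited the same way, serializing all access to the handle.
lock = multiprocessing.Lock()

def query_region(_):
    # Holding the lock makes seek + read one atomic step, so workers
    # cannot interleave their operations on the shared offset.
    with lock:
        shared.seek(0)
        return shared.read(16)

ctx = multiprocessing.get_context("fork")
with ctx.Pool(4) as pool:
    results = pool.map(query_region, range(20))

shared.close()
os.remove(path)
```

The cost is that queries no longer run in parallel, so this is a diagnostic rather than a fix.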
You might be right about timing, but I'm not sure. It would be worth reading the literature about using C extensions with multiprocessing. I skimmed a few Google results but didn't find anything relevant.
- Does the code below produce the same error? I moved `dbsnp` inside `query_region()`.
- Do you get an error if you run your code on a smaller file? You might take a few lines from your 1.2 GB file. I tried a ~100 MB file and could not reproduce the error.
```python
import multiprocessing
import tabix

def query_region(*args):
    dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr1', 200000, 300000)))
    return z

use_processes = True  # adjustable to test for error

if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
```
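(Editorial aside: opening the file inside `query_region()` avoids any sharing but pays the open cost on every call. A middle ground, not proposed in the thread, is a pool `initializer` that opens one private handle per worker process. A self-contained sketch where a plain `open()` stands in for `tabix.open()`; `init_worker` and the scratch file are invented for illustration:)

```python
import multiprocessing
import os
import tempfile

# Scratch file standing in for the bgzipped, indexed VCF.
fd, path = tempfile.mkstemp()
os.write(fd, b"13\t24008000\trs0\n")
os.close(fd)

# Filled in by the pool initializer: one private handle per worker.
handle = None

def init_worker(p):
    # In the real code this line would be: handle = tabix.open(p)
    global handle
    handle = open(p, "rb")

def query_region(_):
    # Each worker seeks and reads on its own descriptor, so there is
    # no shared state to corrupt and no per-call open() overhead.
    handle.seek(0)
    return handle.readline()

ctx = multiprocessing.get_context("fork")
with ctx.Pool(4, initializer=init_worker, initargs=(path,)) as pool:
    results = pool.map(query_region, range(20))

os.remove(path)
```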
I am adding to this issue because I think it is pertinent. I access tabix files in a program that runs under MPI on a 16-core HPC cluster (so one tabix file is accessed by 16 cores simultaneously). Sometimes it works fine; other times I have ended up with a 974-line traceback that involves pytabix and suggests a double free is going on. So I did some defensive programming, and when the error occurs, I get `None` records back from the tabix iterator.
Often when it works, the cluster is only lightly used, and it fails more often when the cluster is working harder. Are tabix / pytabix supposed to work in multiuser environments?
Could I ask you to share a code snippet that reproduces the error? Could you share the error, too?
Hi Kamil,
The code is part of a massive framework, but I will see what I can do. I certainly can get you the error message. It will probably be after Tuesday.
Kind Regards,
Mark Livingstone
PhD Candidate G23_2.31 Institute for Integrated and Intelligent Systems School of Information and Communication Technology Griffith University, Gold Coast campus Queensland, 4222, Australia
E-mail: [email protected]