
pytabix + multiprocessing?

gibiansky opened this issue 10 years ago · 8 comments

I'm having some issues using pytabix with the Python multiprocessing module; it somehow seems to result in memory corruption of some sort, with messages along the lines of:

[get_intv] the following line cannot be parsed and skipped: S=2546446;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000080001000014000100;WGT=1;VC=SNV;INT;KGPhase1;KGPROD;CAF=[0.9991,0.0009183];COMMON=0

The same file does not cause any problems when I'm not using multiprocessing.

Do you know anything about this? I looked for a close method on tabix file objects, but couldn't find one - could there be an issue with too many file descriptors to the same file, or something?

gibiansky avatar Jun 18 '14 15:06 gibiansky

I haven't tried to use pytabix with multiprocessing and I don't know why this problem arises.

Could you please share a code snippet that reproduces the error?

slowkow avatar Jun 18 '14 15:06 slowkow

I'm having a bit of difficulty getting the full error (with the segfault/memory corruption), so maybe that aspect isn't even tabix's fault. However, the following code

import multiprocessing
import tabix

dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('13', 24008000, 24009000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)

prints nothing (as expected) if use_processes is False, but if you set it to True you get:

[get_intv] the following line cannot be parsed and skipped: S=24020770;dbSNPBuildID=135;SSR=0;SAO=0;VP=0x050000000001100014000100;WGT=1;VC=SNV;KGPhase1;KGPROD;CAF=[0.9853,0.01469];COMMON=1

repeated some number of times (the number varies).

This is just a VCF downloaded from the dbSNP FTP server.

gibiansky avatar Jun 18 '14 15:06 gibiansky

I can't reproduce the error when I run the code below, using the GTF that I provide with pytabix.

This leads me to believe that your VCF file might be the problem...

import multiprocessing
import tabix

dbsnp = tabix.open("test/example.gtf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr2', 20000, 30000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)

slowkow avatar Jun 18 '14 16:06 slowkow

I do not think it is the contents of the VCF, as it works fine without multiprocessing. However, I imagine it might be the size - the VCF is something like 1.2 GB. A query on example.gtf.gz takes no time at all, while on my VCF, it takes on the order of 2-3 seconds for each query call.

Maybe if you have a very fast query, the processes don't have time to interfere, but if you have a longer one, they can do so occasionally?

Anyway, I have no idea what's going on here :(

gibiansky avatar Jun 18 '14 16:06 gibiansky

You might be right about timing, but I'm not sure. It would be worth reading the literature about using C extensions with multiprocessing. I skimmed a few Google results but didn't find anything relevant.

  1. Does the code below produce the same error? I moved dbsnp inside query_region().
  2. Do you get an error if you run your code on a smaller file? You might take a few lines from your 1.2 GB file. I tried to use a ~100MB file and could not reproduce the error.
import multiprocessing
import tabix

def query_region(*args):
    dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr1', 200000, 300000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)

slowkow avatar Jun 18 '14 17:06 slowkow

I am adding to this issue because I think it is pertinent. I access tabix files in a program which runs under MPI on a 16-core HPC cluster (so one tabix file is accessed by 16 cores simultaneously). Sometimes it works fine; other times I have ended up with a 974-line traceback which involves pytabix and suggests that double freeing is going on. So I did some defensive programming, and when the error occurs, I get None records returned from the tabix iterator.

Often it works when the cluster is only lightly used, and it fails more often when the cluster is working harder. Is tabix / pytabix supposed to work in multiuser environments like this?
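The defensive check might look like the retry sketch below; fetch_records and flaky_query are hypothetical stand-ins for the real tabix calls, not pytabix API:

```python
def fetch_records(query_fn, retries=3):
    # Re-run a query whose iterator yielded None records, as can
    # happen when the underlying C library misbehaves under load.
    for _ in range(retries):
        records = list(query_fn())
        if all(r is not None for r in records):
            return records
    raise RuntimeError("query kept returning None records")

# Simulated flaky query: returns a None record twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        return [None, ("chr1", 100)]
    return [("chr1", 100), ("chr1", 200)]

result = fetch_records(flaky_query)
print(result)  # [('chr1', 100), ('chr1', 200)]
```

This only papers over the symptom, of course; if the None records come from a shared handle, opening the file per process is the real fix.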

marklivingstone avatar Sep 22 '15 04:09 marklivingstone

Could I ask you to share a code snippet that reproduces the error? Could you share the error, too?

slowkow avatar Sep 22 '15 11:09 slowkow

Hi Kamil,

The code is part of a massive framework, but I will see what I can do. I certainly can get you the error message. It will probably be after Tuesday.

Kind Regards,

Mark Livingstone

PhD Candidate G23_2.31 Institute for Integrated and Intelligent Systems School of Information and Communication Technology Griffith University, Gold Coast campus Queensland, 4222, Australia

E-mail: [email protected]

marklivingstone avatar Sep 25 '15 02:09 marklivingstone