Feature request: multithreaded `pheweb matrix`
Would it be possible to perform the matrix processing stage on a per-chromosome basis to allow distribution of the work?
Thanks!
There are two ways to deal with the single-threaded time taken by `pheweb matrix`:
(a) Parallelize `pheweb matrix`.
(b) Instead of running `pheweb process`, run
pheweb phenolist verify &&
pheweb parse-input-files &&
pheweb sites &&
pheweb make-gene-aliases-trie &&
pheweb add-rsids &&
pheweb add-genes &&
pheweb make-tries &&
pheweb augment-phenos
and then, in parallel, run both `pheweb matrix` and
pheweb manhattan &&
pheweb qq &&
pheweb bgzip-phenos &&
pheweb top-hits &&
pheweb phenotypes &&
pheweb gather-pvalues-for-each-gene
If the single-threaded `pheweb matrix` finishes before the other chain does, this approach is likely faster.
I'm not sure how to implement this option well. If pheweb used multiprocessing to run `pheweb matrix` in parallel with the other commands, debugging and terminal output would be harder. Perhaps I could add an option `pheweb process --matrix-in-parallel`, in which case the `pheweb matrix` output would be redirected to a log file?
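For what it's worth, here's a rough sketch of what that hypothetical `--matrix-in-parallel` could do internally. The flag doesn't exist; `subprocess` here stands in for whatever pheweb would actually use, and `matrix.log` is an illustrative filename:

```python
# Sketch of a hypothetical --matrix-in-parallel: run `pheweb matrix` in
# the background with its output captured in a log file, run the other
# steps in the foreground, then wait for matrix to finish.
import subprocess

log = open('matrix.log', 'w')
matrix = subprocess.Popen(['pheweb', 'matrix'],
                          stdout=log, stderr=subprocess.STDOUT)

for step in ('manhattan', 'qq', 'bgzip-phenos', 'top-hits',
             'phenotypes', 'gather-pvalues-for-each-gene'):
    subprocess.run(['pheweb', step], check=True)

if matrix.wait() != 0:
    raise RuntimeError('`pheweb matrix` failed; see matrix.log')
log.close()
```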
Simply because of the debugging complexity, I like option (a), as you proposed.
Currently, pheweb makes the matrix from the uncompressed per-phenotype annotated files (`pheno/*` in https://github.com/statgen/pheweb/blob/master/etc/detailed-internal-dataflow.md). Since these are uncompressed, they're not indexed. To read a single chromosome from an individual phenotype file for use in `pheweb matrix`, either:
- Each reader does a bisect to find the line at which to start reading
- Read directly from the tabixed files (which might be slow)
What do you think is the right approach? Are you interested in implementing this feature?
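For the second bullet, here's a minimal sketch of what reading one chromosome through a tabix index could look like with pysam. The path is illustrative, not pheweb's actual layout:

```python
# Sketch of option 2: fetch one chromosome's rows from a
# bgzipped-and-tabix-indexed file.
import pysam

def rows_for_chrom(path, chrom):
    """Yield one chromosome's rows (as lists of fields) via the tabix index."""
    tabix = pysam.TabixFile(path)
    try:
        for row in tabix.fetch(chrom):   # each row is a tab-joined str
            yield row.split('\t')
    finally:
        tabix.close()

# e.g.: for fields in rows_for_chrom('pheno.tsv.gz', '12'): ...
```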
I'm not sure what the best route is; I'm still getting a feel for how this whole process works :) Unfortunately bandwidth limits will keep me in a user role for the time being.
I'm willing to try option (b) in the near future - I didn't realize stages could be run concurrently.
I'm currently processing a set of 12M SNPs and 2000 phenotypes; early calculations indicate about 100 hours of processing for the `pheweb matrix` stage. It's running one core at ~35% load and should be done in 3.5 days. How long would you anticipate the remaining steps will take?
How many cores on the machine?
`pheweb process -h` will show all the steps it runs. The only heavy one remaining is `pheweb gather-pvalues-for-each-gene`, which is 4x faster than `pheweb matrix` with 40 cores. I've never run a large dataset with fewer, but hopefully it works quickly. Once it starts, it should give a good estimate.
Do you predict runtime with something like `gzip -cd .generated-by-pheweb/tmp/matrix.tsv.gz | wc -l`?
I'm working with 32 cores. Good to know about `gather-pvalues-for-each-gene` - I assume that one is multithreaded?
Yeah, I estimated runtime based on time to generate output in `matrix.tsv.gz`.
Yep, it's multithreaded.
Hi, does anyone in this thread know of a branch or fork that has implemented a parallelized `pheweb matrix`?
@hanayik I'm not aware of such a fork. I'd love to have a parallel matrix feature and I can help if you'd like to work on it. Changes would be needed in https://github.com/statgen/pheweb/blob/master/pheweb/load/matrix.py and https://github.com/statgen/pheweb/blob/master/pheweb/load/cffi/x.cpp.
It seems like most of the time goes to gzipping the output. I just changed the gzip compression level from 6 to 2, which will hopefully be much (3x?) faster. But there's still a lot to gain from parallelization.
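If anyone wants to sanity-check that estimate on their own data, here's a quick level-6-vs-level-2 timing with Python's zlib. Any large TSV will do as the input sample; results will vary by machine and data:

```python
# Rough benchmark of gzip compression levels 6 vs 2 on a data sample.
import time
import zlib

with open('generated-by-pheweb/sites/sites.tsv', 'rb') as f:  # any big file
    data = f.read(50 * 2**20)                                 # 50 MiB sample

for level in (6, 2):
    t0 = time.time()
    out = zlib.compress(data, level)
    print('level {}: {:.1f}s, ratio {:.2f}'.format(
        level, time.time() - t0, len(out) / len(data)))
```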
Some considerations:
- Parallelization will have to happen by genomic region, but which regions should be used? To start with, I think each process should handle one chromosome. That would be a 10x speedup. Later, this could be changed to ranges covering `num_variants / config.num_cpus` variants each, according to `generated-by-pheweb/sites/sites.tsv`. Even when just parallelizing by chromosome, it's probably worth having the master process scan through `generated-by-pheweb/sites/sites.tsv` to check which chromosomes occur.
- Parallelization should happen in Python, and each multiprocessing-spawned process should call C++ through cffi.
- Each chromosome-handling child process will need to find the byte offset in `generated-by-pheweb/sites/sites.tsv` and in each `generated-by-pheweb/pheno/*` file at which to start reading its chromosome. To start, I think Python should just do a bisect search on each file. That's easy since every row begins with chrom, pos, ref, alt.
- Each process will write a file like `generated-by-pheweb/tmp/matrix-chr12.tsv.gz` (with no bgzip end-of-file empty block), and then the master process will concatenate these and append a single bgzip end-of-file empty block to make `matrix.tsv.gz`. A rough end-to-end sketch of this plan follows below.
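To make the plan concrete, here's a rough Python sketch of that skeleton: the master's chromosome scan, the byte-offset bisect, a per-chromosome worker, and the final concatenation. It's a toy, not pheweb's actual code; the paths, the chromosome ordering, and the worker body are all assumptions. For simplicity the worker just copies rows from `sites.tsv` and writes plain gzip members (which concatenate into a valid gzip stream); the real worker would merge the `pheno/*` files through the cffi/C++ writer and emit bgzip blocks:

```python
# matrix_parallel_sketch.py -- a rough sketch of the plan above.
import gzip
import multiprocessing
import os

SITES = 'generated-by-pheweb/sites/sites.tsv'   # assumed layout
TMP = 'generated-by-pheweb/tmp'
# Assumed sort order of chromosomes in pheweb's files (no header row).
CHROM_ORDER = {c: i for i, c in enumerate(
    [str(i) for i in range(1, 23)] + ['X', 'Y', 'MT'])}

def chrom_start_offset(path, chrom):
    """Byte offset of `chrom`'s first row, found by bisecting on byte
    offsets.  This works because rows are sorted and each begins with
    chrom<TAB>pos<TAB>ref<TAB>alt: after seeking anywhere, skip the
    (possibly partial) current line and inspect the next full row.
    Returns None if the chromosome is absent from this file."""
    target = CHROM_ORDER[chrom]
    with open(path, 'rb') as f:
        def rank_at(off):
            f.seek(off)
            if off > 0:
                f.readline()                    # discard partial line
            line = f.readline()
            if not line:                        # past EOF sorts last
                return len(CHROM_ORDER)
            return CHROM_ORDER.get(line.split(b'\t', 1)[0].decode(),
                                   len(CHROM_ORDER))
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:          # smallest offset whose next row has rank >= target
            mid = (lo + hi) // 2
            if rank_at(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        f.seek(lo)
        if lo > 0:
            f.readline()
        start = f.tell()
        line = f.readline()
        if line and line.split(b'\t', 1)[0] == chrom.encode():
            return start
        return None

def chroms_in(path):
    """The master's scan: which chromosomes actually occur in sites.tsv."""
    seen = []
    with open(path, 'rb') as f:
        for line in f:
            c = line.split(b'\t', 1)[0].decode()
            if not seen or seen[-1] != c:
                seen.append(c)
    return seen

def write_chrom_part(chrom):
    """Worker: write one chromosome's rows to its own part file."""
    start = chrom_start_offset(SITES, chrom)
    part = os.path.join(TMP, 'matrix-chr{}.tsv.gz'.format(chrom))
    with open(SITES, 'rb') as fin, \
         gzip.open(part, 'wb', compresslevel=2) as fout:
        fin.seek(start)
        for line in fin:
            if line.split(b'\t', 1)[0] != chrom.encode():
                break                           # reached the next chromosome
            fout.write(line)
    return part

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        parts = pool.map(write_chrom_part, chroms_in(SITES))
    # Concatenated gzip members form a valid gzip stream; with real bgzip
    # parts you would instead append one empty end-of-file block here.
    with open(os.path.join(TMP, 'matrix.tsv.gz'), 'wb') as out:
        for part in parts:
            with open(part, 'rb') as fin:
                out.write(fin.read())
```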
Looks not so bad. I have 1262 phenotypes and, with a single thread, `pheweb matrix` only takes 62880 seconds (about 17.5 hours). I have a question: is there any way to speed up the `pheweb bgzip-phenos` step?
==> Starting `pheweb phenolist verify`
The 1262 phenotypes in '/projects/ps-janssen4/dsci-csb/user/sguo2/pheweb/pheno-list.json' look good.
==> Completed in 0 seconds
==> Starting `pheweb sites`
The list of sites is up-to-date!
==> Completed in 1 seconds
==> Starting `pheweb make-gene-aliases-trie`
gene aliases are at '/home/sguo2/.pheweb/cache/gene_aliases-v29-hg19.marisa_trie'
==> Completed in 0 seconds
==> Starting `pheweb add-rsids`
rsid annotation is up-to-date!
==> Completed in 0 seconds
==> Starting `pheweb add-genes`
gene annotation is up-to-date!
==> Completed in 0 seconds
==> Starting `pheweb make-tries`
tries are up-to-date!
==> Completed in 0 seconds
==> Starting `pheweb augment-phenos`
Output files are all newer than input files, so there's nothing to do.
==> Completed in 1 seconds
==> Starting `pheweb manhattan`
Output files are all newer than input files, so there's nothing to do.
==> Completed in 1 seconds
==> Starting `pheweb qq`
Output files are all newer than input files, so there's nothing to do.
==> Completed in 2 seconds
==> Starting `pheweb matrix`
==> Completed in 62880 seconds
==> Starting `pheweb bgzip-phenos`
Processing 1262 phenos
@Shicheng-Guo Yeah, if your system is CPU-limited (rather than IO-limited), then reducing the gzip compression level will probably make `pheweb bgzip-phenos` 2x-3x faster. Unfortunately, I don't think pysam lets you choose the gzip compression level. It looks easy to switch from `pysam.tabix_compress` to the code in `load/cffi/x.cpp`, but I personally am not going to do it, because I only run pheweb on machines with plenty of CPU. If you'd like to work on this, please start a new GitHub issue and I'll be happy to give advice.
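For anyone who picks this up: one possible shortcut, assuming Biopython is acceptable as a dependency, is its `Bio.bgzf` writer, which produces tabix-compatible bgzip and exposes a `compresslevel` argument. Untested with pheweb; treat it as a sketch:

```python
# Sketch: re-compress a file as bgzip at a chosen compression level
# using Biopython's bgzf module instead of pysam.tabix_compress.
from Bio import bgzf

def bgzip_file(in_path, out_path, level=2):
    """Write `in_path` as bgzip-compressed `out_path` at gzip level `level`."""
    writer = bgzf.BgzfWriter(out_path, 'wb', compresslevel=level)
    with open(in_path, 'rb') as fin:
        for chunk in iter(lambda: fin.read(2**20), b''):  # 1 MiB at a time
            writer.write(chunk)
    writer.close()  # also writes the bgzip end-of-file marker
```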