
Compute LD-scores for each SNP bin using .bcor files

Open · JoniColeman opened this issue 3 years ago · 16 comments

Hi,

I am interested in creating priors with PolyFun using approach 3, and I would like to pass estimates of LD from my own data, stored in .bcor files. Is this possible? The --ld-ukb flag does something similar, but seems to be set up specifically for UK Biobank. I can see how to do the fine-mapping step with my .bcor LD estimates, but I suspect accurate LD estimation will also be very important when creating the priors.

Thanks,

Joni

JoniColeman avatar Jan 06 '21 16:01 JoniColeman

Hi Joni,

I can add support for .bcor files but I need to think about what the interface should look like. Can you please explain how the data is arranged in your case? e.g. do you have a single chromosome-wide bcor file, or a bunch of bcor files that cover an entire chromosome?

omerwe avatar Jan 07 '21 07:01 omerwe

The latter, which may not be ideal. I have a .bcor for each 3Mb segment in the genome, akin to what you've generated for UK Biobank (although in a different format obviously)

JoniColeman avatar Jan 07 '21 17:01 JoniColeman

Hi, I added support for computing LD-scores from an arbitrary set of input .bcor files. Please see the end of the updated wiki page for details, and let me know if it works for you!

Cheers,

Omer

omerwe avatar Jan 08 '21 12:01 omerwe

This seems to work for generating LD scores - thank you!

JoniColeman avatar Jan 28 '21 16:01 JoniColeman

Hi, sorry to reopen this one, but I've run into a memory issue with the script. It works fine for most of my data, but running on chromosomes 1-4 takes more than 5 days on a 90 GB node. The script itself runs fine; it's just that reading in the .bcor files takes too long.

I have a couple of solutions that I'd like to explore:

1. Conversion of .bcor to .npz files: the UK Biobank default in PolyFun seems to run efficiently, so I'm wondering if converting the .bcor files from LDStore into .npz files might work. Would the script be adaptable to running with .npz files rather than .bcor files?

2. Splitting the chromosomes at the centromere: there shouldn't be long-range LD passing across the centromere, so I think the LD-score calculation could be run for half-chromosomes this way. Would it be feasible to create LD-score files per half-chromosome and then combine them post hoc?

Do either of these solutions sound feasible? If so, would it be possible to alter the script to enable them? Thanks again - sorry to keep asking you to change your software to fit my needs!

JoniColeman avatar Feb 17 '21 14:02 JoniColeman
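
A minimal sketch of the conversion idea in option 1, assuming the bcor reader shipped in the polyfun repo (its readCorr() method is mentioned later in this thread; the import path, the constructor, and whether PolyFun's .npz loader expects exactly this layout are all assumptions):

import numpy as np
import scipy.sparse as sparse
from ldstore.bcor import bcor  # assumed import path within the polyfun repo

def bcor_to_npz(bcor_file, npz_file):
    # Read the full LD (correlation) matrix; readCorr([]) with an empty
    # list is assumed to request all SNPs, per the usage described below.
    R = bcor(bcor_file).readCorr([])
    R = R.astype(np.float32)  # halve the memory footprint vs float64
    sparse.save_npz(npz_file, sparse.csc_matrix(R))  # compressed on-disk cache

Note that a dense correlation matrix only benefits from a sparse container if small entries are truncated to zero first; np.savez_compressed on the dense array is an equally plausible cache format.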

Hi Joni,

Not a problem, but I'm surprised that loading the .bcor files is the memory bottleneck. .bcor files are packed pretty efficiently; I don't think they have any significant disadvantage vs .npz files. Could it be that you simply have more SNPs per LD region than in our UKB data? We applied a MAF>0.1% filter, which left us with ~20K SNPs per region. If that's the case, two simple solutions are (1) apply a more stringent SNP filter; or (2) use smaller regions (i.e. <3Mb long). The first option in particular should be easy to try out --- you won't have to recompute the LD; we can just apply a filter that ignores the SNPs that don't pass the stricter threshold. What do you think?

omerwe avatar Feb 17 '21 18:02 omerwe
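
A hypothetical sketch of option (1), restricting an already-loaded LD matrix to SNPs that pass a stricter MAF threshold so that no LD needs recomputing (function and variable names are illustrative, not PolyFun's API):

import numpy as np

def filter_ld_by_maf(R, maf, threshold=0.01):
    # R   : (n, n) LD correlation matrix
    # maf : length-n array of minor allele frequencies, same SNP order as R
    keep = np.flatnonzero(maf >= threshold)
    return R[np.ix_(keep, keep)], keep  # filtered submatrix + kept indices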

OK - I was also surprised at how long the read-in took. It may be that there is something wrong with the way I am running the script, so I'll explore further and get back to you. I don't have especially high SNP density in any region (the max is about 15K SNPs), so I don't think that should be an issue. Thanks!

JoniColeman avatar Feb 18 '21 08:02 JoniColeman

An update: there seems to be a general problem with reading .bcor files. I was previously using a custom-made .bcor, so I wasn't sure whether my own code had messed it up, but even with a .bcor straight from LDStore it takes >35 minutes to read into PolyFun. This is true both for the LD-score script and for finemapper.py, which then also shows a subsequent SuSiE error.

/home/coleman/Tools/polyfun/finemapper.py \
--chr 21 \
--start 19000001 \
--end 22000001 \
--out FM_Output/daner_mdd_boma.chr21.19000001.22000001.gz \
--ld mdd_boma_eur_sr-qc.hg19.ch.fl.bgn.bgen.21.19000001.22000001.bcor \
--method susie \
--sumstats Priors_Output/daner_mdd_boma.21.snpvar_ridge_constrained.gz \
--memory 90 \
--max-num-causal 5 \
--allow-missing \
--n 1648 \
--verbose
*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019-2021 Omer Weissbrod
*********************************************************************

[INFO]  Loading sumstats file...
[INFO]  Loaded sumstats for 90947 SNPs in 0.61 seconds
[INFO]  Loading LD file mdd_boma_eur_sr-qc.hg19.ch.fl.bgn.bgen.21.19000001.22000001.bcor
[INFO]  Done in 2299.05 seconds
[INFO]  Flipping the effect-sign of 4206 SNPs that are flipped compared to the LD panel
[INFO]  Starting functionally-informed SuSiE fine-mapping for chromosome 21 BP 19000001-22000001 (8810 SNPs)
[WARNING]  R[write to console]: Error in susie_ss(XtX = XtX, Xty = Xty, yty = var_y * (n - 1), n = n,  :
  XtX matrix contains NA.

Traceback (most recent call last):
  File "/home/coleman/Tools/polyfun/finemapper.py", line 733, in finemap
    susie_obj = self.susieR.susie_suff_stat(
AttributeError: module 'susieR' has no attribute 'susie_suff_stat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/coleman/Tools/polyfun/finemapper.py", line 1129, in <module>
    verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
  File "/home/coleman/Tools/polyfun/finemapper.py", line 758, in finemap
    prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
  File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/robjects/functions.py", line 198, in __call__
    .__call__(*args, **kwargs))
  File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/robjects/functions.py", line 125, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
  File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/rinterface_lib/conversion.py", line 44, in _
    cdata = function(*args, **kwargs)
  File "/home/coleman/.local/lib/python3.6/site-packages/rpy2/rinterface.py", line 623, in __call__
    raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in susie_ss(XtX = XtX, Xty = Xty, yty = var_y * (n - 1), n = n,  :
  XtX matrix contains NA.

JoniColeman avatar Feb 18 '21 16:02 JoniColeman

Hi Joni,

Sorry for this. I didn't write the .bcor code (it was written by the author of LDstore), so I don't know it that well. However, looking at the code I found that it always created an LD matrix of 64-bit floats (instead of 32-bit), which might be the source of the memory hog. This can slow down the code considerably if the computer runs out of memory and has to swap to disk. I just applied a very simple fix that changes it back to 32-bit; can you please try that? If it doesn't help, I'll try to speed things up further by rewriting the .bcor code in C instead of Python, but let's try the simplest solution first.

Separately from this, I'm not sure what the source of the error you sent me is, so I updated the code to verify that the LD matrix contains only valid numbers. If the code crashes because it doesn't, that might mean the LD file is corrupt. Can you please try out the updated code and let me know how it goes?

Thanks,

Omer

omerwe avatar Feb 19 '21 08:02 omerwe
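
To make the two fixes above concrete: back-of-the-envelope memory arithmetic for the dtype change, plus a validity check of the kind described (illustrative, not the exact code added to PolyFun):

import numpy as np

# Memory for a square LD matrix over n SNPs:
n = 20_000
print(n * n * 8 / 1e9)  # float64: ~3.2 GB per region
print(n * n * 4 / 1e9)  # float32: ~1.6 GB per region

def check_ld_matrix(R):
    # Fail fast on NaN/Inf values, which would otherwise surface later
    # as the SuSiE "XtX matrix contains NA" error seen earlier in this thread
    if not np.all(np.isfinite(R)):
        raise ValueError('LD matrix contains invalid values; the .bcor file may be corrupt')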

Hi Omer,

Thanks - I'll try the new code and see how it works speed-wise! Don't worry about the SuSiE error - I think that's an issue with the way I've made the .bcor files.

Thanks again!

JoniColeman avatar Feb 19 '21 08:02 JoniColeman

Hi Omer,

Just wanted to add that the time lag is due to the loops in bcor.py. In finemapper.py the code requests the correlations (readCorr([])) for all the SNPs in the file/region, and the n x n matrix is read one element at a time. A simple solution would be to fetch the whole upper or lower triangular matrix at once.

jerome-f avatar Mar 16 '21 00:03 jerome-f

Thanks - the code already reads only the half-diagonal matrix (the .bcor file only stores half of the matrix). One cause of the loading time is that looping in Python is slow. If I find some time I'll try rewriting the code in Cython, which could make it about 10x faster.

omerwe avatar Mar 16 '21 08:03 omerwe
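
For illustration, here is how a symmetric matrix can be rebuilt from stored lower-triangle values in a few vectorized steps instead of a Python-level element-by-element loop (the flat triangle layout assumed here is a simplification; the real .bcor format is more involved):

import numpy as np

def symmetric_from_lower_triangle(vals, n):
    # vals: 1-D array of the n*(n+1)/2 lower-triangle values, in the
    # order given by np.tril_indices(n)
    R = np.zeros((n, n), dtype=np.float32)
    R[np.tril_indices(n)] = vals   # fill the lower triangle in one shot
    R = R + R.T                    # mirror into the upper triangle
    np.fill_diagonal(R, 1.0)       # diagonal was doubled; correlations are 1
    return R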

Hi, just to comment on this one too: I ran into the same issue as @JoniColeman when running the fine-mapper script with the --geno flag (i.e. genotype plink files) rather than the UKB precomputed LD matrices. For certain chromosomes the jobs ran fast and efficiently, but for others they take 4-5 days. I am also wondering whether this has to do with memory, or whether any file conversion could make these jobs run faster. Thank you all!

mkoromina avatar Mar 04 '22 15:03 mkoromina

Hi, I suspect the reason for the slowness is simply that the ldstore code is slow. It could be made faster if it were written in a compiled language such as C instead of Python. A possible workaround would be to rewrite it in Cython, but that is a big undertaking and I don't have the capacity for it right now.

My suggestion is to convert the .bcor to .npz (using the --cache-dir flag), which would at least allow fast loading in future runs on the same LD matrix. If anyone has the capacity to rewrite the ldstore code in Cython, that would be a big improvement, and I'm happy to guide that person. (I did this when I worked on the PolyFun paper, for a different version of the bcor format, but the new format is quite different and I don't have the capacity to go back to this...)

omerwe avatar Mar 05 '22 18:03 omerwe
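
For example, reusing the finemapper.py invocation from earlier in this thread with caching enabled (the LD_cache directory name is a placeholder):

python finemapper.py \
--chr 21 \
--start 19000001 \
--end 22000001 \
--ld mdd_boma_eur_sr-qc.hg19.ch.fl.bgn.bgen.21.19000001.22000001.bcor \
--cache-dir LD_cache \
--method susie \
--sumstats Priors_Output/daner_mdd_boma.21.snpvar_ridge_constrained.gz \
--out FM_Output/daner_mdd_boma.chr21.19000001.22000001.gz \
--n 1648 \
--max-num-causal 5 \
--allow-missing

The first run still pays the slow .bcor read, but subsequent runs on the same region load the cached matrix instead.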

@omerwe and @jdblischak I found that calling ldstore2 to read the .bcor file and write the LD matrix to a text file, then reading that back via np.loadtxt(), takes far less time than bcor.readCorr(). I can create a PR to implement this routine.

For 20k x 20k SNPs:

ldstore2 --bcor-to-text takes about 3 min 20 sec, and np.loadtxt() takes 40-42 sec.

readCorr(), on the other hand, takes >25 mins.

The only caveat is that we need 5-6x more temporary disk space to store the text file briefly before converting it to .npz.

jerome-f avatar Aug 03 '22 04:08 jerome-f
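
A sketch of the second half of that workaround: load the ldstore2 text dump and cache it as .npz for reuse (only the --bcor-to-text flag is confirmed above; the rest of the ldstore2 invocation is omitted, and the function name is illustrative):

import numpy as np

def text_ld_to_npz(txt_file, npz_file):
    # Load the matrix that `ldstore2 --bcor-to-text ...` dumped as text
    # (~40s for 20k x 20k per the timings above), then cache it compressed
    # so the 5-6x larger temporary text file can be deleted.
    R = np.loadtxt(txt_file, dtype=np.float32)
    np.savez_compressed(npz_file, R=R)
    return R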

@jerome-f wow, that's crazy. I'm happy to accept a PR for this!

omerwe avatar Aug 03 '22 07:08 omerwe