methylpy icon indicating copy to clipboard operation
methylpy copied to clipboard

Suggestions about methylpy

Open lhqing opened this issue 5 years ago • 2 comments

Hi, Yupeng,

Some suggestions of methylpy based on my experience. I can work on improve these (since I already did these here https://github.com/lhqing/cemba_data/blob/master/cemba_data/tools/). But I'd like to know your opinions:

  1. chromosome names: I suggest always keep the chromosome name same as genome fasta, i.e. keep "chr" by default for genomes like mm10 and hg19, or never change chromosome names. I found this will decrease conflits in a lot cases.

  2. chromosome orders: Right now the order of chrom in ALLC is based on bam file, and there is no paticular "sorted or not" check for ALLC file. What's worse is in the merge_allc_files_worker function, the order of chrom may be random, because dict.keys() is not ordered. https://github.com/yupenghe/methylpy/blob/methylpy/methylpy/utilities.py#L543

  3. Use bgzip/tabix:

    1. Much faster then gzip.open() + f.seek
    2. allow more even parallel in funcs like merge ALLC (instead of chromosome level using current index, tabix allow query ALLC in any bin level) I known this need a lot changes in all functions, but I found it worthwhile. I implemented the open_allc and merge_allc function based on bgzip/tabix which can be several times faster then gzip + f.seek strategy using old index. I tend to fully drop the old index strategy.
  4. More universal map_to_region function: which is very important for single cell, my implementation here: map_to_region Given an ALLC file, this function count mC and cov for a list of regions, e.g. chrom 100kb bins, gene bodies, I use bedtools map internally, which is very fast.

lhqing avatar Mar 25 '19 02:03 lhqing

Thanks for the suggestions. Here are my thoughts.

  1. I totally agree with you and removing "chr" prefix did cause a lot of confusion.
  2. Not sure if this is important especially when point 3 is implemented as long as the position is sorted within each chromosome.
  3. Agree. I haven't test tabix yet but it could totally be a better strategy.
  4. Agree. Coverage and count are also important stats and should be reported.

I also have some thoughts that would like your comments.

  1. allc file should be separated into two files: one for CG sites and one for non-CG sites, where the generation of the latter could be optional. Most projects only care about CG sites which are a very small subset of all sites.
  2. Since pandas is extremely popular and tall data format is very powerful, I am thinking of changing some file formats (e.g. methylation level of a list of regions in a list of samples) into the tall data format.
  3. Reimplement the RMS test with python. Installing GSL library seems a bottleneck for methylpy installation.

yupenghe avatar Oct 05 '19 18:10 yupenghe

Hi Yupeng, sorry I've missed this for a while...

  1. I agree, and this is not hard to implement, but this is quite a dramatic change, I guess shouldn't make this as default option.

  2. Totally agree, and not only tall data, but saving data in the plain text file also impact performance, especially when having >100 samples (for single-cell data, we can easily have that), the DMS table is huge, we could use some custom tall data format store in HDF5.

  3. This will be great, cython can probably do the same job.

In addition, actually I've implemented quite some ALLC related functions in another repo here https://github.com/lhqing/ALLCools, which basically implement everything related to ALLC handling using tabix/bgzip, add added some code for single-cell related stuff. I am willing to incorporate that into methylpy.

I think I will have time on this after next Feb 2020

lhqing avatar Nov 30 '19 08:11 lhqing