methylpy
Running RMS tests failed.
Hello, I recently wanted to use methylpy to call DMRs. I converted my methylation data into the allc file format:
8 7524770 + CTCGC 15 15 1
8 7524782 + GGCGC 19 19 1
8 7524784 + CGCGA 20 20 1
8 7524822 + GCCGC 21 21 1
8 7524826 + CACGC 21 21 1
8 7524867 + AACGC 21 21 1
My methylation file contains methylation data on 11 chromosomes. I used the following command:
methylpy DMRfind --allc-files guo_1.tsv hua_1.tsv --samples FR FL --mc-type "CGN" --chroms 1 2 3 4 5 6 7 8 9 10 11 --num-procs 8 --output-prefix DMR_FR_FL
But I got an error like this:
Filtering allc files using 2 node(s). Wed Jun 16 20:12:47 2021
Splitting allc files for chromosome 1 Wed Jun 16 20:12:56 2021
<class 'KeyError'> 179 '1' Running RMS tests failed.
I don't know what the problem is and would appreciate any advice. Thank you.
Maybe it is the index file. Can you check whether the chromosome information is correctly stored in the .idx files? For example, guo_1.tsv.idx and hua_1.tsv.idx.
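A check that does not depend on the .idx files is to list the chromosome names that actually appear in the first column of each allc file and compare them with the values passed to --chroms. The following is a minimal Python sketch, assuming tab-separated input with the chromosome in column 1; `chromosome_counts` is a hypothetical helper and not part of methylpy.

```python
def chromosome_counts(lines):
    """Count rows per chromosome name (first tab-separated column).

    `lines` is any iterable of text lines, for example an open file handle.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        chrom = line.split("\t")[0]
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts
```

For a file on disk, pass the open handle, for example `chromosome_counts(open("guo_1.tsv"))`; for a gzip-compressed file, `gzip.open(path, "rt")` yields lines in the same way. If a name given to --chroms is missing from the result, or the only key is an entire line, the file is probably not in the expected layout.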
I don't have an index file, and my data is not BS-seq data, so I did not run the build-reference, single-end processing, or paired-end processing steps. I only converted my methylation results into an allc-like format, so the seventh column of my "allc file" is simply set to 1. Can methylpy skip the earlier steps and only call DMRs?
Yes, DMRfind only needs allc files. Would you mind sharing the two files so that I can reproduce the error?
OK. If this step works, it will be very helpful for me. Thank you very much. What is your email address? My file is a little big.
Can you reproduce the error with only the first, say, 30 lines of the allc files?
OK. My methylation data is:
HUA.tsv
chr pos strand CG count_modified coverage
1 37 + GGCGG 3 4
1 38 - ACCGC 3 3
1 74 + ACCGC 5 5
1 75 - GGCGG 2 2
1 138 + GGCGG 3 4
1 206 + TTCGG 6 6
1 207 - GCCGA 3 3
1 210 + GCCGA 6 6
1 211 - ATCGG 5 5
1 222 + TTCGC 0 4
1 223 - AGCGA 3 3
1 228 + GACGG 4 5
1 229 - CCCGT 5 5
1 232 + GGCGG 4 4
1 233 - CCCGC 1 2
1 304 + CCCGA 1 1
1 305 - CTCGG 4 4
1 325 + ATCGC 5 5
1 326 - GGCGA 2 2
1 349 + ATCGG 5 5
1 350 - ACCGA 5 5
1 373 + GGCGG 3 4
1 374 - ACCGC 4 4
1 410 + CACGC 4 5
1 411 - GGCGT 5 5
1 418 + TTCGG 5 5
1 419 - GCCGA 5 5
1 483 + GGCGG 5 5
1 484 - ACCGC 3 3
1 503 + ATCGT 4 5
YE.tsv
1 37 + GGCGG 5 5
1 38 - ACCGC 3 5
1 74 + ACCGC 4 5
1 75 - GGCGG 3 3
1 138 + GGCGG 4 6
1 139 - ACCGC 3 3
1 206 + TTCGG 7 8
1 207 - GCCGA 6 6
1 210 + GCCGA 7 7
1 211 - ATCGG 7 7
1 222 + TTCGC 2 10
1 223 - AGCGA 5 5
1 228 + GACGG 8 8
1 229 - CCCGT 6 6
1 232 + GGCGG 9 9
1 233 - CCCGC 5 7
1 304 + CCCGA 5 5
1 305 - CTCGG 7 7
1 325 + ATCGC 7 7
1 326 - GGCGA 6 6
1 349 + ATCGG 8 8
1 350 - ACCGA 7 7
1 373 + GGCGG 8 8
1 374 - ACCGC 5 6
1 410 + CACGC 7 8
1 411 - GGCGT 7 7
1 418 + TTCGG 8 8
1 419 - GCCGA 4 4
1 483 + GGCGG 6 8
1 484 - ACCGC 6 6
But I have a problem. According to the tutorial, the seventh column of the allc file has to be calculated, but I cannot derive it from my data using the calculation in the tutorial. So I wrote 1 in the seventh column of the file.
The file format is:
1 37 + GGCGG 3 4 1
1 38 - ACCGC 3 3 1
1 74 + ACCGC 5 5 1
1 75 - GGCGG 2 2 1
1 138 + GGCGG 3 4 1
1 206 + TTCGG 6 6 1
1 207 - GCCGA 3 3 1
1 210 + GCCGA 6 6 1
1 211 - ATCGG 5 5 1
1 222 + TTCGC 0 4 1
1 223 - AGCGA 3 3 1
1 228 + GACGG 4 5 1
1 229 - CCCGT 5 5 1
1 232 + GGCGG 4 4 1
1 233 - CCCGC 1 2 1
1 304 + CCCGA 1 1 1
1 305 - CTCGG 4 4 1
1 325 + ATCGC 5 5 1
1 326 - GGCGA 2 2 1
1 349 + ATCGG 5 5 1
1 350 - ACCGA 5 5 1
1 373 + GGCGG 3 4 1
1 374 - ACCGC 4 4 1
1 410 + CACGC 4 5 1
1 411 - GGCGT 5 5 1
1 418 + TTCGG 5 5 1
1 419 - GCCGA 5 5 1
1 483 + GGCGG 5 5 1
1 484 - ACCGC 3 3 1
1 503 + ATCGT 4 5 1
My command is:
methylpy/bin/methylpy DMRfind --allc-files blue_guo_1.tsv blue_hua_1.tsv --samples FR FL --mc-type "CGN" --chroms 1 --output-prefix DMR_hua_1.tsv --samples guo_1 hua_1 --mc-type "CGN" --chroms 1 --output-prefix DMR_FR_FL
Filtering allc files using single node. Mon Jun 21 11:20:19 2021
Splitting allc files for chromosome 1 Mon Jun 21 11:20:19 2021
<class 'KeyError'> 179 '1' Running RMS tests failed.
Is the error caused by the seventh column of my allc file not being calculated?
I am also facing the same problem. Have you been able to find the cause of the error?
It is totally fine to set the last column to 1. The current issue is that the format of the context column (the 4th) in the input file is not supported by methylpy. Reformatting the sequence context as its last three bases should fix this problem. For example, ACCGC -> CGC, where the first C is the cytosine of interest.
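The context fix described above can be sketched in Python as follows. This is only an illustration, assuming a tab-separated file whose 4th column holds a five-base context with the cytosine in the middle; `trim_context` and `fix_allc_line` are hypothetical helper names, not methylpy functions.

```python
def trim_context(context):
    """Keep the last three bases of a five-base context, e.g. ACCGC -> CGC.

    Contexts of any other length are returned unchanged.
    """
    if len(context) == 5:
        return context[2:]
    return context


def fix_allc_line(line):
    """Rewrite the 4th tab-separated column of one allc line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated fields: %r" % line)
    fields[3] = trim_context(fields[3])
    return "\t".join(fields)
```

Applied to every line of the input, this would turn a row such as `1 38 - ACCGC 3 3 1` into `1 38 - CGC 3 3 1` while leaving the other columns untouched.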
Hi, is there a way to make the --chroms parameter (as in --chroms 1 2) accept names that are longer than one character? For example, my chromosome is named NC_037328.1, but the map function splits it into ["N", "C", " _", "0", "3", "7", "3", "2", "8", ".", "1" ]. That is the cause of my error. Is there a way to set this? My data is very large and I am reluctant to reformat it.
Methylpy should be able to handle chromosome names with more than one character, like chr1. Can you post the command you ran?
methylpy DMRfind \
--allc-files all_files/allc_ARS-UCD1_CTRL1.tsv.gz all_files/allc_ARS-UCD1_CTRL2.tsv.gz \
--samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2 \
--mc-type "CGN" \
--chroms NC_037328.1 \
--num-procs 64 \
--output-prefix DMR_CTRL1_CTRL2
What version of methylpy are you using? I am not able to reproduce your error. Below is what I tried. Input files are attached. Are you able to run the command below without error?
methylpy DMRfind --allc-files allc_sample_1.tsv.gz allc_sample_2.tsv.gz --samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2 --mc-type "CGN" --chroms NC_037328.1 --num-procs 64 --output-prefix DMR_CTRL1_CTRL2
Input files: allc_sample_1.tsv.gz allc_sample_2.tsv.gz
I am using methylpy version 1.4.3. The example you gave works for me too, but my own input does not.
This is the exact error that I get: Splitting allc files for chromosome NC_037328.1 Mon Jun 21 20:08:37 2021
<class 'KeyError'> 184 'NC_037328.1' Running RMS tests failed.
Would you mind sharing the first 20 lines of your allc files?
NC_037328.1,28599,+,CAG,0,1,1 NC_037328.1,34167,+,CTG,0,2,1 NC_037328.1,47181,-,CAT,0,1,1 NC_037328.1,134883,-,CAT,0,1,1 NC_037328.1,138299,-,CAT,0,2,1 NC_037328.1,138300,+,CCT,0,2,1 NC_037328.1,138301,+,CTG,0,2,1 NC_037328.1,138303,-,CAG,0,2,1 NC_037328.1,138306,-,CAT,0,2,1 NC_037328.1,138310,+,CAC,0,2,1 NC_037328.1,138312,+,CAG,0,2,1 NC_037328.1,138314,-,CTG,0,2,1 NC_037328.1,138317,+,CAA,0,2,1 NC_037328.1,138320,-,CTT,0,2,1 NC_037328.1,138322,-,CAC,0,2,1 NC_037328.1,140407,-,CTA,0,4,1 NC_037328.1,140408,-,CCT,0,4,1 NC_037328.1,140409,+,CAA,0,4,1 NC_037328.1,145179,-,CAG,0,1,1 NC_037328.1,145180,-,CCA,0,1,1 NC_037328.1,145868,-,CAA,0,3,1 NC_037328.1,146655,+,CAA,1,5,1 NC_037328.1,149309,-,CAG,0,1,1 NC_037328.1,149359,-,CAG,0,1,1 NC_037328.1,149361,-,CAC,0,1,1 NC_037328.1,149364,-,CAT,0,1,1 NC_037328.1,152099,-,CAT,0,1,1 NC_037328.1,152107,-,CTA,0,1,1 NC_037328.1,152109,-,CAC,0,1,1 NC_037328.1,153427,-,CAT,0,1,1 NC_037328.1,153435,-,CTA,0,1,1 NC_037328.1,153437,-,CAC,0,1,1 NC_037328.1,156494,-,CAT,0,1,1 NC_037328.1,156496,+,CTC,0,1,1 NC_037328.1,156498,+,CAT,0,1,1 NC_037328.1,156502,-,CTA,0,2,1 NC_037328.1,156504,-,CAC,0,2,1 NC_037328.1,156505,+,CAC,0,1,1 NC_037328.1,156507,+,CTC,0,1,1 NC_037328.1,156509,+,CTT,0,1,1 NC_037328.1,156512,+,CAC,0,1,1 NC_037328.1,157799,-,CAT,0,2,1 NC_037328.1,157801,+,CTC,0,2,1 NC_037328.1,157803,+,CAT,0,2,1 NC_037328.1,157807,-,CTA,0,2,1 NC_037328.1,157809,-,CAC,0,2,1 NC_037328.1,157810,+,CAC,0,2,1 NC_037328.1,157812,+,CTC,0,3,1 NC_037328.1,157814,+,CTT,0,3,1 NC_037328.1,157817,+,CAC,0,3,1 NC_037328.1,157819,+,CCT,0,3,1 NC_037328.1,158294,-,CAA,0,3,1 NC_037328.1,158509,-,CCT,0,7,1 NC_037328.1,158559,+,CAT,0,5,1 NC_037328.1,158562,+,CAA,0,5,1 NC_037328.1,158566,+,CAG,0,5,1 NC_037328.1,158590,+,CGC,4,5,1 NC_037328.1,158591,-,CGG,5,7,1 NC_037328.1,158592,+,CTA,0,5,1 NC_037328.1,158596,-,CTT,0,7,1 NC_037328.1,158597,+,CTG,0,5,1 NC_037328.1,158599,-,CAG,0,7,1 NC_037328.1,158600,-,CCA,0,7,1 NC_037328.1,158601,+,CAA,0,5,1 
NC_037328.1,158606,-,CAA,0,6,1 NC_037328.1,158608,+,CCA,0,5,1 NC_037328.1,158609,+,CAG,0,5,1 NC_037328.1,158611,-,CTG,0,6,1 NC_037328.1,158612,+,CTG,0,5,1 NC_037328.1,158614,-,CAG,0,6,1 NC_037328.1,158617,-,CAT,0,6,1 NC_037328.1,158619,+,CCA,0,5,1 NC_037328.1,158620,+,CAA,0,5,1 NC_037328.1,158623,-,CTT,0,3,1 NC_037328.1,159987,+,CTG,0,4,1 NC_037328.1,159989,-,CAG,0,8,1 NC_037328.1,161149,+,CAT,0,6,1 NC_037328.1,161153,+,CTG,0,7,1 NC_037328.1,161155,-,CAG,0,1,1 NC_037328.1,161156,+,CTA,0,7,1 NC_037328.1,161160,-,CTT,0,1,1 NC_037328.1,161161,+,CTG,0,6,1 NC_037328.1,161163,-,CAG,0,1,1 NC_037328.1,161165,+,CAA,0,6,1 NC_037328.1,161169,+,CAT,0,6,1 NC_037328.1,161172,+,CAA,1,6,1 NC_037328.1,161176,+,CAG,0,4,1 NC_037328.1,161229,+,CCA,0,2,1 NC_037328.1,161230,+,CAG,0,2,1 NC_037328.1,161287,+,CAT,0,2,1 NC_037328.1,161291,+,CAC,0,2,1 NC_037328.1,161293,+,CCT,0,2,1 NC_037328.1,161294,+,CTC,0,2,1 NC_037328.1,161296,+,CAA,0,2,1 NC_037328.1,162233,+,CTG,0,1,1 NC_037328.1,162235,-,CAG,0,5,1 NC_037328.1,162237,+,CCA,0,1,1 NC_037328.1,163011,-,CTG,0,3,1 NC_037328.1,163141,+,CAT,0,5,1 NC_037328.1,163144,-,CAT,0,3,1 NC_037328.1,163145,+,CTG,0,5,1 NC_037328.1,163147,-,CAG,0,3,1 NC_037328.1,163149,-,CTC,0,3,1 NC_037328.1,163151,+,CGT,4,5,1 NC_037328.1,163152,-,CGT,2,3,1 NC_037328.1,163154,-,CAC,0,3,1 NC_037328.1,163156,-,CAC,0,3,1 NC_037328.1,163160,+,CCT,0,6,1 NC_037328.1,163161,+,CTG,0,6,1 NC_037328.1,163163,-,CAG,0,3,1 NC_037328.1,163164,+,CTT,0,6,1 NC_037328.1,163168,+,CTC,0,6,1 NC_037328.1,163170,+,CAG,0,6,1 NC_037328.1,163172,-,CTG,0,3,1 NC_037328.1,163173,+,CTG,0,6,1 NC_037328.1,163175,-,CAG,0,3,1 NC_037328.1,163176,+,CTG,0,6,1 NC_037328.1,163321,+,CGC,4,5,1 NC_037328.1,163322,-,CGC,1,1,1 NC_037328.1,163323,+,CAT,0,4,1 NC_037328.1,163347,+,CAT,0,4,1 NC_037328.1,163351,+,CAG,0,4,1 NC_037328.1,163353,-,CTG,0,1,1 NC_037328.1,163355,+,CAC,0,4,1 NC_037328.1,163357,+,CGT,2,4,1 NC_037328.1,163358,-,CGT,1,2,1 NC_037328.1,163360,-,CAC,0,1,1 NC_037328.1,163361,+,CTC,0,4,1 
NC_037328.1,163363,+,CAC,0,4,1 NC_037328.1,163365,+,CTA,0,4,1 NC_037328.1,163368,+,CCT,0,4,1 NC_037328.1,163369,+,CTG,0,4,1 NC_037328.1,163371,-,CAG,0,1,1 NC_037328.1,163372,+,CTC,0,4,1 NC_037328.1,163374,+,CAG,0,4,1 NC_037328.1,163376,-,CTG,0,1,1 NC_037328.1,163378,+,CAT,0,4,1 NC_037328.1,163463,-,CAA,0,1,1 NC_037328.1,163465,-,CAC,0,1,1 NC_037328.1,163466,+,CAA,0,2,1 NC_037328.1,163470,+,CAA,0,2,1 NC_037328.1,163473,-,CTT,0,1,1 NC_037328.1,163474,-,CCT,0,1,1 NC_037328.1,163475,-,CCC,0,1,1 NC_037328.1,163478,+,CAG,0,2,1 NC_037328.1,163480,-,CTG,0,1,1 NC_037328.1,163481,+,CAC,0,2,1 NC_037328.1,163483,+,CAT,0,2,1 NC_037328.1,163570,-,CTT,0,1,1 NC_037328.1,163572,-,CAC,0,1,1 NC_037328.1,163858,-,CGA,3,3,1 NC_037328.1,163859,-,CCG,0,3,1 NC_037328.1,163860,-,CCC,0,3,1 NC_037328.1,163863,+,CAG,0,3,1 NC_037328.1,163865,-,CTG,0,3,1 NC_037328.1,163867,-,CAC,0,3,1 NC_037328.1,163868,+,CAT,0,3,1 NC_037328.1,164178,-,CAA,0,4,1 NC_037328.1,164499,+,CAG,0,2,1 NC_037328.1,164501,-,CTG,0,2,1 NC_037328.1,164662,+,CAC,0,5,1 NC_037328.1,164664,+,CGA,4,5,1 NC_037328.1,164668,+,CTG,0,5,1 NC_037328.1,164671,+,CCA,0,5,1 NC_037328.1,164672,+,CAC,0,5,1 NC_037328.1,164674,+,CGT,1,5,1 NC_037328.1,164843,-,CAT,1,3,1 NC_037328.1,165849,-,CAC,0,3,1 NC_037328.1,166080,+,CAT,0,3,1 NC_037328.1,166083,+,CTG,0,3,1 NC_037328.1,166085,-,CAG,0,2,1 NC_037328.1,166086,+,CTT,0,3,1 NC_037328.1,166383,-,CAA,0,2,1 NC_037328.1,166384,+,CGT,0,1,1 NC_037328.1,166385,-,CGC,1,2,1 NC_037328.1,166387,-,CAC,0,2,1 NC_037328.1,166392,-,CGA,1,2,1 NC_037328.1,166396,-,CGA,1,2,1 NC_037328.1,166398,-,CAC,0,2,1 NC_037328.1,166403,-,CTG,0,2,1 NC_037328.1,166707,+,CAT,0,1,1 NC_037328.1,166712,+,CAT,0,1,1 NC_037328.1,168185,+,CAA,0,6,1 NC_037328.1,168188,-,CTT,0,2,1 NC_037328.1,168189,+,CCT,0,6,1 NC_037328.1,168190,+,CTC,0,6,1 NC_037328.1,168192,+,CAG,0,6,1 NC_037328.1,168194,-,CTG,0,2,1 NC_037328.1,168195,-,CCT,0,2,1 NC_037328.1,168681,+,CTC,0,2,1 NC_037328.1,168683,+,CTG,0,2,1 NC_037328.1,168685,-,CAG,0,3,1 
NC_037328.1,168686,-,CCA,0,3,1 NC_037328.1,168691,-,CTT,0,3,1 NC_037328.1,168692,+,CTG,0,2,1 NC_037328.1,168694,-,CAG,0,3,1 NC_037328.1,168769,+,CCA,0,2,1 NC_037328.1,168770,+,CAC,0,2,1 NC_037328.1,168810,+,CGA,2,2,1 NC_037328.1,168815,+,CAT,0,2,1 NC_037328.1,168819,+,CTT,0,1,1
Ah, the fields need to be tab separated. Can we try fixing the format and running DMRfind?
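One way to confirm the delimiter before rerunning DMRfind is to check each line for exactly seven tab-separated fields. The following is a minimal Python sketch under that assumption (chromosome, position, strand, context, methylated count, coverage, and the seventh column discussed earlier in this thread); `check_allc_line` and `first_problems` are hypothetical helpers, not part of methylpy.

```python
def check_allc_line(line):
    """Return a list of problems found in one allc line (empty if none)."""
    problems = []
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        problems.append(
            "expected 7 tab-separated fields, found %d" % len(fields)
        )
        return problems
    chrom, pos, strand, context, mc, cov, methylated = fields
    if not pos.isdigit():
        problems.append("position is not an integer: %r" % pos)
    if strand not in ("+", "-"):
        problems.append("strand is not + or -: %r" % strand)
    if not (mc.isdigit() and cov.isdigit()):
        problems.append("mc and cov must be integers")
    elif int(mc) > int(cov):
        problems.append("mc is larger than cov")
    return problems


def first_problems(lines, limit=5):
    """Return up to `limit` (line_number, problems) pairs."""
    found = []
    for number, line in enumerate(lines, start=1):
        problems = check_allc_line(line)
        if problems:
            found.append((number, problems))
            if len(found) >= limit:
                break
    return found
```

A comma-separated or space-separated line is reported as having a single field, which makes a wrong delimiter easy to spot.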
It is tab-separated; I only replaced the separator with commas when posting it here.
However, printing the lines gives this:
NC_037328.1 28599 + CAG 0 1 1
NC_037328.1 34167 + CTG 0 2 1
NC_037328.1 47181 - CAT 0 1 1
NC_037328.1 134883 - CAT 0 1 1
NC_037328.1 138299 - CAT 0 2 1
NC_037328.1 138300 + CCT 0 2 1
NC_037328.1 138301 + CTG 0 2 1
NC_037328.1 138303 - CAG 0 2 1
NC_037328.1 138306 - CAT 0 2 1
NC_037328.1 138310 + CAC 0 2 1
NC_037328.1 138312 + CAG 0 2 1
NC_037328.1 138314 - CTG 0 2 1
NC_037328.1 138317 + CAA 0 2 1
NC_037328.1 138320 - CTT 0 2 1
NC_037328.1 138322 - CAC 0 2 1
NC_037328.1 140407 - CTA 0 4 1
NC_037328.1 140408 - CCT 0 4 1
NC_037328.1 140409 + CAA 0 4 1
NC_037328.1 145179 - CAG 0 1 1
NC_037328.1 145180 - CCA 0 1 1
NC_037328.1 145868 - CAA 0 3 1
NC_037328.1 146655 + CAA 1 5 1
NC_037328.1 149309 - CAG 0 1 1
Could the extra space be the problem?
It seems that there are no CGN sites in your allc files. Is that correct? If so, that could be the cause of the problem.
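To check whether a file contains any sites of a given --mc-type, one can count the rows whose context column matches the pattern. The sketch below is an approximation written for this thread, not methylpy's own matching code: it assumes tab-separated input with the context in the 4th column and expands only the IUPAC codes N (any base) and H (A, C, or T). `count_matching_sites` is a hypothetical helper.

```python
import re

# IUPAC codes commonly used in cytosine context patterns such as CGN or CHH.
IUPAC = {
    "A": "A",
    "C": "C",
    "G": "G",
    "T": "T",
    "N": "[ACGT]",
    "H": "[ACT]",
}


def mc_type_pattern(mc_type):
    """Translate a pattern such as 'CGN' into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in mc_type.upper()))


def count_matching_sites(lines, mc_type):
    """Count rows whose 4th tab-separated column matches `mc_type` exactly."""
    pattern = mc_type_pattern(mc_type)
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and pattern.fullmatch(fields[3].upper()):
            count += 1
    return count
```

A count of zero for "CGN" on the chromosome passed to --chroms would be consistent with the explanation above.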
Do you also get the same error by running this?
methylpy DMRfind \
--allc-files all_files/allc_ARS-UCD1_CTRL1.tsv.gz all_files/allc_ARS-UCD1_CTRL2.tsv.gz \
--samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2 \
--mc-type "CAG" \
--chroms NC_037328.1 \
--num-procs 64 \
--output-prefix DMR_CTRL1_CTRL2