Running RMS tests failed.

Open chenji333 opened this issue 3 years ago • 18 comments

Hello, I want to use methylpy to call DMRs. I converted my methylation data to the allc file format:

8 7524770 + CTCGC 15 15 1
8 7524782 + GGCGC 19 19 1
8 7524784 + CGCGA 20 20 1
8 7524822 + GCCGC 21 21 1
8 7524826 + CACGC 21 21 1
8 7524867 + AACGC 21 21 1

My methylation file contains methylation data on 11 chromosomes. I used the following command:

methylpy DMRfind --allc-files guo_1.tsv hua_1.tsv --samples FR FL --mc-type "CGN" --chroms 1 2 3 4 5 6 7 8 9 10 11 --num-procs 8 --output-prefix DMR_FR_FL

But I got an error like this:

Filtering allc files using 2 node(s). Wed Jun 16 20:12:47 2021

Splitting allc files for chromosome 1 Wed Jun 16 20:12:56 2021

<class 'KeyError'> 179 '1'
Running RMS tests failed.

I don't know what the problem is. Any advice would be appreciated. Thank you.

chenji333 avatar Jun 16 '21 12:06 chenji333

Maybe it is the index file. Can you check whether the chromosome information is correctly stored in .idx files? For example, guo_1.tsv.idx and hua_1.tsv.idx.
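
A quick sanity check related to the suggestion above is to list which chromosome names actually occur in the first column of each allc file and compare them with the values passed to --chroms. The sketch below reads the allc file itself rather than the .idx file, because the layout of the .idx file is not described in this thread. The function names `chrom_counts` and `missing_chroms` are hypothetical helpers, not part of methylpy.

```python
import gzip
from collections import Counter


def chrom_counts(path):
    """Count rows per chromosome name (first tab-separated field) in an allc file."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0]:  # ignore empty lines
                counts[fields[0]] += 1
    return counts


def missing_chroms(counts, requested):
    """Return the requested chromosome names that never appear in the file."""
    return [name for name in requested if name not in counts]
```

For example, `missing_chroms(chrom_counts("guo_1.tsv"), ["1", "2"])` returns the names from the --chroms list that are absent from that file. If a file is not tab separated, the whole line ends up as the "chromosome name", which makes that kind of formatting problem easy to spot.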

yupenghe avatar Jun 17 '21 04:06 yupenghe

I don't have an index file, and my data is not BS-seq data, so I did not run build-reference or the single-end and paired-end processing steps. I just converted my methylation results to an allc-like file format, so the seventh column of my "allc file" is simply set to 1. Can methylpy skip the previous steps and only call DMRs?

chenji333 avatar Jun 17 '21 07:06 chenji333

Yes, DMRfind only needs allc files. Would you mind sharing the two files so that I can reproduce the error?

yupenghe avatar Jun 19 '21 05:06 yupenghe

OK. If this step works, it will be very helpful for me. Thank you very much. What is your email address? My files are a little big.

chenji333 avatar Jun 19 '21 14:06 chenji333

Can you reproduce the error with the first, say, 30 lines of the allc files?

yupenghe avatar Jun 19 '21 16:06 yupenghe

OK. My methylation data is:

HUA.tsv

chr pos strand CG count_modified coverage
1 37 + GGCGG 3 4
1 38 - ACCGC 3 3
1 74 + ACCGC 5 5
1 75 - GGCGG 2 2
1 138 + GGCGG 3 4
1 206 + TTCGG 6 6
1 207 - GCCGA 3 3
1 210 + GCCGA 6 6
1 211 - ATCGG 5 5
1 222 + TTCGC 0 4
1 223 - AGCGA 3 3
1 228 + GACGG 4 5
1 229 - CCCGT 5 5
1 232 + GGCGG 4 4
1 233 - CCCGC 1 2
1 304 + CCCGA 1 1
1 305 - CTCGG 4 4
1 325 + ATCGC 5 5
1 326 - GGCGA 2 2
1 349 + ATCGG 5 5
1 350 - ACCGA 5 5
1 373 + GGCGG 3 4
1 374 - ACCGC 4 4
1 410 + CACGC 4 5
1 411 - GGCGT 5 5
1 418 + TTCGG 5 5
1 419 - GCCGA 5 5
1 483 + GGCGG 5 5
1 484 - ACCGC 3 3
1 503 + ATCGT 4 5

YE.tsv

1 37 + GGCGG 5 5
1 38 - ACCGC 3 5
1 74 + ACCGC 4 5
1 75 - GGCGG 3 3
1 138 + GGCGG 4 6
1 139 - ACCGC 3 3
1 206 + TTCGG 7 8
1 207 - GCCGA 6 6
1 210 + GCCGA 7 7
1 211 - ATCGG 7 7
1 222 + TTCGC 2 10
1 223 - AGCGA 5 5
1 228 + GACGG 8 8
1 229 - CCCGT 6 6
1 232 + GGCGG 9 9
1 233 - CCCGC 5 7
1 304 + CCCGA 5 5
1 305 - CTCGG 7 7
1 325 + ATCGC 7 7
1 326 - GGCGA 6 6
1 349 + ATCGG 8 8
1 350 - ACCGA 7 7
1 373 + GGCGG 8 8
1 374 - ACCGC 5 6
1 410 + CACGC 7 8
1 411 - GGCGT 7 7
1 418 + TTCGG 8 8
1 419 - GCCGA 4 4
1 483 + GGCGG 6 8
1 484 - ACCGC 6 6

But I have a problem. The seventh column of the allc file described in the tutorial has to be calculated, but there is no way to derive it from my files using the calculation in the tutorial. So I wrote 1 in the seventh column of each file.
The file format is:

1 37 + GGCGG 3 4 1
1 38 - ACCGC 3 3 1
1 74 + ACCGC 5 5 1
1 75 - GGCGG 2 2 1
1 138 + GGCGG 3 4 1
1 206 + TTCGG 6 6 1
1 207 - GCCGA 3 3 1
1 210 + GCCGA 6 6 1
1 211 - ATCGG 5 5 1
1 222 + TTCGC 0 4 1
1 223 - AGCGA 3 3 1
1 228 + GACGG 4 5 1
1 229 - CCCGT 5 5 1
1 232 + GGCGG 4 4 1
1 233 - CCCGC 1 2 1
1 304 + CCCGA 1 1 1
1 305 - CTCGG 4 4 1
1 325 + ATCGC 5 5 1
1 326 - GGCGA 2 2 1
1 349 + ATCGG 5 5 1
1 350 - ACCGA 5 5 1
1 373 + GGCGG 3 4 1
1 374 - ACCGC 4 4 1
1 410 + CACGC 4 5 1
1 411 - GGCGT 5 5 1
1 418 + TTCGG 5 5 1
1 419 - GCCGA 5 5 1
1 483 + GGCGG 5 5 1
1 484 - ACCGC 3 3 1
1 503 + ATCGT 4 5 1

My command is:

methylpy/bin/methylpy DMRfind --allc-files blue_guo_1.tsv blue_hua_1.tsv --samples FR FL --mc-type "CGN" --chroms 1 --output-prefix DMR_hua_1.tsv --samples guo_1 hua_1 --mc-type "CGN" --chroms 1 --output-prefix DMR_FR_FL

Filtering allc files using single node. Mon Jun 21 11:20:19 2021

Splitting allc files for chromosome 1 Mon Jun 21 11:20:19 2021

<class 'KeyError'> 179 '1'
Running RMS tests failed.

Is this because the seventh column of my allc file was not calculated?

chenji333 avatar Jun 21 '21 03:06 chenji333

I am facing the same problem. Have you been able to find the cause of the error?

frimpz avatar Jun 21 '21 13:06 frimpz

It is totally fine to set the last column to 1. The issue is that the format of the context column (the 4th) in the input file is not supported by methylpy. Reformatting the sequence context as the last three bases should fix this problem. For example, ACCGC -> CGC, where the first C is the cytosine of interest.
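
The reformatting described above could be scripted along the following lines. This is a minimal sketch, not part of methylpy: it assumes whitespace-separated input rows like the ones posted earlier, with a five-base context whose middle base is the cytosine of interest, and it writes tab-separated rows with 1 in the seventh column when that column is missing. The names `to_allc_row` and `convert` are hypothetical.

```python
def to_allc_row(fields):
    """Build a tab-separated 7-column row from a 6- or 7-column input row."""
    chrom, pos, strand, context, mc, cov = fields[:6]
    # Assumption: in a five-base context the cytosine is the middle base,
    # so the last three bases are kept (ACCGC -> CGC).
    if len(context) == 5:
        context = context[2:]
    # Use 1 as a placeholder for the seventh column when it is absent.
    last = fields[6] if len(fields) > 6 else "1"
    return "\t".join([chrom, pos, strand, context, mc, cov, last])


def convert(in_path, out_path):
    """Rewrite a whitespace-separated table as tab-separated allc-style rows."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) < 6 or not fields[1].isdigit():
                continue  # skip blank lines and a header such as "chr pos strand ..."
            dst.write(to_allc_row(fields) + "\n")
```

For instance, the row `1 38 - ACCGC 3 3` becomes `1`, `38`, `-`, `CGC`, `3`, `3`, `1` joined by tabs.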

yupenghe avatar Jun 22 '21 04:06 yupenghe

Hi, is there a way to make the --chroms parameter accept a name longer than one character? For example, my data is formatted as NC_037328.1, but the map function splits it into ["N", "C", "_", "0", "3", "7", "3", "2", "8", ".", "1"]. That is the cause of my error. Is there a way to set this? My data is very large and I am reluctant to reformat it.

frimpz avatar Jun 22 '21 05:06 frimpz

methylpy should be able to handle chromosome names longer than one character, such as chr1. Can you post the command you ran?

yupenghe avatar Jun 22 '21 05:06 yupenghe

methylpy DMRfind
--allc-files all_files/allc_ARS-UCD1_CTRL1.tsv.gz all_files/allc_ARS-UCD1_CTRL2.tsv.gz
--samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2
--mc-type "CGN"
--chroms NC_037328.1
--num-procs 64
--output-prefix DMR_CTRL1_CTRL2

frimpz avatar Jun 22 '21 05:06 frimpz

What version of methylpy are you using? I am not able to reproduce your error. Below is what I tried. The input files are attached. Are you able to run the command below without error?

methylpy DMRfind --allc-files allc_sample_1.tsv.gz allc_sample_2.tsv.gz --samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2 --mc-type "CGN" --chroms NC_037328.1 --num-procs 64 --output-prefix DMR_CTRL1_CTRL2

Input files: allc_sample_1.tsv.gz allc_sample_2.tsv.gz

yupenghe avatar Jun 22 '21 06:06 yupenghe

I am using methylpy version 1.4.3. The example you gave works for me too, but my own input does not.

This is the exact error that I get:

Splitting allc files for chromosome NC_037328.1 Mon Jun 21 20:08:37 2021

<class 'KeyError'> 184 'NC_037328.1'
Running RMS tests failed.

frimpz avatar Jun 22 '21 06:06 frimpz

Would you mind sharing the first 20 lines of your allc files?

yupenghe avatar Jun 22 '21 06:06 yupenghe

NC_037328.1,28599,+,CAG,0,1,1 NC_037328.1,34167,+,CTG,0,2,1 NC_037328.1,47181,-,CAT,0,1,1 NC_037328.1,134883,-,CAT,0,1,1 NC_037328.1,138299,-,CAT,0,2,1 NC_037328.1,138300,+,CCT,0,2,1 NC_037328.1,138301,+,CTG,0,2,1 NC_037328.1,138303,-,CAG,0,2,1 NC_037328.1,138306,-,CAT,0,2,1 NC_037328.1,138310,+,CAC,0,2,1 NC_037328.1,138312,+,CAG,0,2,1 NC_037328.1,138314,-,CTG,0,2,1 NC_037328.1,138317,+,CAA,0,2,1 NC_037328.1,138320,-,CTT,0,2,1 NC_037328.1,138322,-,CAC,0,2,1 NC_037328.1,140407,-,CTA,0,4,1 NC_037328.1,140408,-,CCT,0,4,1 NC_037328.1,140409,+,CAA,0,4,1 NC_037328.1,145179,-,CAG,0,1,1 NC_037328.1,145180,-,CCA,0,1,1 NC_037328.1,145868,-,CAA,0,3,1 NC_037328.1,146655,+,CAA,1,5,1 NC_037328.1,149309,-,CAG,0,1,1 NC_037328.1,149359,-,CAG,0,1,1 NC_037328.1,149361,-,CAC,0,1,1 NC_037328.1,149364,-,CAT,0,1,1 NC_037328.1,152099,-,CAT,0,1,1 NC_037328.1,152107,-,CTA,0,1,1 NC_037328.1,152109,-,CAC,0,1,1 NC_037328.1,153427,-,CAT,0,1,1 NC_037328.1,153435,-,CTA,0,1,1 NC_037328.1,153437,-,CAC,0,1,1 NC_037328.1,156494,-,CAT,0,1,1 NC_037328.1,156496,+,CTC,0,1,1 NC_037328.1,156498,+,CAT,0,1,1 NC_037328.1,156502,-,CTA,0,2,1 NC_037328.1,156504,-,CAC,0,2,1 NC_037328.1,156505,+,CAC,0,1,1 NC_037328.1,156507,+,CTC,0,1,1 NC_037328.1,156509,+,CTT,0,1,1 NC_037328.1,156512,+,CAC,0,1,1 NC_037328.1,157799,-,CAT,0,2,1 NC_037328.1,157801,+,CTC,0,2,1 NC_037328.1,157803,+,CAT,0,2,1 NC_037328.1,157807,-,CTA,0,2,1 NC_037328.1,157809,-,CAC,0,2,1 NC_037328.1,157810,+,CAC,0,2,1 NC_037328.1,157812,+,CTC,0,3,1 NC_037328.1,157814,+,CTT,0,3,1 NC_037328.1,157817,+,CAC,0,3,1 NC_037328.1,157819,+,CCT,0,3,1 NC_037328.1,158294,-,CAA,0,3,1 NC_037328.1,158509,-,CCT,0,7,1 NC_037328.1,158559,+,CAT,0,5,1 NC_037328.1,158562,+,CAA,0,5,1 NC_037328.1,158566,+,CAG,0,5,1 NC_037328.1,158590,+,CGC,4,5,1 NC_037328.1,158591,-,CGG,5,7,1 NC_037328.1,158592,+,CTA,0,5,1 NC_037328.1,158596,-,CTT,0,7,1 NC_037328.1,158597,+,CTG,0,5,1 NC_037328.1,158599,-,CAG,0,7,1 NC_037328.1,158600,-,CCA,0,7,1 NC_037328.1,158601,+,CAA,0,5,1 
NC_037328.1,158606,-,CAA,0,6,1 NC_037328.1,158608,+,CCA,0,5,1 NC_037328.1,158609,+,CAG,0,5,1 NC_037328.1,158611,-,CTG,0,6,1 NC_037328.1,158612,+,CTG,0,5,1 NC_037328.1,158614,-,CAG,0,6,1 NC_037328.1,158617,-,CAT,0,6,1 NC_037328.1,158619,+,CCA,0,5,1 NC_037328.1,158620,+,CAA,0,5,1 NC_037328.1,158623,-,CTT,0,3,1 NC_037328.1,159987,+,CTG,0,4,1 NC_037328.1,159989,-,CAG,0,8,1 NC_037328.1,161149,+,CAT,0,6,1 NC_037328.1,161153,+,CTG,0,7,1 NC_037328.1,161155,-,CAG,0,1,1 NC_037328.1,161156,+,CTA,0,7,1 NC_037328.1,161160,-,CTT,0,1,1 NC_037328.1,161161,+,CTG,0,6,1 NC_037328.1,161163,-,CAG,0,1,1 NC_037328.1,161165,+,CAA,0,6,1 NC_037328.1,161169,+,CAT,0,6,1 NC_037328.1,161172,+,CAA,1,6,1 NC_037328.1,161176,+,CAG,0,4,1 NC_037328.1,161229,+,CCA,0,2,1 NC_037328.1,161230,+,CAG,0,2,1 NC_037328.1,161287,+,CAT,0,2,1 NC_037328.1,161291,+,CAC,0,2,1 NC_037328.1,161293,+,CCT,0,2,1 NC_037328.1,161294,+,CTC,0,2,1 NC_037328.1,161296,+,CAA,0,2,1 NC_037328.1,162233,+,CTG,0,1,1 NC_037328.1,162235,-,CAG,0,5,1 NC_037328.1,162237,+,CCA,0,1,1 NC_037328.1,163011,-,CTG,0,3,1 NC_037328.1,163141,+,CAT,0,5,1 NC_037328.1,163144,-,CAT,0,3,1 NC_037328.1,163145,+,CTG,0,5,1 NC_037328.1,163147,-,CAG,0,3,1 NC_037328.1,163149,-,CTC,0,3,1 NC_037328.1,163151,+,CGT,4,5,1 NC_037328.1,163152,-,CGT,2,3,1 NC_037328.1,163154,-,CAC,0,3,1 NC_037328.1,163156,-,CAC,0,3,1 NC_037328.1,163160,+,CCT,0,6,1 NC_037328.1,163161,+,CTG,0,6,1 NC_037328.1,163163,-,CAG,0,3,1 NC_037328.1,163164,+,CTT,0,6,1 NC_037328.1,163168,+,CTC,0,6,1 NC_037328.1,163170,+,CAG,0,6,1 NC_037328.1,163172,-,CTG,0,3,1 NC_037328.1,163173,+,CTG,0,6,1 NC_037328.1,163175,-,CAG,0,3,1 NC_037328.1,163176,+,CTG,0,6,1 NC_037328.1,163321,+,CGC,4,5,1 NC_037328.1,163322,-,CGC,1,1,1 NC_037328.1,163323,+,CAT,0,4,1 NC_037328.1,163347,+,CAT,0,4,1 NC_037328.1,163351,+,CAG,0,4,1 NC_037328.1,163353,-,CTG,0,1,1 NC_037328.1,163355,+,CAC,0,4,1 NC_037328.1,163357,+,CGT,2,4,1 NC_037328.1,163358,-,CGT,1,2,1 NC_037328.1,163360,-,CAC,0,1,1 NC_037328.1,163361,+,CTC,0,4,1 
NC_037328.1,163363,+,CAC,0,4,1 NC_037328.1,163365,+,CTA,0,4,1 NC_037328.1,163368,+,CCT,0,4,1 NC_037328.1,163369,+,CTG,0,4,1 NC_037328.1,163371,-,CAG,0,1,1 NC_037328.1,163372,+,CTC,0,4,1 NC_037328.1,163374,+,CAG,0,4,1 NC_037328.1,163376,-,CTG,0,1,1 NC_037328.1,163378,+,CAT,0,4,1 NC_037328.1,163463,-,CAA,0,1,1 NC_037328.1,163465,-,CAC,0,1,1 NC_037328.1,163466,+,CAA,0,2,1 NC_037328.1,163470,+,CAA,0,2,1 NC_037328.1,163473,-,CTT,0,1,1 NC_037328.1,163474,-,CCT,0,1,1 NC_037328.1,163475,-,CCC,0,1,1 NC_037328.1,163478,+,CAG,0,2,1 NC_037328.1,163480,-,CTG,0,1,1 NC_037328.1,163481,+,CAC,0,2,1 NC_037328.1,163483,+,CAT,0,2,1 NC_037328.1,163570,-,CTT,0,1,1 NC_037328.1,163572,-,CAC,0,1,1 NC_037328.1,163858,-,CGA,3,3,1 NC_037328.1,163859,-,CCG,0,3,1 NC_037328.1,163860,-,CCC,0,3,1 NC_037328.1,163863,+,CAG,0,3,1 NC_037328.1,163865,-,CTG,0,3,1 NC_037328.1,163867,-,CAC,0,3,1 NC_037328.1,163868,+,CAT,0,3,1 NC_037328.1,164178,-,CAA,0,4,1 NC_037328.1,164499,+,CAG,0,2,1 NC_037328.1,164501,-,CTG,0,2,1 NC_037328.1,164662,+,CAC,0,5,1 NC_037328.1,164664,+,CGA,4,5,1 NC_037328.1,164668,+,CTG,0,5,1 NC_037328.1,164671,+,CCA,0,5,1 NC_037328.1,164672,+,CAC,0,5,1 NC_037328.1,164674,+,CGT,1,5,1 NC_037328.1,164843,-,CAT,1,3,1 NC_037328.1,165849,-,CAC,0,3,1 NC_037328.1,166080,+,CAT,0,3,1 NC_037328.1,166083,+,CTG,0,3,1 NC_037328.1,166085,-,CAG,0,2,1 NC_037328.1,166086,+,CTT,0,3,1 NC_037328.1,166383,-,CAA,0,2,1 NC_037328.1,166384,+,CGT,0,1,1 NC_037328.1,166385,-,CGC,1,2,1 NC_037328.1,166387,-,CAC,0,2,1 NC_037328.1,166392,-,CGA,1,2,1 NC_037328.1,166396,-,CGA,1,2,1 NC_037328.1,166398,-,CAC,0,2,1 NC_037328.1,166403,-,CTG,0,2,1 NC_037328.1,166707,+,CAT,0,1,1 NC_037328.1,166712,+,CAT,0,1,1 NC_037328.1,168185,+,CAA,0,6,1 NC_037328.1,168188,-,CTT,0,2,1 NC_037328.1,168189,+,CCT,0,6,1 NC_037328.1,168190,+,CTC,0,6,1 NC_037328.1,168192,+,CAG,0,6,1 NC_037328.1,168194,-,CTG,0,2,1 NC_037328.1,168195,-,CCT,0,2,1 NC_037328.1,168681,+,CTC,0,2,1 NC_037328.1,168683,+,CTG,0,2,1 NC_037328.1,168685,-,CAG,0,3,1 
NC_037328.1,168686,-,CCA,0,3,1 NC_037328.1,168691,-,CTT,0,3,1 NC_037328.1,168692,+,CTG,0,2,1 NC_037328.1,168694,-,CAG,0,3,1 NC_037328.1,168769,+,CCA,0,2,1 NC_037328.1,168770,+,CAC,0,2,1 NC_037328.1,168810,+,CGA,2,2,1 NC_037328.1,168815,+,CAT,0,2,1 NC_037328.1,168819,+,CTT,0,1,1

frimpz avatar Jun 22 '21 06:06 frimpz

Ah, the fields need to be tab separated. Can we try fixing the format and running DMRfind?
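
One way to check the separator before rerunning DMRfind is to split a few lines on tabs and count the fields. The sketch below is an illustration only and its function names are hypothetical. It assumes an allc row has seven fields, as in the rows posted in this thread, and that no field contains a comma, so replacing commas with tabs is safe.

```python
def is_tab_separated_allc(line, n_fields=7):
    """Return True when the line splits into exactly n_fields on tab characters."""
    return len(line.rstrip("\n").split("\t")) == n_fields


def commas_to_tabs(line):
    """Convert a comma-separated row to a tab-separated one."""
    # Assumption: no allc field contains a comma.
    return line.rstrip("\n").replace(",", "\t")
```

A line such as `NC_037328.1,28599,+,CAG,0,1,1` fails the check as posted and passes after `commas_to_tabs`.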

yupenghe avatar Jun 22 '21 06:06 yupenghe

It is tab-separated; I replaced the separator with commas when posting it here.

However, printing the lines gives this:

NC_037328.1 28599 + CAG 0 1 1

NC_037328.1 34167 + CTG 0 2 1

NC_037328.1 47181 - CAT 0 1 1

NC_037328.1 134883 - CAT 0 1 1

NC_037328.1 138299 - CAT 0 2 1

NC_037328.1 138300 + CCT 0 2 1

NC_037328.1 138301 + CTG 0 2 1

NC_037328.1 138303 - CAG 0 2 1

NC_037328.1 138306 - CAT 0 2 1

NC_037328.1 138310 + CAC 0 2 1

NC_037328.1 138312 + CAG 0 2 1

NC_037328.1 138314 - CTG 0 2 1

NC_037328.1 138317 + CAA 0 2 1

NC_037328.1 138320 - CTT 0 2 1

NC_037328.1 138322 - CAC 0 2 1

NC_037328.1 140407 - CTA 0 4 1

NC_037328.1 140408 - CCT 0 4 1

NC_037328.1 140409 + CAA 0 4 1

NC_037328.1 145179 - CAG 0 1 1

NC_037328.1 145180 - CCA 0 1 1

NC_037328.1 145868 - CAA 0 3 1

NC_037328.1 146655 + CAA 1 5 1

NC_037328.1 149309 - CAG 0 1 1

Could the extra space be the problem?

frimpz avatar Jun 22 '21 06:06 frimpz

It seems that there are no CGN sites in your allc files. Is that correct? If so, that could be the cause of the problem.
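
To check this on the full file rather than by eye, the contexts that match a given --mc-type pattern can be counted. The sketch below is not methylpy's own matching code. It assumes the pattern uses the IUPAC codes N (any base) and H (A, C or T) and that a context must have the same length as the pattern to match. The names `matches` and `count_matching` are hypothetical.

```python
# IUPAC codes assumed for context patterns: N = any base, H = A, C or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "H": "ACT"}


def matches(context, mc_type):
    """Return True when every base of the context fits the pattern code at that position."""
    if len(context) != len(mc_type):
        return False
    return all(base in IUPAC.get(code, "") for base, code in zip(context, mc_type))


def count_matching(lines, mc_type):
    """Count tab-separated allc rows whose 4th column matches the pattern."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and matches(fields[3], mc_type):
            total += 1
    return total
```

Under these assumptions `matches("CGC", "CGN")` is true and `matches("CAG", "CGN")` is false. A five-base context such as ACCGC matches no three-base pattern, which is consistent with the context-format problem discussed earlier in the thread.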

Do you also get the same error by running this?

methylpy DMRfind
--allc-files all_files/allc_ARS-UCD1_CTRL1.tsv.gz all_files/allc_ARS-UCD1_CTRL2.tsv.gz
--samples ARS-UCD1_CTRL1 ARS-UCD1_CTRL2
--mc-type "CAG"
--chroms NC_037328.1
--num-procs 64
--output-prefix DMR_CTRL1_CTRL2

yupenghe avatar Jun 22 '21 18:06 yupenghe