methylKit icon indicating copy to clipboard operation
methylKit copied to clipboard

Improve methRead input validation

Open MjelleLab opened this issue 4 years ago • 7 comments

I have a methRead object that is further merged using unite. It consists of 5 samples. When running clusterSamples, I get the following error:

clusterSamples(Unite.object, dist="correlation", method="ward.D", plot=TRUE,) Error in quantile.default(sds, sd.threshold) : missing values and NaN's not allowed if 'na.rm' is FALSE

MjelleLab avatar Dec 17 '19 15:12 MjelleLab

Hi @biocoreexpert,

could you please send a minimal reproducible code example? Otherwise we can't figure out what is wrong internally. Best,Alex

alexg9010 avatar Dec 17 '19 15:12 alexg9010

new("methylBase", .Data = list(c(1L, 1L, 1L, 1L, 1L, 1L), c(9L, 13L, 14L, 17L, 19L, 20L), c(9L, 13L, 14L, 17L, 19L, 20L), structure(c(1L, 2L, 2L, 2L, 2L, 2L), .Label = c("-", "+"), class = "factor"), c(7L, 10L, 12L, 18L, 22L, 22L), c(0, 0, 0, 1, 2, 1), c(0, 1, 1, 2, 3, 4), c(5L, 12L, 14L, 25L, 28L, 28L), c(0, 0, 1, 4, 2, 4), c(0, 1, 1, 3, 6, 4), c(3L, 5L, 6L, 12L, 14L, 14L ), c(0, 0, 0, 0, 1, 0), c(0, 0, 0, 1, 1, 2), c(6L, 12L, 12L, 16L, 23L, 23L), c(0, 0, 0, 1, 2, 3), c(0, 1, 1, 2, 3, 3), c(5L, 4L, 8L, 14L, 17L, 17L), c(0, 0, 0, 1, 2, 2), c(0, 0, 1, 1, 1, 1)), sample.ids = c("sample1", "sample2", "sample3", "sample4", "sample5"), assembly = "tair9", context = "CpG", treatment = c(1, 1, 1, 1, 1), coverage.index = c(5, 8, 11, 14, 17), numCs.index = c(6, 9, 12, 15, 18), numTs.index = c(7, 10, 13, 16, 19), destranded = FALSE, resolution = "base", names = c("chr", "start", "end", "strand", "coverage1", "numCs1", "numTs1", "coverage2", "numCs2", "numTs2", "coverage3", "numCs3", "numTs3", "coverage4", "numCs4", "numTs4", "coverage5", "numCs5", "numTs5"), row.names = 1:6, .S3Class = "data.frame")

MjelleLab avatar Dec 17 '19 15:12 MjelleLab

Thanks for the hint, I can reproduce the error with the data you provided and I will work on a check to catch these cases that we are missing.

But, I still assume that something went wrong in the way you read in your data.

Because if you look at the object that has been created after uniting, than you might see that for every sample the sum of numCs and numTs does not add up to the coverage value. This is quite unusual as by default rows with coverage less than 5 reads will be filtered out when using methRead.

methylBase object with 6 rows
--------------
  chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
1   1     9   9      -         7      0      0         5      0      0         3      0      0         6      0      0         5      0      0
2   1    13  13      +        10      0      1        12      0      1         5      0      0        12      0      1         4      0      0
3   1    14  14      +        12      0      1        14      1      1         6      0      0        12      0      1         8      0      1
4   1    17  17      +        18      1      2        25      4      3        12      0      1        16      1      2        14      1      1
5   1    19  19      +        22      2      3        28      2      6        14      1      1        23      2      3        17      2      1
6   1    20  20      +        22      1      4        28      4      4        14      0      2        23      3      3        17      2      1
--------------
sample.ids: sample1 sample2 sample3 sample4 sample5 
destranded FALSE 
assembly: tair9 
context: CpG 
treament: 1 1 1 1 1 
resolution: base 

alexg9010 avatar Dec 17 '19 17:12 alexg9010

I notices and error in the input reading using methRead. The numCs and numTs are not correct, and do not match what is in the original files. Do you know why? I am using this function to read the files:

methRead(location = list("C_S3.filter.csv","SL_B1.filter.csv","SL_B4.filter.csv","SL_S2.filter.csv","SL_S5.filter.csv"), sample.id = list("sample1","sample2","sample3","sample4","sample5"), assembly = "tair9", resolution = "base", treatment=c(1,1,1,1,1), mincov = 0 )

First file looks like this:

chrBase chr base Strand coverage freqC freqT
Contig11 1 1 F 0 0 0
Contig12 1 2 F 0 0 0
Contig14 1 4 R 1 0 1
Contig19 1 9 R 5 0 5
Contig110 1 10 F 7 2 5
Contig111 1 11 F 11 2 9
Contig113 1 13 F 12 3 9
Contig114 1 14 F 14 5 9
Contig117 1 17 F 25 14 11
Contig119 1 19 F 28 6 22
Contig120 1 20 F 28 14 14
Contig121 1 21 F 28 7 21
Contig122 1 22 F 26 5 21

On Tue., 17 Dec. 2019, 18:00 Alexander Blume (Gosdschan), < [email protected]> wrote:

Thanks for the hint, I can reproduce the error with the data you provided and I will work on a check to catch these cases that we are missing.

But, I still assume that something went wrong in the way you read in your data.

Because if you look at the object that has been created after uniting, than you might see that for every sample the sum of numCs and numTs does not add up to the coverage value. This is quite unusual as by default rows with coverage less than 5 reads will be filtered out when using methRead.

methylBase object with 6 rows

chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5 1 1 9 9 - 7 0 0 5 0 0 3 0 0 6 0 0 5 0 0 2 1 13 13 + 10 0 1 12 0 1 5 0 0 12 0 1 4 0 0 3 1 14 14 + 12 0 1 14 1 1 6 0 0 12 0 1 8 0 1 4 1 17 17 + 18 1 2 25 4 3 12 0 1 16 1 2 14 1 1 5 1 19 19 + 22 2 3 28 2 6 14 1 1 23 2 3 17 2 1 6 1 20 20 + 22 1 4 28 4 4 14 0 2 23 3 3 17 2 1

sample.ids: sample1 sample2 sample3 sample4 sample5 destranded FALSE assembly: tair9 context: CpG treament: 1 1 1 1 1 resolution: base

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/al2na/methylKit/issues/174?email_source=notifications&email_token=AHLPZ6UDTWXFAFB35HMLRVTQZEAUNA5CNFSM4J35LEP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHDHHZA#issuecomment-566653924, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHLPZ6WG3NH2GUUUF2X4USDQZEAUNANCNFSM4J35LEPQ .

MjelleLab avatar Dec 17 '19 17:12 MjelleLab

Seems the error is that I dont have ratios for the freqC/freqT.

MjelleLab avatar Dec 17 '19 22:12 MjelleLab

Yes, you are right, this is what is causing your problem. An easy fix would be to read every file in and replace the value of the freqC/T column with freqX/coverage for C and T respectively. And then write that out again to a file and start fresh with methRead.

alexg9010 avatar Dec 18 '19 14:12 alexg9010

I need to enhance input validation for methRead

  • [ ] add check whether numCs + numTs = coverage
  • [ ] profile how much time above check takes

Maybe we can provide an option for the generic format of methRead to indicate numbers instead of frequencies.

alexg9010 avatar Apr 07 '20 12:04 alexg9010