readBed parsing failure

Open socanas opened this issue 5 years ago • 8 comments

I have been using the readBed function of genomation to read BED files into R for use in enriched heatmaps. Recently I have been getting a "parsing failures" warning, e.g.:

Warning: 62474 parsing failures.
 row col               expected actual                                                  file
 199  X4 no trailing characters  .3333 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
 430  X4 no trailing characters  .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
1046  X4 no trailing characters  .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'

Any number in the BED file that has a decimal part is changed to NA in the GRanges object. I do not want to truncate or round these values. I am sure that I am missing something simple. I have used the readBed function on similar files and have never had an issue. Any suggestions?

Thanks!

socanas avatar Jun 04 '19 13:06 socanas

This could be due to changes in the dependencies we use to read the files faster. If you send a reproducible example, I will take a look.

al2na avatar Jun 04 '19 13:06 al2na

Thank you for your quick reply! The file test1.txt gives the parsing error and test2.txt does not give the parsing error. Both files were formatted with the same scripts. Columns are "chr, start, stop, methylation %, coverage, strand (+/-)".

test1<-readBed("test1.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_character()
)
Warning: 12 parsing failures.
 row col               expected actual        file
 199  X4 no trailing characters  .3333 'test1.txt'
 430  X4 no trailing characters  .6667 'test1.txt'
1046  X4 no trailing characters  .6667 'test1.txt'
1127  X4 no trailing characters  .3333 'test1.txt'
1199  X4 no trailing characters  .6667 'test1.txt'
.... ... ...................... ...... ...........
See problems(...) for more details.

test2<-readBed("test2.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_character()
)

test1.txt test2.txt

socanas avatar Jun 04 '19 14:06 socanas

Hi @socanas, test1.txt gives the parsing error and test2.txt does not because readBed uses the first 30 rows to detect the classes of the columns (character, integer, decimal number, etc.). In test1.txt the 4th column has no decimal number in the first 30 rows, whereas in test2.txt it does. For now I would just add .0 to one of the first 30 numbers in that column, e.g. the first one:

chr1 3000827 3000827 100.0 1 +
chr1 3001007 3001007 100 1 +
chr1 3001018 3001018 100 2 +
chr1 3001019 3001019 100 1 -
chr1 3003339 3003339 100 1 +
chr1 3003340 3003340 100 1 -
chr1 3003380 3003380 100 1 -
chr1 3003582 3003582 100 2 +
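
The effect of guessing column classes from only the first 30 rows can be reproduced with a small sketch in base R. This is not genomation's actual code: the file contents are synthetic and `read.table` merely stands in for whatever reader is used internally. It only shows that a column looks like an integer when the first decimal value appears on row 31.

```r
# Synthetic BED-like file (assumption: made-up positions, not the poster's data):
# 30 rows whose 4th column is the integer 100, then one row with a decimal value.
pos <- 3000000L + 1:30
lines <- c(
  sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos),
  "chr1\t3000031\t3000031\t33.3333\t3\t-"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Column classes inferred from the first 30 rows only
head_classes <- sapply(
  read.table(tmp, sep = "\t", nrows = 30, stringsAsFactors = FALSE),
  class
)

# Column classes inferred from the whole file
full_classes <- sapply(
  read.table(tmp, sep = "\t", stringsAsFactors = FALSE),
  class
)

head_classes[["V4"]]  # "integer"
full_classes[["V4"]]  # "numeric"
```

If the class guessed from the first 30 rows is then imposed on the full file, a value such as 33.3333 no longer fits the declared type, which matches the "no trailing characters" failures reported above.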

cheers, Kasia

katwre avatar Jun 05 '19 15:06 katwre

Hmm, I've also had this issue in the past. Maybe rewriting the function that reads BED files https://github.com/BIMSBbioinfo/genomation/blob/master/R/readData.R#L68 to use e.g. data.table::fread instead of readr::read_delim could be an option. @al2na, what do you think? Right now it's a bit dangerous, because readTableFast might truncate decimal numbers if, e.g., the first 30 rows contain only integers and the 31st is a decimal number.
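
Independently of which reader is used, one way to avoid the guessing step altogether is to declare the column types up front. The following is a minimal base R sketch, not a patch for genomation: the data are synthetic and the column names (`chr`, `start`, `end`, `meth`, `coverage`, `strand`) are assumptions based on the column description given earlier in this thread.

```r
# Synthetic BED-like file: integers in column 4 for 30 rows, then a decimal.
pos <- 3000000L + 1:30
lines <- c(
  sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos),
  "chr1\t3000031\t3000031\t33.3333\t3\t-"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Declare every column class explicitly so nothing is inferred from the data.
# Column names are hypothetical labels chosen for this example.
bed <- read.table(
  tmp,
  sep = "\t",
  header = FALSE,
  stringsAsFactors = FALSE,
  col.names  = c("chr", "start", "end", "meth", "coverage", "strand"),
  colClasses = c("character", "integer", "integer",
                 "numeric", "integer", "character")
)

# The decimal value on row 31 is kept instead of becoming NA.
bed$meth[31]
```

A data frame read this way could then be converted to a GRanges object, for example with `GenomicRanges::makeGRangesFromDataFrame(bed, keep.extra.columns = TRUE)`, assuming the GenomicRanges package is installed.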

katwre avatar Jun 05 '19 16:06 katwre

@katwre Thank you for the solution! That works great!

socanas avatar Jun 05 '19 16:06 socanas

Thank you @katwre!! I think we went with read_delim because it could read gzipped files at the time, but now data.table::fread can also read gzipped files without piping, afaik. If that's the case, we can use fread.

al2na avatar Jun 05 '19 17:06 al2na

@al2na ah true! But it looks like fread reads gzipped files now too, so it could work:

> data.table::fread("test2.txt.gz")
        V1      V2      V3  V4 V5 V6
   1: chr1 3001630 3001630 100  2  -
   2: chr1 3003227 3003227 100  1  -
   3: chr1 3003340 3003340 100  2  -
   4: chr1 3003380 3003380   0  1  -
   5: chr1 3003582 3003582 100  1  +
  ---                               
1996: chr1 3670743 3670743   0  1  -
1997: chr1 3670752 3670752   0  1  -
1998: chr1 3670776 3670776   0  2  +
1999: chr1 3670821 3670821   0  1  +
2000: chr1 3670861 3670861   0  1  +
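
Whether fread also copes with the original problem (integers in the first 30 rows, a decimal afterwards) can be checked with a small sketch. It assumes the data.table package is installed and uses a synthetic, uncompressed file; it does not exercise gzip reading or any genomation code.

```r
library(data.table)

# Synthetic BED-like file: 30 integer rows in column 4, then one decimal value.
pos <- 3000000L + 1:30
lines <- c(
  sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos),
  "chr1\t3000031\t3000031\t33.3333\t3\t-"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Read without a header; fread names the columns V1..V6 by default.
dt <- fread(tmp, sep = "\t", header = FALSE)

class(dt$V4)     # "numeric"
tail(dt$V4, 1)   # the decimal value from row 31, not NA
```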

katwre avatar Jun 05 '19 17:06 katwre

Then we can change it to fread :)

al2na avatar Jun 05 '19 21:06 al2na