Parsing input error

Open Ge0rges opened this issue 1 year ago • 10 comments

Hi @ArtRand,

The following command, which I believe I ran on identical files in the past (perhaps with 0.3.0), now produces the error below:

modkit dmr multi \
  -s methylation_10/brevundimonas_r-contigs/barcode01.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode02.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode03.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode05.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode06.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode07.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode08.bed.gz top \
  -s methylation_10/brevundimonas_r-contigs/barcode09.bed.gz middle \
  -s methylation_10/brevundimonas_r-contigs/barcode10.bed.gz bottom \
  -s methylation_10/brevundimonas_r-contigs/barcode11.bed.gz barcode11 \
  -s methylation_10/brevundimonas_r-contigs/barcode12.bed.gz barcode12 \
  -s methylation_10/brevundimonas_r-contigs/barcode13.bed.gz barcode13 \
  -s methylation_10/brevundimonas_r-contigs/barcode14.bed.gz barcode14 \
  -r methylation_10/brevundimonas_r-contigs/gene-coordinates.txt \
  -o methylation_10/brevundimonas_r-contigs/dmr_by_gene/ \
  -t 20 \
  --ref mags/brevundimonas_r-contigs.fna \
  --base C \
  --base A \
  --min-valid-coverage 10

Error: > Error! Parsing Error: Error { input: "\t\t", code: Many1 }

Is this due to a change or formatting problem in my input files that I might have missed, or does it seem like a bug in modkit? The error is a bit mysterious.

Ge0rges avatar Jul 15 '24 15:07 Ge0rges

@Ge0rges,

I agree, the parsing errors should be more informative. I'll fix that.

Could you tell me which version of modkit you used to generate the input data (the pileups)? Also could you attach or paste the gene-coordinates.txt file? (email is also fine).

ArtRand avatar Jul 15 '24 15:07 ArtRand

I used 0.3.1. Also, the gene-coordinates file is the issue: I just looked at it and it's not normal. Guess that was the problem! I'll fix it and confirm.

Ge0rges avatar Jul 15 '24 15:07 Ge0rges

Seems like that fixed it, @ArtRand. Next time I'll review my input files instead of trusting the script! Sneaky updates sneak past me...

Ge0rges avatar Jul 15 '24 17:07 Ge0rges

@Ge0rges I'm going to re-open this issue to track work for better error messages when input fails to parse. Some other users have encountered the same error and it's not clear enough what the problem is.

ArtRand avatar Jul 15 '24 22:07 ArtRand

Hi @ArtRand,

I've also encountered a parsing error. I'm trying to run the script below, using the regions.bed.gz files output by wf_human_variation with --mod as input. I have also tried with the wf_mods.bedmethyl.gz.

For -r/--regions-bed, I downloaded the NCBI RefSeq track in BED format.

# Define variables for paths
REF="/projects/health_sciences/oms/pathology/powry48p/202404ONT/reference/ref_genome/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
OUT_DIR="/weka/powry48p/results/modkit_output/"

# Run modkit dmr
./modkit dmr multi \
  -s barcode17.regions.bed.gz Tri102_1 \
  -s barcode19.regions.bed.gz Tri102_2 \
  -s barcode21.regions.bed.gz Tri103_1 \
  -s barcode23.regions.bed.gz Tri103_2 \
  -o $OUT_DIR \
  -r refseq.bed \
  --ref $REF \
  -m C \
  --log-filepath dmr_multi.log

Error:

error fetching line from regions BED, stream did not contain valid UTF-8
error fetching line from regions BED, stream did not contain valid UTF-8
Error! Parsing Error: Error { input: "= {", code: Digit }

Any tips would be appreciated, thanks!

Rpowellnz avatar Jul 31 '24 05:07 Rpowellnz

Hello @Rpowellnz,

Could you tell me what

$ head -n 5 refseq.bed 

looks like?

ArtRand avatar Aug 01 '24 04:08 ArtRand

Hi @ArtRand,

The output from $ head -n 5 refseq.bed is below; I'm guessing it is not correctly formatted. Could you provide some guidance on how to generate the appropriate .bed file for -r for a genome-wide differential methylation analysis of protein-coding genes?

bplist00�_WebMainResource�

_ebResourceTextEncodingName_WebResourceData_WebResourceMIMEType_WebResourceFrameName^WebResourceURLUUTF-8O�S

chr1	201283451	201332993	NM_000299	0	+	201283702	201328836	0	15	453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920,	0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
chr1	67092165	67134970	NM_001276351	0	-	67093004	67127240	0	8	1439,187,70,113,158,92,86,41,	0,3069,4086,23186,33586,35000,38976,42764,
chr1	201283505	201332989	NM_001005337	0	+	201283702	201328836	0	14	399,104,395,145,208,178,115,156,177,154,187,85,107,2916,	0,10436,29660,33047,34066,35112,36761,38472,39507,40922,41435,42248,45256,46568,
chr1	67092165	67134970	NM_001276352	0	-	67093579	67127240	0	9	1439,70,145,68,113,158,92,86,41,	0,4086,11072,19411,23186,33586,35000,38976,42764,

Rpowellnz avatar Aug 01 '24 20:08 Rpowellnz

Hello @Rpowellnz,

You certainly need to remove any of those HTML tags at the start. The BED file should be a plain text file with 3 or 4 tab-separated fields: chrom, start, end, <name> (<name> is optional). You should also remove those blank lines.

ArtRand avatar Aug 01 '24 21:08 ArtRand
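
The format described above (plain text, 3 or 4 tab-separated fields, no blank lines) can be checked before running modkit. The following is only a minimal sketch, not modkit's own parser: the function name `check_regions_bed` and the exact wording of the reported problems are hypothetical, and the rule that start and end must be non-negative integers is an assumption based on the usual BED convention.

```python
from typing import List, Tuple


def check_regions_bed(data: bytes) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for a regions BED given as raw bytes.

    Hypothetical helper: it mirrors the description in this thread
    (3 or 4 tab-separated fields, no blank lines, plain UTF-8 text),
    not modkit's actual parsing rules.
    """
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(data.splitlines(), start=1):
        # Binary headers (for example a saved web archive) fail to decode here.
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((number, "not valid UTF-8"))
            continue
        if text.strip() == "":
            problems.append((number, "blank line"))
            continue
        fields = text.split("\t")
        if len(fields) not in (3, 4):
            problems.append(
                (number, f"expected 3 or 4 tab-separated fields, found {len(fields)}")
            )
            continue
        if any(field == "" for field in fields):
            problems.append((number, "empty field"))
            continue
        # Assumption: start and end are plain non-negative integers.
        for field in fields[1:3]:
            if not (field.isascii() and field.isdigit()):
                problems.append((number, "start and end must be non-negative integers"))
                break
    return problems
```

Reading the file in binary mode, for example `check_regions_bed(open("refseq.bed", "rb").read())`, keeps the UTF-8 check meaningful; an empty list means no problems were found under these assumptions.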

Hi @ArtRand

I removed the HTML tags, so now $ head -n 5 refseq1.bed produces the output below.

chr1 201283451 201332993 NM_000299 0 + 201283702 201328836 0 15 453,104,395,145,208,178,63,115,156,177,154,187,85,107,2920, 0,10490,29714,33101,34120,35166,36364,36815,38526,39561,40976,41489,42302,45310,46622,
chr1 67092165 67134970 NM_001276351 0 - 67093004 67127240 0 8 1439,187,70,113,158,92,86,41, 0,3069,4086,23186,33586,35000,38976,42764,
chr1 201283505 201332989 NM_001005337 0 + 201283702 201328836 0 14 399,104,395,145,208,178,115,156,177,154,187,85,107,2916, 0,10436,29660,33047,34066,35112,36761,38472,39507,40922,41435,42248,45256,46568,
chr1 67092165 67134970 NM_001276352 0 - 67093579 67127240 0 9 1439,70,145,68,113,158,92,86,41, 0,4086,11072,19411,23186,33586,35000,38976,42764,
chr1 67092165 67134970 NR_075077 0 - 67134970 67134970 0 10 1439,70,145,68,143,113,158,92,86,41, 0,4086,11072,19411,21448,23186,33586,35000,38976,42764,

Running modkit dmr as below still produces the error:

./modkit dmr multi \
  -s barcode17.regions.bed.gz Tri102_1 \
  -s barcode19.regions.bed.gz Tri102_2 \
  -s barcode21.regions.bed.gz Tri103_1 \
  -s barcode23.regions.bed.gz Tri103_2 \
  -o $OUT_DIR \
  -r refseq1.bed \
  --ref $REF \
  -m C \
  --log-filepath dmr_multi.log

Error! Parsing Error: Error { input: "= {", code: Digit }

Rpowellnz avatar Aug 01 '24 22:08 Rpowellnz
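
Since the error quotes the text that failed to parse (`= {` here), one way to narrow down the offending input is to search each file passed to the command for that fragment. This is only a rough sketch under assumptions: the helper name `find_fragment` is hypothetical, it assumes the quoted `input` appears verbatim in one of the files, and it treats any path ending in `.gz` as gzip-compressed text.

```python
import gzip
from typing import Iterable, List, Tuple


def find_fragment(paths: Iterable[str], fragment: str) -> List[Tuple[str, int]]:
    """Return (path, line_number) for the first line in each file containing fragment.

    Hypothetical helper for tracking down which input a parser error refers to.
    """
    hits: List[Tuple[str, int]] = []
    for path in paths:
        # Assumption: a ".gz" suffix means gzip-compressed text.
        opener = gzip.open if str(path).endswith(".gz") else open
        # errors="replace" keeps the scan going past bytes that are not UTF-8.
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if fragment in line:
                    hits.append((str(path), number))
                    break
    return hits
```

For the command above, a call such as `find_fragment(["refseq1.bed", "barcode17.regions.bed.gz"], "= {")` would list the files (and first matching line) that contain the fragment, if any do.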

@Rpowellnz The latest version will report which file is failing to parse. Could you confirm that it's an issue with the argument to -r (the regions file)?

ArtRand avatar Aug 27 '24 14:08 ArtRand