
Not out of the woods yet: uninitialized values, can't open TGA.mat, and more.

Open JohnUrban opened this issue 1 year ago • 4 comments

Hi again,

I am starting a new issue because this is technically a new problem, but it is still part of the journey documented in two recent, now-monstrous threads:

  • #577
  • #582

I am now trying to get Braker3 to run 3 ways:

  1. Provide Fastq and let Braker3 do all the work.
  2. Provide BAMs produced by HiSat2:
hisat2 -x ${HISAT2IDX} -U ${FQ} --rna-strandness R --dta -p 16 | samtools view -bh -F 4 | samtools sort --threads 16 > hisat2-stranded.bam
  3. Provide BAMs produced by STAR:
STAR --outSAMattributes All --outSAMstrandField intronMotif --twopassMode Basic --genomeDir ${STARIDX} --runThreadN ${THREADS} --readFilesIn ${FQ} --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate

For both HiSat2 and STAR BAMs, I've tried both providing the entire BAM:

braker.pl --genome=${ASM} --bam=${RNA} --prot_seq=${PROTEINS} --workingdir=braker3 --threads=16

Or strand-separated BAMs (this is strand-specific RNA-seq data).

braker.pl --genome=${ASM} --stranded=+,- --bam=${FWD},${REV} --prot_seq=${PROTEINS} --workingdir=braker3 --threads=16

All these seemed to work on toy experiments (a single ~1 Mb contig, 10% of the RNA-seq data, all the protein data). There were no error messages in the GeneMark.std* files.

However, they are raising some messages in the GeneMark.std* files on the full-fledged runs using the 500 Mb genome, 100% RNA-seq data (>1 billion reads), and all protein data (sanitized ODBv11 and some other stuff mixed in).

Approaches 1 and 2 above (fastq or hisat2 bam) give the same error message in GeneMark.stderr (stdout file seems fine) -- example is specifically from the Fastq approach:

FASTA index file /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3-odb11-fq/unstranded/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
07-Mar-23 23:04:17 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
07-Mar-23 23:04:17 - INFO: Running prediction with masking penalty = 0
07-Mar-23 23:17:09 - INFO: Running prediction with masking penalty = 0.04
07-Mar-23 23:30:03 - INFO: Running prediction with masking penalty = 0.08
07-Mar-23 23:42:52 - INFO: Running prediction with masking penalty = 0.12
07-Mar-23 23:55:33 - INFO: Running prediction with masking penalty = 0.16
08-Mar-23 00:08:35 - INFO: Running prediction with masking penalty = 0.2
08-Mar-23 00:21:22 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.08 with step 0.02
08-Mar-23 00:21:22 - INFO: Running prediction with masking penalty = 0.02
08-Mar-23 00:34:02 - INFO: Running prediction with masking penalty = 0.06
08-Mar-23 00:46:48 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0.02 to 0.06 with step 0.01
08-Mar-23 00:46:48 - INFO: Running prediction with masking penalty = 0.03
08-Mar-23 00:59:32 - INFO: Running prediction with masking penalty = 0.05
08-Mar-23 01:12:20 - INFO: Selected baseline penalty for the maximum # of correct reliable predictions: 0.04
08-Mar-23 01:12:20 - INFO: Running prediction with masking penalty = 0.1
08-Mar-23 01:25:02 - INFO: Running prediction with masking penalty = 0.11
08-Mar-23 01:37:47 - INFO: Masking penalty was set to 0.11
substr outside of string at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 681, <$PARSE> line 32195.
Use of uninitialized value $value in substitution (s///) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 682, <$PARSE> line 32195.
Use of uninitialized value $value in substitution (s///) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 683, <$PARSE> line 32195.
Use of uninitialized value $value in concatenation (.) or string at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 685, <$PARSE> line 32195.
substr outside of string at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 681, <$PARSE> line 240316.
Use of uninitialized value $value in substitution (s///) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 682, <$PARSE> line 240316.
Use of uninitialized value $value in substitution (s///) at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 683, <$PARSE> line 240316.
Use of uninitialized value $value in concatenation (.) or string at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/parse_set.pl line 685, <$PARSE> line 240316.

The toy experiments had neither the "INFO"/masking-penalty messages nor the "uninitialized value" messages.

Despite the messages in the file, the run finished, and there are ~23k genes in the Braker GTF. Nonetheless, I don't know what to think about these messages. Do they have any bearing on the final GTF?
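To gauge how widespread these warnings are, it can help to collapse them by message and script line. The sketch below assumes a shell with grep, sed, sort, and uniq available; `summarize_warnings` is a hypothetical helper name, and the log path is whichever GeneMark stderr file the run produced.

```shell
# Hypothetical helper: count Perl warnings in a GeneMark stderr log,
# grouped by message text and script line.
# Usage: summarize_warnings /path/to/GeneMark.stderr
summarize_warnings() {
    # Keep only the two warning types seen in the log excerpt.
    grep -E '^(substr outside of string|Use of uninitialized value)' "$1" \
        | sed -E 's/, <\$PARSE> line [0-9]+\.$//' \
        | sed -E 's# at /.*/([^/ ]+) line ([0-9]+)# at \1 line \2#' \
        | sort | uniq -c | sort -rn
}
```

The first sed drops the trailing `<$PARSE> line N.` suffix so that repeats of the same warning group together; the second shortens the absolute script path to its file name. Applied to the eight warning lines quoted above, this yields four distinct messages, each seen twice.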

For approach 3 (STAR BAMs), again the toy experiments went fine, but the full-fledged version raised some messages in both the stderr and stdout files for GeneMark. Here is an example of GeneMark.stderr:

FASTA index file /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/star/fr/braker3/GeneMark-ETP/data/genome.softmasked.fasta.fai created.
error on open file /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/build_mod.pl: TGA.mat
sed: can't read output.mod: No such file or directory
10-Mar-23 03:18:20 - INFO: Finding masking penalty maximizing the number of correctly predicted reliable exons in range from 0 to 0.2 with step 0.04
10-Mar-23 03:18:20 - INFO: Running prediction with masking penalty = 0
error: Program exited due to an error in command: /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/gmes_petap.pl --seq /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/star/fr/braker3/GeneMark-ETP/proteins.fa/penalty/contigsblr6w09d.fasta --soft_mask 1000 --max_mask 40000  --predict_with /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/star/fr/braker3/GeneMark-ETP/proteins.fa/model/output.mod --cores 16 --mask_penalty 0
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Traceback (most recent call last):
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 353, in <module>
    main()
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 289, in main
    candidates = loadIntrons(args.genemark)
  File "/central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/printRnaAlternatives.py", line 193, in loadIntrons
    for row in csv.reader(open(inputFile), delimiter='\t'):
FileNotFoundError: [Errno 2] No such file or directory: 'pred_m/genemark.gtf'
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
error on open file /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/gmes/build_mod.pl: TGA.mat
sed: can't read output.mod: No such file or directory
error, file not found: option --f1 prothint/prothint.gff
grep: prothint/evidence.gff: No such file or directory
grep: prothint/evidence.gff: No such file or directory
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.
Died at /central/groups/carnegie_poc/jurban/software/braker2/braker3/deps/genemark-etp/alt/ETP/bin/format_back.pl line 14.

And here is an example of GeneMark.stdout from the STAR bam run:


14169	14169	0	100.00	complete.gtf
14169	14169	0	100.00	complete.gtf

# from file complete.id parsed IDs: 14170
# not found in input: 1
# done
primary_contig_1
primary_contig_2
....
....
primary_contig_182
primary_contig_184
# from file training.list parsed IDs: 12259
# not found in input: 0
# done
error: all sequences not of same lengthon line 144: GTC


31286	27688	3598	88.50	p_hints_nonhc.gtf
412347	27688	384659	6.71	r_hints_nonhc.gtf

# number of transcripts in file: 14169 /central/groups/carnegie_poc/jurban/data/coral/combined-nanopore/annotation/canu_primary/04-braker3/star/fr/braker3/GeneMark-ETP/proteins.fa/genemark_supported.gtf
# number of genes in set: 12259
# removed partial: 0
# genes found for training: 12259
error: all sequences not of same lengthon line 690: GTC

Note that all "toy" and "full-fledged jobs" were launched on SLURM in identical ways except for resources allotted. The toy jobs were given 8 tasks and 50 GB RAM. For the "full-fledged jobs", all were given 16 tasks and 100 GB RAM:

sbatch --time=72:00:00 --mem=100G --nodes=1 --ntasks=16 -J ${JOBNAME} -o slurm-${JOBNAME}-%A.out --export=ALL ${SCRIPT}

So I do wonder if I need to allocate more RAM. That would be an easy solution. Nonetheless, I am skeptical, because if the RAM limit had been exceeded, my jobs would simply have been killed.
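One way to check the RAM question directly is to ask SLURM's accounting records what a finished job actually used. This is a sketch, assuming job accounting is enabled on the cluster and `sacct` is on the PATH; `job_mem_report` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: show requested vs. peak memory for a finished SLURM job.
# ReqMem is the memory requested; MaxRSS is the peak resident set size
# recorded per job step.
# Usage: job_mem_report <jobid>
job_mem_report() {
    sacct -j "$1" --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed
}
```

If MaxRSS stays well below ReqMem and the State column does not report an out-of-memory condition, memory is probably not the cause. MaxRSS is sampled periodically, so a short spike could in principle be missed.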

Any thoughts, suggestions, and/or advice welcome.

Best,

John

JohnUrban • Mar 10 '23 16:03

The STAR BAM error(s) may be related to issue #588

This is based on:

prothint/evidence.gff: No such file or directory

JohnUrban • Mar 10 '23 20:03

For the STAR BAM problem, I found some more hints as to what may be going on.

./braker3/GeneMark-ETP/proteins.fa/nonhc/for_prothint/loginfo

# check before the run
error, numerical value is expected for mask_penalty: error,

./braker3/GeneMark-ETP/proteins.fa/nonhc/prothint/log

error: File "../for_prothint/genemark.gtf" was not found.

./braker3/GeneMark-ETP/proteins.fa/nonhc/pred_m/loginfo

error, numerical value is expected for mask_penalty: error,
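Because these messages are scattered across per-step log files, a recursive search can surface them in one pass. This is a minimal sketch, assuming the step logs are named `loginfo` or `log` as in the three paths above; `find_etp_errors` is a hypothetical helper, not part of BRAKER or GeneMark-ETP.

```shell
# Hypothetical helper: print every line starting with "error" (case-insensitive)
# from files named "loginfo" or "log" under the given directory,
# prefixed with the file name and line number.
# Usage: find_etp_errors ./braker3/GeneMark-ETP
find_etp_errors() {
    find "$1" -type f \( -name loginfo -o -name log \) \
        -exec grep -Hin '^error' {} +
}
```

Sorting the hits by file modification time may help identify which step failed first, since the later errors look like consequences of an earlier missing file.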

JohnUrban • Mar 10 '23 20:03

Comment on RAM usage: when run with 32 threads, my runs used at most 15 GB of RAM despite being allocated much more. They did complete, but all 18 of them produced the same errors mentioned in the first post of this thread, on genomes smaller than 100 Mb.

webbchen • Apr 20 '23 15:04

@KatharinaHoff Just to support @JohnUrban's observations: I get the same set of errors in GeneMark.stderr even when using hisat2 (v2.2.1) instead of STAR. I have a mixture of single-end and paired-end BAMs.

alephreish • May 03 '23 12:05