
Alevin workflow cannot handle gzipped genomes or gtf files

Open grst opened this issue 3 years ago • 1 comments

Tested on DSL2 version in dev.

Either update modules to handle gzip files, or include an optional gunzip process.

grst, Dec 30 '21 10:12

Superseded by #78

grst, Jun 17 '22 15:06

@grst: I think this issue may still persist. I just tried to run version 2.1.0 of the pipeline with gzip-compressed genome FASTA and GTF files:

[Truncated nextflow console output]

Command executed:
  filter_gtf_for_genes_in_genome.py \
      --gtf gencode.v42.primary_assembly.basic.annotation.gtf.gz \
      --fasta GRCh38.primary_assembly.genome.fa.gz \
      -o GRCh38.primary_assembly.genome.fa_genes.gtf
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SCRNASEQ:SCRNASEQ:GTF_GENE_FILTER":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS
Command exit status:
  1
Command output:
  (empty)
Command error:
  Traceback (most recent call last):
    File "/root/nextflow-bin/filter_gtf_for_genes_in_genome.py", line 82, in <module>
      extract_genes_in_genome(args.fasta, args.gtf, args.output)
    File "/root/nextflow-bin/filter_gtf_for_genes_in_genome.py", line 43, in extract_genes_in_genome
      seq_names_in_genome = set(extract_fasta_seq_names(fasta))
    File "/root/nextflow-bin/filter_gtf_for_genes_in_genome.py", line 34, in extract_fasta_seq_names
      for i, header in enumerate(faiter):
    File "/root/nextflow-bin/filter_gtf_for_genes_in_genome.py", line 32, in <genexpr>
      faiter = (x[1] for x in groupby(fh, is_header))
    File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
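
The byte `0x8b` at position 1 matches the second byte of the gzip magic number (`1f 8b`), which suggests the script is reading the compressed FASTA as plain UTF-8 text. One possible workaround, shown here as a minimal sketch and not as the pipeline's actual code, is to check the first two bytes and open the file through `gzip` when they match. The helper names `open_maybe_gzip` and `fasta_seq_names` are hypothetical and only loosely mirror what `extract_fasta_seq_names` appears to do.

```python
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open a plain-text or gzip-compressed file for reading as text.

    Hypothetical helper: detection is based on the magic bytes,
    not on the file extension.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def fasta_seq_names(path):
    """Yield the sequence name (first word after '>') of each FASTA header."""
    with open_maybe_gzip(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip a bare '>' with no name
                    yield fields[0]
```

With a helper like this, the same call would work for both `genome.fa` and `genome.fa.gz`, for example `set(fasta_seq_names("GRCh38.primary_assembly.genome.fa.gz"))`. The GTF reader would need the same treatment.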

tomsing1, Nov 01 '22 21:11