CIRCexplorer2 icon indicating copy to clipboard operation
CIRCexplorer2 copied to clipboard

Annotation with 0 fusions junctions (STAR alignment)

Open andre-gabriel-42 opened this issue 7 months ago • 1 comments

Hello.. I am currently having an issue with the annotation part. I am using Linux to perform CIRCexplorer2 analysis.

After aligning everything with STAR as described by you,

for instance:

STAR --runThreadN 10 --genomeDir /mnt/work1/genodata/STAR-GRCh38.99/ --readFilesIn /mnt/work2/andre_phd/circRNA_Jan2023/raw/T18Und_circRNA_1.fastq.gz /mnt/work2/andre_phd/circRNA_Jan2023/raw/T18Und_circRNA_2.fastq.gz --readFilesCommand zcat --chimSegmentMin 10

Afterwards, I did the parsing step as follows:

CIRCexplorer2 parse -t STAR Chimeric.out.junction > CIRCexplorer2_Diff0_T18und_parse.log

and I was successuful.

Then, when I tried annotating, I used the following command and got the following mistake:

andregabriel@brutus:~/Desktop/Alignments_André_CIRCexplorer/Diff0_T18und_circRNA$ CIRCexplorer2 annotate -r /home/andregabriel/Desktop/Alignments_André_CIRCexplorer/Homo_sapiens.GRCh38.99.gtf -g /mnt/work1/genodata/STAR-GRCh38.99/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b back_spliced_junction.bed -o circularRNA_Diff0_T18und.txt > CIRCexplorer2_Diff0_T18und_annotate.log

Traceback (most recent call last):
  File "/home/andregabriel/.local/bin/CIRCexplorer2", line 8, in <module>
    sys.exit(main())
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/command_parse.py", line 50, in main
    annotate.annotate(docopt(annotate.__doc__, version=__version__),
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/helper.py", line 38, in wrapper
    fn(*args)
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/annotate.py", line 37, in annotate
    annotate_fusion(options['--ref'], options['--bed'], fusion_tmp,
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/annotate.py", line 50, in annotate_fusion
    genes, novel_genes, gene_info, chrom_info = parse_ref(ref_f, 1)
  File "/home/andregabriel/.local/lib/python3.10/site-packages/circ2/parser.py", line 76, in parse_ref
    gene_id, iso_id, chrom, strand = line.split()[:4]
ValueError: not enough values to unpack (expected 4, got 2)

While exploring the GitHub repository of the initial version of CIRCexplorer, I came across the information that for successful annotation, the GTF file needs to be of the "Gene Predictions and RefSeq Genes with Gene Names" type, essentially a GenePred file (https://genome.ucsc.edu/FAQ/FAQformat.html#format9).

According to the documentation, the expected format for the gene annotation file is as follows:

geneName: Name of the gene isoformName: Name of the isoform chrom: Reference sequence strand: + or - for the strand txStart: Transcription start position txEnd: Transcription end position cdsStart: Coding region start cdsEnd: Coding region end exonCount: Number of exons exonStarts: Exon start positions exonEnds: Exon end positions (https://circexplorer2.readthedocs.io/en/2.3.0/modules/annotate/).

As such, I used gtfToGenePred to convert the GRCh38/99 GTF file. At this time CIRCexplorer no longer generates errors when using the annotation command. However, it still fails to annotate the obtained junctions, despite not encountering any errors this time.

Annotated 0 fusion junctions!
Start to fix fusion junctions...
Fixed 0 fusion junctions!

I attempted to change the approach and tested it with the hg19 genome instead, just to check if it would work. Although I haven't had much success yet, I've outlined the following steps that I used:

So:

  1. hg19 Genome Preparation: I used STAR to index the genome. STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /mnt/work1/genodata/hg19 --genomeFastaFiles /mnt/work1/genodata/hg19/bwa_index/hg19.bowtie.fa --sjdbGTFfile /mnt/work1/genodata/hg19/IlluminaiGenomes/UCSC_genes.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 300 --outFileNamePrefix /home/andregabriel/Desktop/Alignments_André_CIRCexplorer_hg19_Mock/D17Und_Diff0/

  2. Alignment: I used STAR to align one of the samples with the newly indexed genome. STAR --runThreadN 6 --genomeDir /mnt/work1/genodata/hg19/Indexed_hg19_STAR/ --readFilesIn /mnt/work2/andre_phd/circRNA_Jan2023/raw/T17Und_circRNA_1.fastq.gz /mnt/work2/andre_phd/circRNA_Jan2023/raw/T17Und_circRNA_2.fastq.gz --readFilesCommand zcat --chimSegmentMin 10

  3. Parsing: I utilized the CIRCexplorer2 parse command to extract relevant information from the alignment results, focusing specifically on chimeric junctions. Note that for this sample when I used GRCh38, CIRCexplorer2 parse converted 156628 fusion reads, whereas with hg19, it converted 162404 fusion reads. So, parsing continues to identify fusion reads consistently. It's interesting to see the variation, but I assume this is normal. CIRCexplorer2 parse -t STAR Chimeric.out.junction > CIRCexplorer2_Diff0_T17und_parse.log

  4. gtfToGenePred: I used the gtfToGenePred command to convert the GTF file into a Gene Prediction file, which I just assume that it should facilitate the annotation of fusion junctions in the next step (5). gtfToGenePred -genePredExt -geneNameAsName2 /mnt/work1/genodata/hg19/IlluminaiGenomes/UCSC_genes.gtf UCSC_geneGTFtoGENEPRED.txt

  5. Annotation: Finally, I attempted to use the CIRCexplorer2 annotate command to annotate fusion junctions based on the fusion reads (.bed file) obtained during parsing and the gene annotation files (specifically, the hg19 .fa and the output from gtfToGenePred). Unfortunately, the command didn't produce any annotated fusion junctions again. I wonder if this is due to how the file obtained in the gtfToGenePred step is structured. The file is being read and there are no issues in the CIRCexplorer code, which was a problem previously when we only provided the GTF file. There seems to be an issue with obtaining annotation information. CIRCexplorer2 annotate -r /mnt/work1/genodata/hg19/Indexed_hg19_STAR/UCSC_geneGTFtoGENEPRED.txt -g /mnt/work1/genodata/hg19/bwa_index/hg19.bowtie.fa -b back_spliced_junction.bed -o circularRNA_Diff0_T17und.txt > CIRCexplorer2_Diff0_T17und_annotate.log

I appreciate your insights on how to address this matter effectively. Thank you.

Best regards,

Cheers,

andre-gabriel-42 avatar Nov 07 '23 16:11 andre-gabriel-42