simpleaf icon indicating copy to clipboard operation
simpleaf copied to clipboard

CHM13 Index

Open AndrewSkelton opened this issue 1 year ago • 0 comments

Hi all,

Thanks a lot for building such an excellent eco system of tools, your hard work is very appreciated by everyone in the community :)

I've been checking out this pre-print, and wanted to build a CHM13 af index. This has sent me down a rabbit hole and hope you could advise or give an opinion if I'm going in the complete wrong direction.

I'm using the bioconda release of simpleaf:

alevin-fry v0.8.2
piscem v0.6.0
salmon v1.10.2
simpleaf v0.14.1

I've checked out the links in the CHM13 github repo, which offers a number of genomic fasta options and gff files. The paper uses the chm13v2.0_maskedY.fa fasta and GENCODE 35 annotations from this repo.

Converting the GFF3 file to a friendly format which encompasses all the fields needed has been tricky, but a bit of gffread magic helped:

gunzip -c CHM13_Gencode_v38.gff3.gz \
| gffread -F -T -o - \
| sed 's/gene://g; s/transcript://g' \
| gzip > CHM13_Gencode_v38.gtf.gz

To try and mimic annotation as close to expected though, I took the CellRanger human GTF and used liftover which the provided chain file

liftover -gff refdata-gex-GRCh38-2020-A/genes/genes.gtf grch38-chm13v2.chain CHM13_CellRanger.gtf CHM13_CellRanger.unmapped.gtf
pigz CHM13_CellRanger.gtf CHM13_CellRanger.unmapped.gtf grch38-chm13v2.chain

Using CHM13_CellRanger.gtf.gz and chm13v2.0_maskedY.fa as input to simpleaf index resulted in an error suggesting that some exons for a given transcript were on different chromosomes or strands. So a little python to pick those out

# check_strands.py
import pandas as pd
# Define the column names for a GTF file
col_names = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes"]
# Read the GTF file
df = pd.read_csv('CHM13_CellRanger.gtf', sep='\t', comment='#', names=col_names)
# Extract the transcript_id from the attributes column
df['transcript_id'] = df['attributes'].str.extract('transcript_id "([^"]+)"')
# Group by transcript_id and check if there is more than one unique strand per group
grouped = df.groupby('transcript_id')['strand'].nunique()
multi_strand_transcripts = grouped[grouped > 1]
print(multi_strand_transcripts)
pigz -d CHM13_CellRanger.gtf.gz
python check_strands.py | grep ENST - | cut -f 1 -d ' ' | sort | uniq > strand.tsv
grep -Ff strand.tsv CHM13_CellRanger.gtf | awk -v FS="\t" '{split($9,a,"; "); for(i in a) {if(a[i] ~ /gene_name/) {print a[i]}}}' | awk -F\" '{print $2}' | sort | uniq > gene_names.tsv
## Genes with liftover strand issues
> cat gene_names.tsv
C5orf60
CCDC200
CDR2
DAZ1
DAZ2
DAZ4
GPAT2
IFITM2
LINC01410
NPIPB8
PRAMEF18
TRIM49D1

I then opted to remove those transcripts from the GTF

grep -v -F -f strand.tsv CHM13_CellRanger.gtf > CHM13_CellRanger.filtered.gtf
pigz CHM13_CellRanger.gtf CHM13_CellRanger.filtered.gtf

This resulted in a new error on the index build, around the lower case letters in the genomic fasta file (ie the softmasking). For lack of better options, I uppercased.

pigz -d chm13v2.0_maskedY.fa.gz
awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' chm13v2.0_maskedY.fa > chm13v2.0_maskedY_upper.fa

...and then the index builds successfully:

simpleaf index \
            --output ./Index_SimpleAF_CHM13 \
            --fasta chm13v2.0_maskedY_upper.fa \
            --gtf CHM13_CellRanger.filtered.gtf.gz \
            --rlen 91 \
            --threads 8 \
            --use-piscem
tree Index_SimpleAF_CHM13 > Index_SimpleAF_CHM13.tree

Any thoughts or suggestions are welcome, or if I've missed an existing build somewhere, a point in the right direction would be greatly appreciated.

Thanks,

Andrew

AndrewSkelton avatar Aug 01 '23 10:08 AndrewSkelton