TypeTE

Genotyping of segregating mobile element insertions

TypeTE v1.1

changelog v1.0 --> v1.1

  • Output vcf:
    • Remove irrelevant INFO fields from the header of the output VCFs
    • Reference genotypes are now printed in the traditional (REF/ALT) format, with REF = TE present = 0 and ALT = TE absent (deletion) = 1.
  • Hard code python2.7 in assembly script to match Spades requirements
  • Improve Non-Reference allele reconstruction script at TSD
  • Fix bugs and silence harmless error messages
  • Rename parameterfile_NoRef.init to parameterfile_NRef.init to match regular script naming
  • Create tutorial section (upcoming manuscript)

see the TypeTE paper in NAR (2020)

Purpose

TypeTE is a pipeline dedicated to genotyping segregating Mobile Element Insertions (MEIs) previously scored with an MEI detection tool such as MELT (Mobile Element Locator Tool, Gardner et al., 2017). TypeTE extracts reads from each detected polymorphic MEI and accurately reconstructs both the presence and absence alleles. The reads are then remapped at the individual level, which allows the genotype of each MEI to be scored using a modified version of the genotype likelihood of Li et al. This method dramatically improves the quality of the genotypes of reported MEIs and can be used directly after a MELT run on both non-reference and reference insertions.


TypeTE is divided into two modules: "Non-reference" to genotype insertions absent from the reference genome and "Reference" to genotype TE copies present in the reference genome.

Currently TypeTE works only with Alu insertions in the human genome, but it will soon be available for L1 and SVA, as well as virtually any retrotransposon in any organism with a reference genome.

This pipeline is developed by Jainy Thomas (University of Utah) and Clement Goubert (Cornell University), in collaboration with Jeffrey M. Kidd (University of Michigan).

Please address all your questions and comments using the "issue" tab of the repository. This allows the community to search for and directly find answers to their issues. If no help is coming, you can email your questions to goubert.clement[at]gmail.com

Installation

Dependencies

A docker container is coming for TypeTE! Stay tuned to get the latest version as soon as it comes out!

TypeTE relies on popular software often already in the toolbox of computational biologists! The following programs need to be installed and their paths reported in the file "parameterfile_[N]Ref.init". The Perl executable must be in the user path.

  • PERL https://www.perl.org/
    • BioPerl https://bioperl.org/INSTALL.html
  • PYTHON 2.7 https://www.python.org/download/releases/2.7/ (Not compatible with Python 3)
    • pysam https://github.com/pysam-developers/pysam
  • PARALLEL https://www.gnu.org/software/parallel/
  • PICARD https://broadinstitute.github.io/picard/
  • BEDTOOLS http://bedtools.readthedocs.io/en/latest/
  • SEQTK https://github.com/lh3/seqtk
  • BAMUTILS https://genome.sph.umich.edu/wiki/BamUtil
  • SPADES http://cab.spbu.ru/software/spades/
  • MINIA http://minia.genouest.org/
  • CAP3 http://seq.cs.iastate.edu/cap3.html
  • BLAST ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
  • BWA http://bio-bwa.sourceforge.net/bwa.shtml
  • BGZIP http://www.htslib.org/doc/bgzip.html
  • TABIX http://www.htslib.org/doc/tabix.html
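
As a quick sanity check before filling in the parameter files, the sketch below lists which tools cannot be found on the PATH. The function name missing_tools is hypothetical and not part of TypeTE, and the executable names are the usual ones for these programs; tools installed outside the PATH (for example Picard, SPAdes, Minia or CAP3 directories) still need to be located by hand.

```shell
# Print the name of every executable that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not a TypeTE script.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usual executable names; adjust them to your own installation.
missing_tools perl python2.7 parallel bedtools seqtk blastn bwa bgzip tabix
```

An empty output means every listed executable was found.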

Download and install

  1. Clone from git repository:
git clone --recurse-submodules https://github.com/clemgoub/TypeTE.git
cd TypeTE
  2. Complete the fields associated with the path of each dependent program in the files "parameterfile_Ref.init" and "parameterfile_NRef.init"

  3. And that's it!


Files preparation

You will need:

  1. A vcf/vcf.gz file (VCF) such as generated by the MELT discovery workflow. Examples are available in the folder "test_data". The vcf file must contain only Reference or only Non-reference loci, according to the module chosen. Loci/individuals must be sampled from the original vcf/vcf.gz using the flag --recode-INFO-all in vcftools so that the subsetted vcf is compatible with TypeTE. If a new vcf is created specifically for TypeTE, the following tags must be present in the "INFO" field (column) for non-reference loci only:
  • MEINFO= with predicted subfamily (Repbase name) and orientation of the TE (ex: MEINFO=AluYa5,.,.,+ | if the subfamily is unknown: MEINFO=AluUndef,.,.,+)
  • TSD= to indicate the predicted TSD (ex: TSD=AATAGAATTAGCAATTTTG | if no TSD detected TSD=null)

example:

##fileformat=VCFv4.1
##<HEADER OF THE VCF FILE>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA07056	NA11830	NA12144
1	72639020	ALU_umary_ALU_244	C	<INS:ME:ALU>	.	.	MEINFO=AluUndef,4,281,-;TSD=AGCAATCTTATTTTC	GT	0|1	0|0	0|1
10	69994906	ALU_umary_ALU_8067	G	<INS:ME:ALU>	.	.	MEINFO=AluUndef,8,280,+;TSD=AATAGAATTAGCAATTTTG	GT	0|0	0|1	0|1

The "TSD=" and "MEINFO=" tags may appear in any order in the "INFO" column (8) of the vcf. These fields are not required for the Reference module, where they are extracted from the reference genome.
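
The VCF requirements above can be sketched in shell as follows. Both function names (subset_vcf, records_missing_tags) and all file names are hypothetical; the vcftools options used (--vcf, --keep, --recode, --recode-INFO-all, --out) are standard vcftools flags, and the tag check is only a rough guard for non-reference VCFs, not part of TypeTE.

```shell
# Subset a VCF to a list of samples while keeping every INFO tag (requires vcftools).
# Arguments: input VCF, file with one sample name per line, output prefix.
subset_vcf() {
    vcftools --vcf "$1" --keep "$2" --recode --recode-INFO-all --out "$3"
}

# Print the ID (column 3) of every record whose INFO column (8)
# lacks a MEINFO= or a TSD= tag.
records_missing_tags() {
    awk -F'\t' '!/^#/ && ($8 !~ /(^|;)MEINFO=/ || $8 !~ /(^|;)TSD=/) { print $3 }' "$1"
}

# Usage (file names are placeholders):
#   subset_vcf melt_output.vcf samples.txt subset
#   records_missing_tags subset.recode.vcf
```

An empty output from records_missing_tags means every record carries both tags.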

  2. BAM files for each individual found in the vcf file

  3. A two-column tab-separated table with the sample name and corresponding bam name (BAMFILE):

sample1 sample1-xxx-file.bam
sample2 sample2-yyy-file.bam
sample3 sample3-zzz-file.bam
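
A small check along these lines can catch a mismatch between the table and the bam folder before a run. The helper name missing_bams is hypothetical, and the sketch assumes the table is tab-separated as described above and ends with a newline.

```shell
# Print every "sample<TAB>bam" entry of the table whose bam file
# is not present in the bam folder.
# Arguments: path to the two-column table, path to the bam folder.
missing_bams() {
    table=$1
    bam_dir=$2
    while IFS="$(printf '\t')" read -r sample bam; do
        [ -n "$sample" ] || continue          # skip empty lines
        [ -f "$bam_dir/$bam" ] || printf '%s\t%s\n' "$sample" "$bam"
    done < "$table"
}

# Usage (paths are placeholders):
#   missing_bams test_data/input_table.txt /path/to/bams
```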
  4. Reference genome (GENOME) in fasta format (to date tested with hg19 and hg38). If another reference genome is used, you will need to update the RepeatMasker track corresponding to your reference as well as to the repeat you want to genotype.

  5. RepeatMasker Track (RM_TRACK): a .bed file reporting each reference MEI masked by RepeatMasker for the reference sequence provided. The family names must match the names of the consensus sequences given in the RM_FASTA field (provided by default for Alu on hg19 and hg38).

  6. RepeatMasker Consensus (RM_FASTA): a .fasta file with the consensus sequences of the repeats analysed (provided by default for Alu).

  7. Edit the file "parameterfile_NRef.init" or "parameterfile_Ref.init" following the indications within:

### MAIN PARAMETERS

# user data
VCF="/workdir/cg629/bin/TypeTE/test_data/test_data_nonref.vcf" #Path to MELT vcf (.vcf or .vcf.gz) must contain INFO field with TSD and MEI type
BAMPATH="/workdir/cg629/Projects/TypeTE_tutorial/test_data/" # Path to the bams folder
BAMFILE="/workdir/cg629/bin/TypeTE/test_data/input_table.txt" # <indiv_name> <bam_name> (2 fields tab separated table)

# genome data
RM_TRACK="/workdir/cg629/bin/TypeTE/Ressources/RepeatMasker_Alu_hg19.bed" # set by default for hg19
RM_FASTA="/workdir/cg629/bin/TypeTE/Ressources/refinelib" # set by default to be compatible with the Repeat Masker track included in the package
GENOME="/workdir/cg629/Projects/testTypeTE/hs37d5.fa" # Path to the reference genome sequence

# output
OUTDIR="/workdir/cg629/Projects/TypeTE_tutorial" # Path to place the output directory (will be named after PROJECT); OUTDIR must exist
PROJECT="OUTPUTS_NRef_testdata" # Name of the project (name of the folder)

# multi-threading
individual_nb="1" # number of individuals per job (try to minimize that number)
CPU="3" # number of CPU (try to maximize that number) # CPU x individual_nb >= total # of individuals

## non-mandatory parameters
MAP="NO" # YES or NO (experimental)

### DEPENDENCIES PATH
# /!\ PERL MUST BE IN PATH /!\
PARALLEL="/programs/parallel/bin/parallel" #Path to the GNU Parallel program
PICARD="/programs/picard-tools-2.9.0" #Path to Picard Tools
BEDTOOLS="/programs/bedtools-2.27.1/bin/bedtools" #Path to bedtools executable
SEQTK="/programs/seqtk" #Path to seqtk executable
BAMUTILS="/programs/bamUtil" #Path to bamUtil
SPADES="/programs/spades-3.5.0/bin" #Path to spades bin directory (to locate spades.py and dispades.py)
MINIA="/workdir/cg629/bin/minia/build/bin" #Path to minia bin directory
CAP3="/workdir/cg629/bin/CAP3" #Path to CAP3 directory
BLAST="/programs/ncbi-blast-2.7.1+/bin" #Path to blast bin directory
BWA="/programs/bwa-0.5.9/bwa" #Path to bwa executable
BGZIP="bgzip" #Path to bgzip executable
TABIX="tabix" #Path to tabix executable
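
Since most failed runs come down to a wrong path, a pre-flight check such as the sketch below may be useful. It assumes the parameter file can be read as plain shell variable assignments, as its syntax suggests; check_params is a hypothetical helper and not part of TypeTE.

```shell
# Source a parameter file in a subshell and report every input path
# that is empty or does not exist. Returns 0 when all paths are found.
# Give the file with a path (e.g. ./parameterfile_NRef.init) so that
# "." does not search the PATH for it.
check_params() (
    . "$1" || exit 1
    status=0
    for name in VCF BAMPATH BAMFILE RM_TRACK RM_FASTA GENOME OUTDIR; do
        eval "value=\${$name:-}"
        if [ -z "$value" ] || [ ! -e "$value" ]; then
            printf 'missing: %s=%s\n' "$name" "$value"
            status=1
        fi
    done
    exit "$status"
)

# Usage:
#   check_params ./parameterfile_NRef.init
```

The same loop could be extended to the dependency variables (PICARD, BEDTOOLS, and so on) when they hold full paths.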

Running TypeTE

  1. Fill in the appropriate parameterfile_[N]Ref.init according to your local paths and files
  2. Run the following command in the TypeTE folder:
nohup ./run_TypeTE_[N]Ref.sh &> TypeTE.log &

Use ./run_TypeTE_Ref.sh for reference insertions and ./run_TypeTE_NRef.sh for non-reference insertions.
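
The parameter file asks for CPU x individual_nb to be at least the total number of individuals. The arithmetic can be sketched as below; min_cpu is a hypothetical helper name used only for illustration.

```shell
# Smallest CPU value satisfying CPU x individual_nb >= total individuals,
# i.e. the ceiling of total / individual_nb.
# Arguments: total number of individuals, individuals per job.
min_cpu() {
    total=$1
    per_job=$2
    echo $(( (total + per_job - 1) / per_job ))
}

min_cpu 3 1    # test data: 3 individuals, 1 per job; prints 3
min_cpu 10 4   # prints 3
```

With the test data (3 individuals, individual_nb="1") this gives 3, which matches the CPU="3" setting in the example parameter file.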


Output

TypeTE outputs a vcf.gz file containing all individual genotypes with their genotype likelihoods. The VCF convention reports genotypes relative to the allele present in the reference genome. TypeTE therefore reports Reference insertions as 0/0 (homozygous for the presence of the TE) or 0/1 (heterozygous), with 1/1 genotypes being homozygous for the absence of the TE. The pattern is the opposite for Non-Reference insertions: 0/0 is homozygous for the absence of the TE and 1/1 is homozygous for its presence.
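
Because the meaning of 0 and 1 flips between the two modules, it can help to convert genotypes into a number of TE copies. The following is a minimal sketch of that convention; te_copies is a hypothetical helper and the module labels "Ref" and "NRef" are only borrowed from the script names.

```shell
# Convert a genotype into the number of TE copies carried (0, 1 or 2).
# Arguments: module ("Ref" or "NRef"), genotype (e.g. 0/1 or 0|1).
te_copies() {
    module=$1
    gt=$(printf '%s' "$2" | tr '|' '/')   # treat phased and unphased alike
    case $gt in
        0/0) alt=0 ;;
        0/1|1/0) alt=1 ;;
        1/1) alt=2 ;;
        *) echo NA; return 0 ;;           # missing or unexpected genotype
    esac
    if [ "$module" = "Ref" ]; then
        echo $((2 - alt))                 # Reference: ALT allele = TE absent
    else
        echo "$alt"                       # Non-reference: ALT allele = TE present
    fi
}

te_copies Ref 0/0    # prints 2
te_copies NRef 0/0   # prints 0
te_copies NRef 1/1   # prints 2
```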

Test runs

Non-reference insertions

We have prepared a small tutorial/test run to check that all the components of TypeTE work properly.

We are going to run the pipeline on 2 loci in 3 individuals from the 1000 Genomes Project.

  1. Download the bam and bam.bai files

Within the TypeTE folder, type:

cd test_data
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA07056/alignment/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA11830/alignment/NA11830.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.bam.bai
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12144/alignment/NA12144.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam.bai

The corresponding bam/bam.bai files will be downloaded into /TypeTE/test_data

  2. Copy the parameterfile_NRef.init template present in /TypeTE/test_data to the main folder
cp parameterfile_NRef.init ../
cd ../
  3. Edit the parameterfile_NRef.init according to your dependencies and local paths.

  4. Run TypeTE

nohup ./run_TypeTE_NRef.sh &> TypeTE_TESTRUN.log &
  5. Expected results

The genotypes from the original vcf (<>/TypeTE/test_data/test_data_nonref.vcf) are the following

             NA07056  NA11830  NA12144
1_72639020   0/1      0/0      0/1
10_69994906  0/0      0/1      0/1

The new genotypes should be

             NA07056  NA11830  NA12144
1_72639020   1/1      0/1      0/1
10_69994906  0/0      1/1      0/1
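
To compare your own run with the table above, the genotypes can be pulled out of the output vcf.gz in the same locus-by-sample layout. This is a sketch only: genotype_table is a hypothetical helper, and the name of the output file is left as a placeholder because it depends on your OUTDIR and PROJECT settings.

```shell
# Read an uncompressed VCF on standard input and print one line per locus
# (CHROM_POS) followed by the genotype of each sample, tab-separated.
genotype_table() {
    awk -F'\t' '
        /^##/ { next }                       # skip meta-information lines
        /^#CHROM/ {                          # header line: print sample names
            line = "locus"
            for (i = 10; i <= NF; i++) line = line "\t" $i
            print line
            next
        }
        {
            line = $1 "_" $2
            for (i = 10; i <= NF; i++) {
                split($i, field, ":")        # GT is the first sub-field when present
                line = line "\t" field[1]
            }
            print line
        }'
}

# Usage (replace the placeholder with your output file):
#   gzip -dc <output>.vcf.gz | genotype_table
```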

Reference insertions

We will here genotype two reference loci in the same three individuals:

  1. Copy the parameterfile_Ref.init present in /TypeTE/test_data to the main folder
cp test_data/parameterfile_Ref.init .
  2. Edit the parameterfile_Ref.init according to your dependencies and local paths (but do not change anything else!)

  3. Run TypeTE

nohup ./run_TypeTE_Ref.sh &> TypeTE_TESTRUN_ref.log &
  4. Expected results

The genotypes from the original vcf (<>/TypeTE/test_data/test_data_ref.vcf) are the following

            NA07056  NA11830  NA12144
5_88043130  0/1      1/1      0/1
6_7717368   0/1      0/1      0/1

The new genotypes should be

            NA07056  NA11830  NA12144
5_88043130  1/1      0/1      0/1
6_7717368   1/1      1/1      0/1