seq-frontiers-class
Code, examples, and reading list for the JHU Frontiers of Sequencing Data Analysis class
Class sessions
At the beginning of the course, I will lead some informal discussions covering:
- Introduction to genomics, computational genomics, and DNA sequencing
- Introduction to RNA sequencing data analysis
- Introduction to big-data methods for sequencing data analysis
- Project ideas
You'll give a work-in-progress presentation on your final project sometime around spring break. You'll give a final project presentation at the end of the semester. A couple of guest lectures are likely as well.
Otherwise, class sessions will be devoted to discussing literature. A student will be selected ahead of time to present in each class, and the student will select and announce 1 or 2 papers for discussion. The student will then lead a 60-to-75-minute discussion of those papers in class. The student should present the main ideas and results of the papers using some combination of slides, chalkboard, and demonstrations. Everyone is encouraged to participate in every discussion; participation is an important part of your grade.
Readings
If you are taking my class and you have any trouble accessing these resources, please contact me. All of these articles should be easily accessible from the JHU campus or via VPN / library proxy.
Broad surveys
- Life and its Molecules by Lawrence Hunter
- A decade's perspective on DNA sequencing technology by Elaine Mardis
- Sequencing technologies -- the next generation by Michael Metzker
- The DNA Data Deluge by Schatz, Langmead
RNA sequencing data analysis
Surveys
- RNA-Seq: a revolutionary tool for transcriptomics by Wang, Gerstein, Snyder
- RNA sequencing: advances, challenges and opportunities by Ozsolak, Milos
- Computational methods for transcriptome annotation and quantification using RNA-seq by Garber, Grabherr, Guttman and Trapnell
- From RNA-seq reads to differential expression results by Oshlack, Robinson, Young
Spliced alignment
- QPALMA: Optimal spliced alignments of short sequence reads by De Bona et al
- MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery by Wang et al
- TopHat: discovering splice junctions with RNA-seq by Trapnell, Pachter, Salzberg
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions by Kim et al
- STAR: ultrafast universal RNA-seq aligner by Dobin et al
Assembly
- Cufflinks: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation by Trapnell et al
- CLASS: constrained transcript assembly of RNA-seq reads by Song, Florea
- Trinity: Full-length transcriptome assembly from RNA-Seq data without a reference genome by Grabherr et al
- PSGInfer: Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs by LeGault, Dewey
Scalable methods
- eXpress: Streaming fragment assembly for real-time analysis of sequencing experiments by Roberts, Pachter (also listed below)
- Fragment assignment in the cloud with eXpress-D by Roberts, Feng, Pachter (also listed below)
- Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms by Patro, Mount, Kingsford
Comparisons
- Systematic evaluation of spliced alignment programs for RNA-seq data by Engstrom et al
- Assessment of transcript reconstruction methods for RNA-seq by Steijger et al
Big data methods for sequencing data analysis
Indexing
Surveys
- Prospects and limitations of full-text index structures in genome analysis by Vyverman et al
- Indexing Methods for Approximate String Matching by Navarro et al
- Introduction to the Burrows-Wheeler Transform and FM Index by Langmead
Types of indexes
- Suffix array: Suffix arrays: a new method for on-line string searches by Manber, Myers
- Enhanced suffix array: Replacing suffix trees with enhanced suffix arrays by Abouelhoda, Kurtz, Ohlebusch
- FM index: Opportunistic data structures with applications by Ferragina, Manzini
Sequencing read alignment tools that use the Suffix Array
Sequencing read alignment tools that use the FM Index
- Bowtie: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome by Langmead et al
- Bowtie 2: Fast gapped-read alignment with Bowtie 2 by Langmead, Salzberg
- BWA: Fast and accurate short read alignment with Burrows–Wheeler transform by Li, Durbin
- BWA-SW: Fast and accurate long-read alignment with Burrows–Wheeler transform by Li, Durbin
- BWA-MEM: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM by Li
- GEM: The GEM mapper: fast, accurate and versatile alignment by filtration by Marco-Sola et al
Indexing and querying large collections of sequencing reads using a suffix array and/or FM Index
- Gk-array: Querying large read collections in main memory: a versatile data structure by Philippe et al
- CRAC: an integrated approach to the analysis of RNA-seq reads by Philippe et al
- Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform by Cox et al
- GPU-Accelerated BWT Construction for Large Collection of Short Reads by Liu, Luo, Lam
- ropeBWT2 by Li (no paper, just a GitHub repo with a README)
Compression
- CRAM: Efficient storage of high throughput DNA sequencing data using reference-based compression by Hsi-Yang Fritz et al
- Quip: Compression of next-generation sequencing reads aided by highly efficient de novo assembly by Jones et al
- Adaptive reference-free compression of sequence quality scores by Janin, Rosone, Cox
- SCALCE: boosting sequence compression algorithms using locally consistent encoding by Hach et al
- Compression of FASTQ and SAM Format Sequencing Data by Bonfield, Mahoney
- QualComp: a new lossy compressor for quality scores based on rate distortion theory by Ochoa et al
Sketching and streaming
- Background
- eXpress: Streaming fragment assembly for real-time analysis of sequencing experiments by Roberts, Pachter (also listed above)
- Similarity Estimation Techniques from Rounding Algorithms by Charikar
- A random-permutations-based approach to fast read alignment by Lederman
- Efficient counting of k-mers in DNA sequences using a bloom filter by Melsted, Pritchard
- DSK: k-mer counting with very low memory usage by Rizk, Lavenier, Chikhi
- Oculus: faster sequence alignment by streaming read compression by Veeneman, Iyer, Chinnaiyan
- These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure by Zhang et al
Scalable computing
Frameworks
- MapReduce: simplified data processing on large clusters by Dean and Ghemawat
- Spark:
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing by Zaharia et al
- Spark: cluster computing with working sets by Zaharia et al
Scalable alignment
Scalable indexing
- Rapid Parallel Genome Indexing with MapReduce by Menon, Bhat, Schatz
- Scalable Parallel Suffix Array Construction by Kulla and Sanders
- Parallel Suffix Sorting based on Bucket Pointer Refinement by Mohamed and Abouelhoda
Other scalable tools
- Searching for SNPs with cloud computing by Langmead et al
- Cloud-scale RNA-sequencing differential expression analysis with Myrna by Langmead, Hansen, Leek
- ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing by Massie et al
- Fragment assignment in the cloud with eXpress-D by Roberts, Feng, Pachter (also listed above)
- WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis by Zerbino et al
Project resources
See the EN 600.439/639 page for many resources relevant to your final project, including:
- IPython notebooks describing genomics file formats and how to parse them in Python
- IPython notebooks demonstrating algorithms and data structures used in genomics
- Other resources for that class, including readings, videos, review materials
I've also started to post some relevant lecture notes on my lab website.