SimSeq
SimSeq copied to clipboard
An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for...
SimSeqNBProject currently contains the net beans project with the initial source
getErrorProfile contains the c source for the program that generates the error model for the sequence simulator. This program utilizes a subset of the kent source library (http://hgwdev.cse.ucsc.edu/~kent/src/) which is freely available for both commercial and non-commercial use (see README in "ThirdParty/kent" for more information)
examples contains some things to run this on. Right now there is an example mitochondrial sequence, and an example error profile from a 100 bp illumina GAIIx dataset. You can simulate any read less than or equal to 100bp in length using that error profile.
usage: java -jar -Xmx2048m SimSeq.jar [required options] [options]
Last Updated: 4.12.2011
-1,--read1_length
===Quick Start Example:=== ==Prerequisits== 0.0. Download and install samtools: http://samtools.sourceforge.net/ and add executable to your $PATH 0.1. Download latest Picard jar file: http://sourceforge.net/projects/picard/files/, and extract .jar files somewhere that makes sense like $HOME/jars/, for the rest of this I will assume that all jar files are located in $HOME/jars/ ==Generate and check simulated reads==
- Download SimSeq.jar, error_profile.txt, and AlligatorMito.fa
- put SImSeq.jar in your jar directory ($HOME/jars/ or something) and put error_profile.txt and AlligatorMito.fa in a folder where you want to work
- Run
java -jar -Xmx2048m $HOME/jars/SimSeq.jar -1 100 -2 100 \ --error hiseq_mito_default_bwa_mapping_mq10_1.txt \ --error2 hiseq_mito_default_bwa_mapping_mq10_2.txt \ --insert_size 3000 --insert_stdev 300 \ --mate_pair --mate_frag 500 --mate_frag_stdev 50 \ --mate_pulldown_error_p 0.3 --read_number 10000 \ --read_prefix AMito_ --reference AlligatorMito.fa \ --duplicate_probability 0.01 --out out.sam - Convert sam output to bam:
samtools view -bS -T AlligatorMito.fa -t AlligatorMito.size -o out.bam out.sam - Sort the bam file:
samtools sort out.bam out.sorted - Index the sorted bam file:
samtools index out.sorted.bam - Check the alignment in tview:
samtools tview out.sorted.bam AlligatorMito.fa==Convert Sam output to Fastq== runjava -jar -Xmx2048 $HOME/jars/SamToFastq.jar INPUT=out.sorted.bam FASTQ=library.1.fastq SECOND_END_FASTQ=library.2.fastq INCLUDE_NON_PF_READS=true VALIDATION_STRINGENCY=SILENT
Note: setting the VALIDATION_STRINGENCY=SILENT is probably not ideal, but for now it is necessary because samtools isn't putting the length of the reference sequence into the header, which is expected by Picard. You get an error about the first read having a coordinate outside of the range of the reference sequence which has a length of 0. If you use Picard to somehow fix the header given the reference sequence this should work without telling it not to validate your sequence.
Also Note: The current sam alignment for a chimeric read shows the fragment starting at the left-most coordinate. Since the rest of the sequence occurs before the left most coordinate I use a soft clipping for the remainder of the sequence. This does not effect the output fastq file generated by the above method, but when you are looking at the alignment in samtools tview you will see some sequences as short as 1 nucleotide. This is normal, and doesn't mean that the actual sam records are that short.
When producing chimeric reads, samtools doesn't currently allow you to view the backwards jumps that occur. Doing so would involve a '-xN' operation on a cigar string. I added a custom tag ('YC:Z:[CIGAR]') at the end of each sam string that shows either a normal cigar string if the read can be represented with one, or a custom cigar string that allows for negative jumps needed to represent chimeric reads if the read is chimeric. Here is an example of what a chimeric read's tag would look like YC:Z:97M-4996N3M while a non-chimeric read would have a normal cigar string in its tag like: YC:Z:100M. To reconstruct a chimeric read, you start at the start coordinate for the read which corresponds to the left most coordinate, then you proceed for the number of matches 'M', then you jump the specified number of positions from your current location, in this case -4996, and finally proceed with the remaining CIGAR string. Right now since I don't add indels to the data, this string will either be a match or one of these chimeric jumps.