Pisces
Pisces copied to clipboard
GenomeSize.xml needed for hs37d5 with PhiX
Dear Tamsen,
we use hs37d5 as reference genome which is based on hg19 but has additional decoy sequences plus the PhiX genome added by us. (See https://wiki.dnanexus.com/scientific-notes/human-genome#The-%22b37+decoy%22-/-%22hs37d5%22-extensions-(by-the-1000-Genomes-Project-Phase-II)).
For running Pisces we need the GenomeSize.xml file. Would you have code to make it? It would be tedious to write code to make such xml especially I assume that you or someone must have that already.
Thanks! Christian.
Hi Christian,
That's a really good point. I'll add a tool to do that properly as soon as I have a chance.
Meanwhile, here is the old code I have lying around, below. You can add it to the GenomeMetadata.cs class in Common.IO. It probably wont quite compile. Needs to be updated to .net core 2.0, but it should be really close, and should at least save you starting from scratch.
/// <summary>
/// Serializes the genome metadata to an XML file
/// </summary>
public void Serialize(string outputFilename)
{
// open the XML file
// Initialize with StreamWriter to avoid URI parsing of input filename which doesn't
// like # in the path
using (XmlWriter xmlWriter = new XmlWriter(new StreamWriter(outputFilename, false, Encoding.ASCII)))
{
xmlWriter.Formatting = Formatting.Indented;
xmlWriter. = '\t';
xmlWriter.Indentation = 1;
// write all of our sequences
xmlWriter.WriteStartElement("sequenceSizes");
if (!string.IsNullOrEmpty(Name)) xmlWriter.WriteAttributeString("genomeName", Name);
foreach (SequenceMetadata refSeq in Sequences)
{
xmlWriter.WriteStartElement("chromosome");
// required for compatibility with CASAVA 1.8
xmlWriter.WriteAttributeString("fileName", Path.GetFileName(refSeq.FastaPath));
xmlWriter.WriteAttributeString("contigName", refSeq.Name);
xmlWriter.WriteAttributeString("totalBases", refSeq.Length.ToString());
// additional attributes for MiSeq
if (!string.IsNullOrEmpty(refSeq.Build)) xmlWriter.WriteAttributeString("build", refSeq.Build);
xmlWriter.WriteAttributeString("isCircular", (refSeq.IsCircular ? "true" : "false"));
if (!string.IsNullOrEmpty(refSeq.Checksum)) xmlWriter.WriteAttributeString("md5", refSeq.Checksum);
xmlWriter.WriteAttributeString("ploidy", refSeq.Ploidy.ToString());
if (!string.IsNullOrEmpty(refSeq.Species))
xmlWriter.WriteAttributeString("species", refSeq.Species);
xmlWriter.WriteAttributeString("knownBases", refSeq.KnownBases.ToString());
xmlWriter.WriteAttributeString("type", refSeq.Type.ToString());
xmlWriter.WriteEndElement();
}
xmlWriter.WriteEndElement();
// close the XML file
xmlWriter.Close();
}
}
Thanks for raising the issue!
FYI, if you use samtools -faidx genome.fa, that will generate GenomeSize.xml
nope, looks like not. I'll post a tool as soon as I get a chance.
http://www.htslib.org/doc/samtools.html
faidx samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
The input file can be compressed in the BGZF format.
The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
Hi Christian,
https://github.com/Illumina/Pisces/releases/tag/v5.2.7.47 CreateGenomeSizeFile_5.2.7.47.tar.gz
Hello @tamsen, thanks for the tool, I am generating my own GenomeSize file :).
One appointment, the binary file updated at git repository seems to point to https://github.com/Illumina/Pisces/blob/master/binaries/5.2.7.47/CreateGenomeSizeFile_5.2.7.47.tar.gz VennVCF library
Thank you!
Hi @tamsen , any chance you could add the correct tarball to the repo for CreateGenomeSizeFile?
CreateGenomeSizeFile_5.2.9.122.tar.gz is an empty file, and CreateGenomeSizeFile_5.2.7.47.tar.gz contains VennVCF, not CreateGenomeSizeFile.
Hi, should be updated.. Please let me know if any weirdness continues. Thanks! https://github.com/Illumina/Pisces/issues/23