Pisces icon indicating copy to clipboard operation
Pisces copied to clipboard

GenomeSize.xml needed for hs37d5 with PhiX

Open chrausch opened this issue 6 years ago • 7 comments

Dear Tamsen,

we use hs37d5 as reference genome which is based on hg19 but has additional decoy sequences plus the PhiX genome added by us. (See https://wiki.dnanexus.com/scientific-notes/human-genome#The-%22b37+decoy%22-/-%22hs37d5%22-extensions-(by-the-1000-Genomes-Project-Phase-II)).

For running Pisces we need the GenomeSize.xml file. Would you have code to make it? It would be tedious to write code to make such xml especially I assume that you or someone must have that already.

Thanks! Christian.

chrausch avatar Feb 09 '18 10:02 chrausch

Hi Christian,

That's a really good point. I'll add a tool to do that properly as soon as I have a chance.

Meanwhile, here is the old code I have lying around, below. You can add it to the GenomeMetadata.cs class in Common.IO. It probably wont quite compile. Needs to be updated to .net core 2.0, but it should be really close, and should at least save you starting from scratch.

    /// <summary>
    /// Serializes the genome metadata to an XML file
    /// </summary>
    public void Serialize(string outputFilename)
    {
        // open the XML file
        // Initialize with StreamWriter to avoid URI parsing of input filename which doesn't 
        // like # in the path
        using (XmlWriter xmlWriter = new XmlWriter(new StreamWriter(outputFilename, false, Encoding.ASCII)))
        {
            xmlWriter.Formatting = Formatting.Indented;
            xmlWriter. = '\t';
           xmlWriter.Indentation = 1;

            // write all of our sequences
            xmlWriter.WriteStartElement("sequenceSizes");
            if (!string.IsNullOrEmpty(Name)) xmlWriter.WriteAttributeString("genomeName", Name);

            foreach (SequenceMetadata refSeq in Sequences)
            {
                xmlWriter.WriteStartElement("chromosome");

                // required for compatibility with CASAVA 1.8
                xmlWriter.WriteAttributeString("fileName", Path.GetFileName(refSeq.FastaPath));
                xmlWriter.WriteAttributeString("contigName", refSeq.Name);
                xmlWriter.WriteAttributeString("totalBases", refSeq.Length.ToString());

                // additional attributes for MiSeq
                if (!string.IsNullOrEmpty(refSeq.Build)) xmlWriter.WriteAttributeString("build", refSeq.Build);
                xmlWriter.WriteAttributeString("isCircular", (refSeq.IsCircular ? "true" : "false"));
                if (!string.IsNullOrEmpty(refSeq.Checksum)) xmlWriter.WriteAttributeString("md5", refSeq.Checksum);
                xmlWriter.WriteAttributeString("ploidy", refSeq.Ploidy.ToString());
                if (!string.IsNullOrEmpty(refSeq.Species))
                    xmlWriter.WriteAttributeString("species", refSeq.Species);
                xmlWriter.WriteAttributeString("knownBases", refSeq.KnownBases.ToString());
                xmlWriter.WriteAttributeString("type", refSeq.Type.ToString());

                xmlWriter.WriteEndElement();
            }
            xmlWriter.WriteEndElement();

            // close the XML file
            xmlWriter.Close();
        }
    }

Thanks for raising the issue!

tamsen avatar Feb 09 '18 20:02 tamsen

FYI, if you use samtools -faidx genome.fa, that will generate GenomeSize.xml

arnoldliao avatar Mar 02 '18 17:03 arnoldliao

nope, looks like not. I'll post a tool as soon as I get a chance.

http://www.htslib.org/doc/samtools.html

faidx samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

The input file can be compressed in the BGZF format.

The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.

tamsen avatar Mar 02 '18 20:03 tamsen

Hi Christian,

https://github.com/Illumina/Pisces/releases/tag/v5.2.7.47 CreateGenomeSizeFile_5.2.7.47.tar.gz

tamsen avatar Apr 11 '18 19:04 tamsen

Hello @tamsen, thanks for the tool, I am generating my own GenomeSize file :).

One appointment, the binary file updated at git repository seems to point to https://github.com/Illumina/Pisces/blob/master/binaries/5.2.7.47/CreateGenomeSizeFile_5.2.7.47.tar.gz VennVCF library

Thank you!

lpalomerol avatar Aug 01 '18 12:08 lpalomerol

Hi @tamsen , any chance you could add the correct tarball to the repo for CreateGenomeSizeFile?

CreateGenomeSizeFile_5.2.9.122.tar.gz is an empty file, and CreateGenomeSizeFile_5.2.7.47.tar.gz contains VennVCF, not CreateGenomeSizeFile.

astewart-twist avatar Dec 14 '18 21:12 astewart-twist

Hi, should be updated.. Please let me know if any weirdness continues. Thanks! https://github.com/Illumina/Pisces/issues/23

tamsen avatar Dec 17 '18 18:12 tamsen