Looking for PlasmidDatabase and non-Plasmid Contigs dataset

Open sxh1136 opened this issue 4 years ago • 2 comments

Hi there,

I'm looking for the two datasets named in the title. The link given in the paper (https://genome.cshlp.org/content/29/6/961.full), http://data.cab.spbu.ru/index.php/s/tz7mCqDipgbcsbW, returns a "file not found" error.

Hope all is well. Thanks!

sxh1136, Jul 03 '20 13:07

Assigning the first author of the metaplasmidSPAdes paper. He will certainly be able to help you.

asl, Jul 03 '20 13:07

Hi. Thank you for noticing that the link is down; we'll fix it.

However, we did not upload these databases. That link shared only the metaplasmidSPAdes results, not the contigs used for plasmidVerify training/testing.

The construction of these databases is described in the text: the PlasmidDatabase data set contains all 9937 plasmids from the RefSeq database (total length 1007 Mb), and the nonPlasmidDatabase data set contains a randomly selected 10% of complete bacterial chromosomes from RefSeq (837 bacterial genomes with total length 3229 Mb). Both databases were then randomly split (70/30) into training and testing datasets, and the nonPlasmid testing sequences were additionally split into 10 kb chunks to better represent real fragmented assemblies.
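For anyone rebuilding the sets from RefSeq, the split and chunking steps could look roughly like the sketch below. It is not the script we used: the function names, the fixed seed, the toy records, and keeping a final chunk shorter than 10 kb are all assumptions made for illustration.

```python
import random


def split_train_test(names, train_fraction=0.7, seed=0):
    """Randomly split record names into (train, test) lists."""
    # Sort first so the result depends only on the seed (seed is an assumption).
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def chunk_sequence(seq, chunk_size=10_000):
    """Cut a sequence into consecutive chunks of at most chunk_size bases.

    The last chunk may be shorter; whether to keep it is an assumption here.
    """
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]


# Toy stand-in for non-plasmid RefSeq records (hypothetical names, 25,000 bp each).
records = {f"chr{i}": "ACGT" * 6_250 for i in range(10)}

train, test = split_train_test(records)

# Only the non-plasmid *testing* sequences are chunked.
test_chunks = {
    f"{name}_chunk{j}": chunk
    for name in test
    for j, chunk in enumerate(chunk_sequence(records[name]))
}
```

With real data, `records` would be filled from the downloaded FASTA files instead of the toy dictionary, and the same `split_train_test` call would be applied separately to the plasmid and non-plasmid sets.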

We can share the exact files we used, but if you want to retrain plasmidVerify, I'd recommend using more data: the RefSeq plasmid database has been extended since we started development.

Dmitry-Antipov, Jul 08 '20 22:07