
Migrate assembly data to lovelywater

Open ababaian opened this issue 5 years ago • 23 comments

We need to migrate all the assembly and annotation data generated as part of Serratus to our data lake in a structured way, so as to allow programmatic access. This is a proposed folder hierarchy to discuss, where $SRA is the accession variable.

Similar to the rest of the archive, I propose 'flat' folders broken up by major category, with every file carrying a $SRA prefix. So no contig/$SRA/$SRA.data.fa or contig/$SRA/data.tsv cases.

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments
├ cov_index.tsv       # Index file of CoV+ libraries
└ assembly_index.tsv  # Index file of assembled SRA libraries

assembly/cov/$SRA.cov.fa : Contigs identified to be CoV (i.e. 12K paper is based on)

  • Currently in : s3://serratus-public/assemblies/contigs/
  • Do not include 0B or empty files
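
As a sketch of how the empty files could be skipped (hypothetical; these are standard AWS CLI calls, not an existing migration script), listing the 0-byte keys first makes it easy to exclude them from the copy:

# List 0-byte objects under the current contigs prefix so they can be excluded
aws s3api list-objects-v2 \
  --bucket serratus-public \
  --prefix assemblies/contigs/ \
  --query 'Contents[?Size==`0`].Key' \
  --output text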

contigs/ : The coronaSPAdes output files, such as $SRA.inputdata.txt, $SRA.coronaspades.txt, $SRA.coronaspades.gene_clusters.fa ... $SRA.coronaspades.assembly_graph_with_scaffolds.gfa.gz

  • Currently as s3://serratus-public/assemblies/other/$SRA.coronaspades/$SRA...
  • Remove $SRA.coronaspades/ intermediate folder
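
A hypothetical sketch of that flattening step with the plain AWS CLI (the real migration may have been scripted differently): because aws s3 mv --recursive appends only the key names under the source prefix to the destination, moving each $SRA.coronaspades/ prefix straight into contigs/ drops the intermediate folder:

for dir in $(aws s3 ls s3://serratus-public/assemblies/other/ | awk '/PRE/ {print $2}'); do
  # dir looks like "SRR1234567.coronaspades/"
  aws s3 mv "s3://serratus-public/assemblies/other/${dir}" \
            s3://lovelywater/assembly/contigs/ --recursive
done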

annotation/

  • Currently as s3://serratus-public/assemblies/annotations/

gz/ : I was originally thinking of also storing the data as a single $SRA.tar.gz file containing cov/ contig/ and annotation/ data but this will duplicate the data and is probably not a good idea. Instead we can provide a short grabSRA.sh $SRA script which will automatically download all the files associated with a particular $SRA to the local system for users.
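
A minimal sketch of what grabSRA.sh could look like, assuming the layout proposed above and the AWS CLI (the actual helper, if written, may differ):

#!/usr/bin/env bash
# grabSRA.sh (sketch): download every lovelywater file for one accession.
set -euo pipefail
SRA="$1"
for dir in cov contigs annotation; do
  aws s3 cp "s3://lovelywater/assembly/${dir}/" . --recursive \
    --exclude "*" --include "${SRA}.*"
done

Usage would then simply be grabSRA.sh SRR1234567, pulling that accession's cov contigs, coronaSPAdes outputs, and annotations into the current directory.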

ababaian avatar Dec 09 '20 20:12 ababaian

it's all staged in s3://serratus-rayan/lovelywater/assembly; please have a look before transferring to lovelywater.

Name Size
annotation/ 73.8 GB
cov/ 169.2 MB
contigs/ 4.0 TB

rchikhi avatar Jan 20 '21 14:01 rchikhi

TODO for me next:

  • quenya, dicistro, satellites CS assemblies into contigs/
  • update the data access page

rchikhi avatar Jan 20 '21 22:01 rchikhi

The README.md in the top-level of lovelywater is out-of-sync with the bucket directory structure.

taltman avatar Jan 22 '21 21:01 taltman

Most recent version is always on the Data Access Page

ababaian avatar Jan 22 '21 21:01 ababaian

That page is also inconsistent. In Naming Conventions, it uses s3://lovelywater/contig/SRA123456.fa as an example. In the Folder Organization section there is no such contig folder, and there is no such directory in the bucket (as far as I can see).

taltman avatar Jan 22 '21 22:01 taltman

The assembly data has not been migrated onto it yet; once that's done, this issue can be closed.

edit: updated the access page to reflect situation on the ground

ababaian avatar Jan 22 '21 22:01 ababaian

Satellites assemblies have been migrated to s3://serratus-rayan/lovelywater/assembly/contigs, i.e. the same location as the other CoV assembly data. For some reason, I can't find the satellites' scaffolds.fasta files; only the gene_clusters.fasta are present. I tend to think I might never have copied scaffolds.fasta to S3 (likely due to a past bug that has recently been fixed), and it's likely that we were only interested in gene_clusters.fasta during the satellite analysis.

rchikhi avatar Feb 27 '21 21:02 rchikhi

c'est la vie. Is this the complete collection of assemblies then?

ababaian avatar Feb 27 '21 21:02 ababaian

nope, i'm in the process of moving dicistro/quenya assemblies too, will let you know when it's over

rchikhi avatar Feb 27 '21 21:02 rchikhi

done! dicistro, quenya, satellites assemblies are copied.

Total number of accessions assembled in s3://serratus-rayan/lovelywater/assembly/contigs: 56,071. Total size of s3://serratus-rayan/lovelywater/: 4.9 TB, of which:

  • scaffolds from CoV assemblies (MFC-compressed): 0.9 TB
  • scaffolds from other assemblies (gzip-compressed): 0.2 TB
  • assembly graphs (gzip-compressed): 1.6 TB (these could be deleted, but keeping them would make it quick to regenerate assemblies, e.g. after a coronaSPAdes update, or to recover the missing scaffolds.fasta files)

  • Darth annotations of CheckV-filtered gene_clusters (gzip-compressed): 2.0 TB

Some of those Darth archives somehow made their way into the contigs/ folder. Among these, some contain a huge BAM file of reads aligned to contigs, hence the space usage; the BAM was needed for quality control. They could be deleted, since for each of them there is another gzip file without the BAM. Two options:

  1. delete the large BAM-containing Darth archives and move the small ones into the annotation/ folder
  2. keep everything and move all the Darth files to the annotation/ folder

Any preference?

rchikhi avatar Feb 28 '21 10:02 rchikhi

Also there is the 1k subset of accession assemblies found by the .pro analysis, wanna include it?

rchikhi avatar Feb 28 '21 11:02 rchikhi

yes

ababaian avatar Feb 28 '21 18:02 ababaian

1ksubset: migration done

rchikhi avatar Feb 28 '21 19:02 rchikhi

after some Slack discussions:

  • darth data inside contigs/ has been deleted, as it's mainly redundant with the data already in annotation/, except for the huge BAM files.
  • serratax/serraplace stuff inside contigs/ has been moved to annotation/

so I think we're done

rchikhi avatar Mar 01 '21 08:03 rchikhi

hold on, i'll also move checkV analysis from contigs/ to annotation/

rchikhi avatar Mar 01 '21 08:03 rchikhi

done! Here's the final content of

s3://lovelywater/     # A Read-Only Archive of Serratus Data Releases
├── assembly/         # Viral assembly and annotation data
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...
│   └─── annotation/  # CoV annotation and taxonomic assignments

as staged in s3://serratus-rayan/lovelywater/assembly/.

assembly/cov:

These are the 11,120 coronavirus assemblies made with coronaSPAdes, where contigs have been filtered either using CheckV or using coronaSPAdes' bgc-statistics. See Serratus' manuscript for more details.

assembly/contigs:

SRRXXXXXX.[assembler].assembly_graph_with_scaffolds.gfa.gz
SRRXXXXXX.[assembler].bgc_statistics.txt
SRRXXXXXX.[assembler].contigs.fa.mfc
SRRXXXXXX.[assembler].domain_graph.dot
SRRXXXXXX.[assembler].gene_clusters.fa
SRRXXXXXX.[assembler].scaffolds.fasta.gz
SRRXXXXXX.[assembler].scaffolds.paths
SRRXXXXXX.[assembler].log
SRRXXXXXX.[assembler].txt

All of these are [assembler] outputs, where [assembler] is either coronaSPAdes or rnaviralSPAdes. Depending on the assembler, a subset of these files will be present for each accession. Beware: contigs.fa.mfc actually contains the content of coronaSPAdes' scaffolds.fasta compressed with MFCompress.
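
For example, unpacking one accession's contigs (a sketch, not part of the release; the MFCompress decompressor is assumed here to be invoked as MFCompressD with -o for the output path, which should be checked against the tool's own help):

# Pull one accession's coronaSPAdes outputs and unpack both compression formats
aws s3 cp s3://lovelywater/assembly/contigs/ . --recursive \
  --exclude "*" --include "SRRXXXXXX.coronaspades.*"
gunzip SRRXXXXXX.coronaspades.scaffolds.fasta.gz
# .mfc holds the scaffolds.fasta content compressed with MFCompress (see note above);
# the binary name and -o flag are assumptions about that tool's CLI
MFCompressD -o SRRXXXXXX.coronaspades.contigs.fa \
  SRRXXXXXX.coronaspades.contigs.fa.mfc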

assembly/annotation:

This folder contains the annotation results of several programs applied to different inputs.

CheckV applied to the scaffolds.fasta and/or gene_clusters.fasta:

SRRXXXXXX.[assembler].checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].checkv.quality_summary.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.completeness.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.contamination.tsv.gz
SRRXXXXXX.[assembler].gene_clusters.checkv.quality_summary.tsv.gz
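
A quick way to peek at one of these summaries without keeping a local copy (a sketch; the file name follows the pattern listed above):

# Stream a CheckV quality summary straight from S3 and show the first lines
aws s3 cp s3://lovelywater/assembly/annotation/SRRXXXXXX.coronaspades.checkv.quality_summary.tsv.gz - \
  | gzip -dc | head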

serraplace (taxonomic placement) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serraplace.tar.gz
SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.final

serratax (taxonomic identification) output of CheckV-filtered gene clusters:

SRRXXXXXX.[assembler].gene_clusters.checkv_filtered.fa.serratax.tar.gz

Then, the following are annotations of the assemblies in cov/. They include the outputs of Darth, a pipeline created within Serratus for annotation of coronavirus assemblies.

SRRXXXXXX.fa.darth.alignments.fasta
SRRXXXXXX.fa.darth.alignments.sto
SRRXXXXXX.fa.darth.input_md5
SRRXXXXXX.fa.darth.stripped.tar.gz
SRRXXXXXX.fa.darth.tar.gz
SRRXXXXXX.fa.darth.transeq.alignments.fasta
SRRXXXXXX.fa.serraplace.tar.gz
SRRXXXXXX.fa.serratax.final
SRRXXXXXX.fa.serratax.tar.gz

rchikhi avatar Mar 02 '21 20:03 rchikhi

I'll begin data migration shortly!

ababaian avatar Mar 02 '21 20:03 ababaian

Take a look at s3://lovelywater/assembly/ and let me know if that looks alright.

Also updated the

If that looks good then close this baby!

ababaian avatar Mar 04 '21 22:03 ababaian

What's the status on this? Should I be pulling data from s3://serratus-rayan/lovelywater/assembly/cov/ or s3://lovelywater/assembly/cov/?

taltman avatar Mar 10 '21 01:03 taltman

either is fine; they are identical. Migration is now complete. I think we're good to close this @rchikhi

ababaian avatar Mar 10 '21 02:03 ababaian

Same number of files and size as my folder, looks good

Total Objects: 671859
   Total Size: 3.2 TiB

rchikhi avatar Mar 10 '21 11:03 rchikhi

so, this issue is closed, yet I noticed today that we never deleted anything off the original location s3://serratus-public/assemblies (though the staged location s3://serratus-rayan/lovelywater did get correctly cleared). The original location still contains all the migrated data plus some other, less useful, non-migrated accessions, like those with partially failed assemblies, a few minia assemblies that coronaspades didn't assemble, etc. I see 48268 coronaspades assemblies on lovelywater and 51756 coronaspades folders on serratus-public (possibly empty in some cases). @ababaian, a few options:

  1. delete from s3://serratus-public/assemblies only the migrated stuff
  2. delete everything from s3://serratus-public/assemblies
  3. keep s3://serratus-public/assemblies for some reason

I'd go for 1)

rchikhi avatar Jun 25 '21 21:06 rchikhi

One consideration is that serratus-public currently has version control enabled, so you have to do a two-pass deletion (delete the file, then delete its version history) to actually remove data. We do need to do this, but I've been delaying until the paper is "done" so we don't whoopsy and lose some data we need. My take: let's go with (2) once the paper is done. I'll reopen the issue.
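
For reference, a minimal sketch of the second pass, assuming the standard AWS CLI s3api calls (not an existing Serratus script); delete markers would need the same treatment via the DeleteMarkers field:

# Sketch: purge every object version under a prefix of a versioned bucket.
BUCKET=serratus-public
PREFIX=assemblies/
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$PREFIX" \
  --query 'Versions[].[Key,VersionId]' --output text |
while read -r key version; do
  aws s3api delete-object --bucket "$BUCKET" --key "$key" --version-id "$version"
done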

ababaian avatar Jun 29 '21 12:06 ababaian