
Convert Stockholm format MSAs to more efficient a3m format before emitting to publishdir

Open tlitfin-unsw opened this issue 4 months ago • 2 comments

Description of feature

Stockholm-format MSAs are extremely bulky and leave a large disk footprint, particularly the uniprot.sto files generated for pairing sequences in AlphaFold2 multimer. It might be nice to convert MSA files to a3m format before emitting them to publishDir.
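To make the proposal concrete, here is a minimal sketch of what a Stockholm-to-a3m conversion could look like. It is not the pipeline's or AlphaFold2's implementation. It assumes the first sequence in the file is the query and follows the usual a3m convention: columns where the query has a residue are kept as uppercase or `-`, while residues in columns where the query has a gap are written in lowercase and gaps in those columns are dropped. The function name `stockholm_to_a3m` is hypothetical, and `#=GS` descriptions are ignored for brevity.

```python
def stockholm_to_a3m(sto_text: str) -> str:
    """Convert Stockholm text to a3m, treating the first sequence as the query."""
    sequences: dict[str, str] = {}
    for line in sto_text.splitlines():
        line = line.strip()
        # Skip blank lines, markup lines (# STOCKHOLM, #=GS, #=GC, ...) and the terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        name, aligned = line.split(maxsplit=1)
        # Interleaved Stockholm files repeat names across blocks, so concatenate.
        sequences[name] = sequences.get(name, "") + aligned.replace(" ", "")

    if not sequences:
        return ""

    # Columns where the query has a residue are "match" columns.
    query = next(iter(sequences.values()))
    is_match = [res not in "-." for res in query]

    records = []
    for name, aligned in sequences.items():
        out = []
        for res, match in zip(aligned, is_match):
            if match:
                # Match column: keep the residue (uppercase) or a gap.
                out.append("-" if res in "-." else res.upper())
            elif res not in "-.":
                # Insert column: keep residues in lowercase, drop gaps.
                out.append(res.lower())
        records.append(f">{name}\n{''.join(out)}")
    return "\n".join(records) + "\n"


# Tiny made-up alignment, for illustration only.
example = """# STOCKHOLM 1.0
query   MK-TL
hit1    MKATL
hit2    M--TL
//
"""

print(stockholm_to_a3m(example), end="")
```

Most of the size saving in a3m comes from dropping gap characters in insert columns, which can make up a large share of a deep Stockholm alignment. Note that the conversion discards Stockholm annotations.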

tlitfin-unsw avatar Sep 02 '25 04:09 tlitfin-unsw

Maybe a compressed archive would be a better solution, since it preserves the original data format.

tlitfin-unsw avatar Sep 25 '25 06:09 tlitfin-unsw

A quick fix we used is to add a compression step via afterScript. A similar step could be added to the module's main.nf.

    withName: 'RUN_ALPHAFOLD2_MSA|RUN_ALPHAFOLD2' {
        afterScript = """
            find . -type f -name '*.sto' -exec zstd -19 --rm {} \\;
        """
    }

jscgh avatar Oct 18 '25 03:10 jscgh