
Remove whitespace indents from large Auspice JSONs

Open fanninpm opened this issue 2 years ago • 1 comments

Description of proposed changes

This PR reduces the size of the main Auspice JSONs by removing all horizontal indentation from them.

Related issue(s)

No related issues.

Testing

I tested this with two local builds on my machine (called ohio and usa). For a fair side-by-side comparison, I used Python's json.tool module to write copies of the main JSONs with the indentation removed:

$ python3.10 -m json.tool --indent 0 auspice/ncov_ohio.json auspice/ncov_ohio_test.json
$ python3.10 -m json.tool --indent 0 auspice/ncov_usa.json auspice/ncov_usa_test.json

This resulted in the following space savings (give or take a trailing newline); the *_test.json files are the de-indented copies:

$ ls -l auspice/
total 57768
-rw-r--r-- 1 fanninpm fanninpm  2857736 May  8 02:57 ncov_ohio.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:57 ncov_ohio_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm   641215 May  8 15:07 ncov_ohio_test.json
-rw-r--r-- 1 fanninpm fanninpm   197365 May  8 02:57 ncov_ohio_tip-frequencies.json
-rw-r--r-- 1 fanninpm fanninpm 47854799 May  8 02:45 ncov_usa.json
-rw-r--r-- 1 fanninpm fanninpm    39894 May  8 02:45 ncov_usa_root-sequence.json
-rw-r--r-- 1 fanninpm fanninpm  5729468 May  8 10:57 ncov_usa_test.json
-rw-r--r-- 1 fanninpm fanninpm  1776605 May  8 02:45 ncov_usa_tip-frequencies.json

This comes out to roughly 78% space savings for the ohio build (2,857,736 to 641,215 bytes) and 88% for the usa build (47,854,799 to 5,729,468 bytes).
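The percentages can be re-derived from the byte counts in the listing above. This is only a small sketch of the arithmetic; the helper name space_savings is hypothetical and not part of the workflow.

```python
def space_savings(original_bytes: int, stripped_bytes: int) -> float:
    """Return the percentage by which the stripped file is smaller."""
    return 100.0 * (1 - stripped_bytes / original_bytes)


# Byte counts copied from the `ls -l` output above:
# (original main JSON, de-indented *_test.json copy).
sizes = {
    "ohio": (2857736, 641215),
    "usa": (47854799, 5729468),
}

for build, (original, stripped) in sizes.items():
    print(f"{build}: {space_savings(original, stripped):.1f}% smaller")
```

Rounded to whole percentages, the results agree with the 78% and 88% figures quoted for the two builds.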

(N.B. the --indent option I used in json.tool was added in Python 3.9, which is newer than the Python version in the nextstrain/base Docker image.)
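On interpreters older than 3.9, the same comparison can be approximated with the json module directly, since json.dumps accepts indent=0 (newlines only, no leading whitespace). The sketch below is an assumption about how one might reproduce the test files, not the change made in this PR; both function names are hypothetical.

```python
import json


def strip_indentation(text: str) -> str:
    """Re-serialize JSON text with newlines but no leading indentation."""
    data = json.loads(text)
    # indent=0 keeps one element per line but emits no leading spaces.
    # A trailing newline is appended, as json.tool does.
    return json.dumps(data, indent=0) + "\n"


def rewrite_without_indentation(src_path: str, dst_path: str) -> None:
    """Read a JSON file and write a de-indented copy (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(strip_indentation(text))
```

For example, rewrite_without_indentation("auspice/ncov_ohio.json", "auspice/ncov_ohio_test.json") would play the role of the first json.tool command above. Numbers pass through Python's parser and serializer, so their textual form may change slightly even though the values are preserved.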

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • [ ] Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • [ ] Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • [ ] After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • [ ] Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

fanninpm avatar May 09 '22 02:05 fanninpm

Here's my build file, in case you want to reproduce this:

my_profiles/test-data.yml
inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz

# GenBank data includes "Wuhan-Hu-1/2019" which we use as the root for this build.
refine:
  root: "Wuhan-Hu-1/2019"

builds:
  usa:
    subsampling_scheme: country_subsampling
    region: North America
    country: USA
    title: "SARS-CoV-2 Sequences in USA (2,000 focal sequences)"
  ohio:
    subsampling_scheme: division_subsampling
    region: North America
    country: USA
    division: Ohio
    title: "SARS-CoV-2 Sequences in Ohio (200 focal sequences)"

subsampling:
  country_subsampling:
    country:
      group_by: "division year month"
      max_sequences: 2000
      query: --query '(region == "{region}") & (country == "{country}")'
    contextual:
      group_by: "country year month"
      max_sequences: 1000
      query: --query '(region == "{region}") & (country != "{country}")'
      priorities:
        type: proximity
        focus: country
    global:
      group_by: "country year month"
      max_sequences: 500
      query: --query 'region != "{region}"'
  division_subsampling:
    division:
      group_by: "year month"
      max_sequences: 200
      query: --query '(region == "{region}") & (country == "{country}") & (division == "{division}")'
    contextual:
      group_by: "country year month"
      max_sequences: 100
      query: --query 'division != "{division}"'
      priorities:
        type: proximity
        focus: division
    global:
      group_by: "country year month"
      max_sequences: 50
      query: --query 'division != "{division}"'

fanninpm avatar May 09 '22 02:05 fanninpm