# Use weighted sampling for Asia builds
### Description of proposed changes
This replaces the Asia/China/India split with population-based weighted sampling (made possible by https://github.com/nextstrain/augur/pull/1454).
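For reference, the weights file consumed by `--group-by-weights` is a TSV mapping group values to relative weights. A hypothetical excerpt (illustrative values, not the actual contents of `defaults/population_weights.tsv`):

```
country	weight
India	1417
Indonesia	275
Maldives	0.5
```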
### Previews

### Analysis
I think a good comparison is `gisaid/asia/all-time` before and after weighted sampling. Some notes from that comparison:
- With weighted sampling, more sequences are included for large countries such as Indonesia, which did not get the same dedicated treatment as China and India prior to this PR.
- With weighted sampling, there is a high chance that no sequences are included for small countries such as Maldives.
- The sampling difference results in subtle but potentially significant differences in the frequencies panel.
Caveat: weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries will still be under-sampled even if they have a high weight, which results in significantly fewer total sequences than requested by `--subsample-max-sequences`. This was already the case with the uniform sampling approach prior to this PR, but it may be more noticeable now with large countries that are under-sampled.
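To make the caveat concrete, here is a toy sketch (not Augur's implementation; the population weights and availability numbers are invented) showing how targets computed without regard to availability can both zero out small countries and leave the total well short of the request:

```python
# Toy illustration of the caveat above; all numbers are invented.
requested = 1000
weights = {"India": 1417, "Indonesia": 275, "Maldives": 0.5}  # ~populations in millions
available = {"India": 120, "Indonesia": 300, "Maldives": 40}  # sequences in metadata

total_weight = sum(weights.values())
targets = {c: round(requested * w / total_weight) for c, w in weights.items()}
sampled = {c: min(t, available[c]) for c, t in targets.items()}

print(targets)                # {'India': 837, 'Indonesia': 162, 'Maldives': 0}
print(sum(sampled.values()))  # 282, far short of the 1000 requested
```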
### Questions for reviewers
- What can be done to check that the sampling results are appropriate?
- ✅ Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data. (population size yes, case counts no)
### Checklist
- [ ] Sampling results look ok?
- [ ] Drop 1afb6d7e5278bad4a69801411d791d5b2cd7d5e8
- [ ] Merge+release https://github.com/nextstrain/augur/pull/1454
- [ ] Post-merge: consider this for other region builds?
### Release checklist
If this pull request introduces new features, complete the following steps:
- [ ] Update `docs/src/reference/change_log.md` in this pull request to document these changes by the date they were added.
Not ready to be merged but opening for initial review and feedback to shape implementation in https://github.com/nextstrain/augur/pull/1454.
> I was thinking per-country case counts over time might be better, but not sure where to get that data.
We ingest country case counts from OWID in forecasts-ncov and upload them to `s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz`.
> Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data.
Case counts no longer mean anything and haven't really meant anything since early 2022. We can see this today in the parsed case counts file `country_week_weights.tsv`, where Afghanistan with a population of 41M reported 638 cases in the week of (2024, 20), while Brunei with a population of 500k reported 175 cases in the same time period. That is 1.5 cases per 100k in Afghanistan and 35 cases per 100k in Brunei. This 23-fold difference is going to be almost entirely reporting rate differences.
Similarly, South Korea has 135,331 cases listed in your TSV for (2024, 20). This is 265 cases per 100k, and so a 176-fold difference in reporting rate relative to Afghanistan.
If you do subsampling based on reported cases it will strongly bias included samples to (wealthier) countries with better surveillance.
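For transparency, the arithmetic behind those per-100k figures and ratios (the South Korea population of ~51M is my assumption, back-calculated from the 265 per 100k quoted above):

```python
# Cases per 100k for week (2024, 20), using the numbers quoted above.
cases_and_pop = {
    "Afghanistan": (638, 41_000_000),
    "Brunei": (175, 500_000),
    "South Korea": (135_331, 51_000_000),  # population is an assumption
}
per_100k = {c: n / pop * 100_000 for c, (n, pop) in cases_and_pop.items()}
print({c: round(v, 1) for c, v in per_100k.items()})
# {'Afghanistan': 1.6, 'Brunei': 35.0, 'South Korea': 265.4}
print(round(per_100k["Brunei"] / per_100k["Afghanistan"]))       # 22 (the 23-fold above uses the rounded 1.5)
print(round(per_100k["South Korea"] / per_100k["Afghanistan"]))  # 171 (the 176-fold above uses the rounded 1.5)
```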
For ongoing ncov builds, we should be weighting on population size. I wouldn't worry about admin-division population sizes and would just use country-level population sizes for weighting.
I'd start by rolling back 6b6cd2b5cfb96b5a4eaa3a7fdfc9585a6a367e8c. Ah! I see this was already discussed here: https://github.com/nextstrain/ncov/pull/1106#discussion_r1607411279. Can you update the PR with what I should be working from for review?
Following from this, I worked from 1afb6d7e5278bad4a69801411d791d5b2cd7d5e8 and tried

```
python3 scripts/get_population_sizes.py --output data/country_population_weights.tsv
```

from a freshly updated `nextstrain shell` and got the following error:
```
Traceback (most recent call last):
  File "/nextstrain/build/scripts/get_population_sizes.py", line 67, in <module>
    export_imf(args.output)
  File "/nextstrain/build/scripts/get_population_sizes.py", line 13, in export_imf
    workbook = xlrd.open_workbook(file_contents=excel_contents, ignore_workbook_corruption=True)
TypeError: open_workbook() got an unexpected keyword argument 'ignore_workbook_corruption'
```
and if I drop `ignore_workbook_corruption=True` on line 13 of the script, I get an error about a corrupted workbook:

```
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
```
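My guess, and it is only an assumption: the `TypeError` means the xlrd inside the runtime predates the release that added the `ignore_workbook_corruption` keyword (introduced, as far as I know, specifically to tolerate this kind of "seen[2] == 4" corruption). A quick check of what's installed:

```python
# Check the installed xlrd version and whether open_workbook() supports the
# keyword the script relies on.
import inspect
from importlib.metadata import version

import xlrd

print(version("xlrd"))
print("ignore_workbook_corruption" in inspect.signature(xlrd.open_workbook).parameters)
```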
https://www.cia.gov/the-world-factbook/field/population/country-comparison/#:~:text=DOWNLOAD-,DATA,-Rank looks potentially cleaner.
Another suggestion here: my preference for this sort of thing is to version the data. It's much easier to think about and it's not something that needs to be updated day-to-day. I'd just run a script to prepare `country_population_weights.tsv` once and not have it be part of the workflow. I'd version this to `defaults/country_population_weights.tsv`. This particular file is very similar to `defaults/mutational_fitness_distance_map`. In that case I had used `scripts/developer_scripts/parse_mutational_fitness_tsv_into_distance_map`, with the `developer_scripts/` placement to signal that this is not part of the regular workflow.
I've dropped 6b6cd2b5cfb96b5a4eaa3a7fdfc9585a6a367e8c and updated f038890d04863b92d0da2db44fc798b512510e8f to use The World Factbook as the source of population data. I also moved the script and data files into version control (great suggestion).
I'm going to run this through the trial builds to check if the updated data source contains the right country names.
Cleaned up the commits and updated trial build links in the PR description. Everything is still draft with a few lingering FIXMEs in the changes. Next week I plan to shift back to https://github.com/nextstrain/augur/pull/1454 and tweak the implementation logic before coming back here.
I really like where this is going! However, I ran into an error when trying to run this locally, in this filter step:

```
augur filter --metadata nextstrain-ncov-private/100k/metadata.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 6M \
  --exclude-where 'region!=Asia' \
  --group-by country year month \
  --group-by-weights defaults/population_weights.tsv \
  --subsample-max-sequences 700 \
  --output-strains results/asia_6m/sample-asia_early.txt
```
This errors with:

```
WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'year', 'month'. Using equal weights across values in those columns.
ERROR: 27 groups appear in the metadata but are missing from the weights file. Re-run with --output-group-by-missing-weights to continue.
```
If I then re-run with `--output-group-by-missing-weights missing-weights.txt` added, it completes successfully with this standard output:
```
WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'month', 'year'. Using equal weights across values in those columns.
WARNING: 27 groups appear in the metadata but are missing from the weights file. Sequences from these groups will be dropped.
All missing groups added to 'missing-weights.txt'.
91656 strains were dropped during filtering
    69341 were dropped because of 'region!=Asia'
    4791 were dropped because they were later than 2024.09 or missing a date
    1 was added back because it was in defaults/include.txt
    17525 were dropped because of subsampling criteria
695 strains passed all filters
```
The missing weights file looks like:

```
country	year	month	weight
Myanmar	2020	(2020, 4)
Myanmar	2020	(2020, 8)
Myanmar	2021	(2021, 1)
```
I can see that Myanmar is missing from `population_weights.tsv`. Two things:

- I think that ncov needs to have `--output-group-by-missing-weights`
- Double-check entries in `population_weights.tsv` to make sure all countries in our metadata are present.
I think the behavior of erroring out is maybe not the best, but I can discuss this over in https://github.com/nextstrain/augur/pull/1454.
@trvrb thanks for taking it for a spin, sorry that you ran into another error 😞
> the behavior of erroring out is maybe not the best, but I can discuss this over in https://github.com/nextstrain/augur/pull/1454.
I think it's appropriate to discuss here, since your scenario is a good example.
The erroring behavior is what I intended. The weights file should be comprehensive, otherwise sequences get dropped due to lack of weights. For this PR, that means `population_weights.tsv` should have weights for everything under `region=Asia`. I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).
The proper fix here is to update the country name mapping in `get_population_weights` with `"Burma": "Myanmar"` (your 2nd point).
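For illustration, a minimal sketch of what that mapping fix might look like; the dictionary and helper names are hypothetical, and only the `"Burma": "Myanmar"` entry comes from this discussion:

```python
# Hypothetical sketch: normalize source country names to the spellings used
# in ncov metadata before writing the weights file.
COUNTRY_NAME_FIXES = {
    "Burma": "Myanmar",  # the population source uses the former name
}

def normalize_country(name: str) -> str:
    """Map a source country name to its ncov metadata spelling."""
    return COUNTRY_NAME_FIXES.get(name, name)

assert normalize_country("Burma") == "Myanmar"
assert normalize_country("Indonesia") == "Indonesia"
```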
> I think that ncov needs to have `--output-group-by-missing-weights`
I think differently. `--output-group-by-missing-weights` is meant more as an immediate workaround than as something to use by default. If the ncov workflow uses this option by default, it's less likely that we'd catch any discrepancies, since WARNINGs in output are not something we go back and check regularly.

Example scenario: right now we know `population_weights.tsv` is comprehensive, but sometime in the future new sequences are added under a country name that differs from what's in `population_weights.tsv`. This would still produce "successful" automated builds, and one would only notice that those new sequences are being dropped if the Snakemake logs are inspected.
> I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).
I re-ran rebuild-gisaid.yml and it failed with the same error. I suspect that I ran the last trial build before pushing the error handling updates, so it silently "succeeded" by dropping all Myanmar sequences.
The error message has been improved:

```
ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']
Re-run with --output-group-by-missing-weights to continue.
```
This confirms that Myanmar is the only country missing from `population_weights.tsv`. I'll push a fix to add a Myanmar weight, re-run the trial build, and update the preview links in the PR description.
Thanks for the feedback @victorlin. I have a couple thoughts:
- In a decent fraction of scenarios, I believe I'm going to want a "default" value for `population_weights.tsv`. As a simple example, I might want to specify `--group-by country` and `--group-by-weights weights.tsv` with entries of

  ```
  country	weight
  USA	100
  default	1
  ```

  where I don't want to have to semi-manually include every possible country with a weight of 1, just to be able to specify that I want to sample more intensively from the USA than from other countries. In many situations you'll have a category like `host` where you won't be sure what exactly will come through and you can't enumerate the values ahead of time. If we were to add weighted sampling here https://next.nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome we'd run into random new hosts continually breaking the build. I'd want a simple default to avoid this erroring.
  However, I do totally see why you'd want to require an explicit `default` value rather than having an implicit default weight for entries in metadata that aren't present in `weights.tsv`.
- It feels a little strange to tie "don't error out on missing values" to `--output-group-by-missing-weights`; they seem semantically distinct to me. Since including `--output-group-by-missing-weights missing-weights.txt` is what prevents erroring out, I would want to always include this option so that the build wouldn't error (rather than using it as a one-off strategy to find and fix missing values).
My suggestion would be to just drop `--output-group-by-missing-weights` as an option. If there are missing values they should always be printed. I really like:

```
ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']
```

This also slightly slims down the very long list of `augur filter` command line arguments; there's already a bunch of `output-` options. And it stops the user from having to run the command again with an additional argument.
So, the scenarios would be:

- Missing values in `weights.tsv` with no `default` set – error out, and print the missing values that need to be specified to standard out.
- Missing values in `weights.tsv` with `default` set – proceed, but still print the missing values to standard out (see the sketch below).
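For concreteness, a minimal sketch of the lookup behavior these two scenarios describe (assumed semantics, not Augur's actual implementation):

```python
# Sketch of weight resolution with an optional 'default' row; assumed
# semantics, not Augur's code.
def resolve_weight(weights: dict, value: str) -> float:
    if value in weights:
        return weights[value]
    if "default" in weights:
        # Proceed, but still surface the missing value so it can be fixed.
        print(f"WARNING: no explicit weight for {value!r}; using default.")
        return weights["default"]
    raise SystemExit(f"ERROR: {value!r} is not covered by the weights file.")

weights = {"USA": 100, "default": 1}
assert resolve_weight(weights, "USA") == 100
assert resolve_weight(weights, "Myanmar") == 1  # warns, then proceeds
```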
@trvrb thanks for the example use case and suggestion – both were insightful. I've implemented this in the Augur PR in https://github.com/nextstrain/augur/commit/a00d3b52037887da49fe41d15de4430c507ed267.
For the example here, the scenarios now look like:

- Without a default weight:

  ```
  ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
  - 'country': ['Myanmar']
  To fix this, either:
  (1) specify weights explicitly - add entries to 'defaults/population_weights.tsv' for the values above, or
  (2) specify a default weight - add an entry to 'defaults/population_weights.tsv' with the value 'default' for all columns
  ```

- With a default weight:

  ```
  WARNING: The input metadata contains these values under the following columns that are not directly covered by 'defaults/population_weights.tsv':
  - 'country': ['Myanmar']
  The default weight of 1 will be used for all groups defined by those values.
  ...
  ```
I'd suggest a trial build just to make sure nothing is broken.
Rebuilding the docker image to run trial builds now. Once that's done, I'll update the links in the PR description. I'll also take a look and note if anything seems off compared to the previous run.
> And then maybe doing a second PR to extend weighted sampling to other regions?
I've created an issue to track rollout of weighted sampling in this workflow: #1141