ncov icon indicating copy to clipboard operation
ncov copied to clipboard

Use weighted sampling for Asia builds

Open victorlin opened this issue 9 months ago • 6 comments

Description of proposed changes

This replaces the Asia/China/India split with population-based weighted sampling (possible with https://github.com/nextstrain/augur/pull/1454).

Previews

Analysis

I think a good comparison is gisaid/asia/all-time before and after weighted sampling. Some notes from that comparison:

  1. With weighted sampling, more sequences are included for large countries such as Indonesia which did not get the same treatment as China and India prior to this PR.
  2. With weighted sampling, there is a high chance that no sequences are included for small countries such as Maldives.
  3. The sampling difference results in some subtle but maybe significant differences in the frequencies panel.

Caveat: weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries will still be under-sampled even if it has a high weight, and results in significantly fewer total sequences than requested by --subsample-max-sequences. This is already the case with the uniform sampling approach prior to this PR, but it may be more noticeable now with large countries that are under-sampled.

Questions for reviewers

  1. What can be done to check that the sampling results are appropriate?
  2. ✅ Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data. (population size yes, case counts no)

Checklist

  • [ ] Sampling results look ok?
  • [ ] Drop 1afb6d7e5278bad4a69801411d791d5b2cd7d5e8
  • [ ] Merge+release https://github.com/nextstrain/augur/pull/1454
  • [ ] Post-merge: consider this for other region builds?

Release checklist

If this pull request introduces new features, complete the following steps:

  • [ ] Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

victorlin avatar May 03 '24 23:05 victorlin

Not ready to be merged but opening for initial review and feedback to shape implementation in https://github.com/nextstrain/augur/pull/1454.

victorlin avatar May 08 '24 17:05 victorlin

I was thinking per-country case counts over time might be better, but not sure where to get that data.

We ingest country case counts from OWID in forecasts-ncov and upload them to s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz.

joverlee521 avatar May 08 '24 23:05 joverlee521

Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data.

Case counts no longer mean anything and haven't really meant anything since early 2022. We can see this today in the parsed case counts file country_week_weights.tsv where Afghanistan with a population of 41M reported 638 cases in the week of (2024, 20), while Brunei with a population of 500k reported 175 cases in this same time period. This is 1.5 cases per 100k in Afganistran and 35 cases per 100k in Brunei. This 23-fold difference is going to be almost entirely reporting rate differences.

Similarly, South Korea has 135,331 cases listed in your TSV for (2024,20). This is 265 cases per 100k and so a 176-fold difference in reporting rate relative to Afghanistan.

If you do subsampling based on reported cases it will strongly bias included samples to (wealthier) countries with better surveillance.

For ongoing ncov builds, we should be weighting on population size. I wouldn't worry about admin division population size and would just worry about country-level population sizes in terms of weighting.

I'd start by rolling back 6b6cd2b5cfb96b5a4eaa3a7fdfc9585a6a367e8c. Ah! I see this was already discussed here https://github.com/nextstrain/ncov/pull/1106#discussion_r1607411279. Can you update PR with what I should be working from for review?

trvrb avatar Jun 06 '24 00:06 trvrb

Following from this, I worked from 1afb6d7e5278bad4a69801411d791d5b2cd7d5e8 and tried:

python3 scripts/get_population_sizes.py --output data/country_population_weights.tsv

from a freshly updated nextstrain shell command and got the following error

Traceback (most recent call last):
  File "/nextstrain/build/scripts/get_population_sizes.py", line 67, in <module>
    export_imf(args.output)
  File "/nextstrain/build/scripts/get_population_sizes.py", line 13, in export_imf
    workbook = xlrd.open_workbook(file_contents=excel_contents, ignore_workbook_corruption=True)
TypeError: open_workbook() got an unexpected keyword argument 'ignore_workbook_corruption'

and if I drop ignore_workbook_corruption=True in line 13 of the script I get an error about a corrupted workbook:

xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

https://www.cia.gov/the-world-factbook/field/population/country-comparison/#:~:text=DOWNLOAD-,DATA,-Rank looks potentially cleaner.

Another suggestion here: my preference for this sort of thing is to version the data. It's much easier to think about and it's not something that needs to be updated day-to-day. I'd just run a script to prepare country_population_weights.tsv once and not have it be part of the workflow. I'd version this to defaults/country_population_weights.tsv. This particular file is very similar to defaults/mutational_fitness_distance_map. In that case I had used scripts/developer_scripts/parse_mutational_fitness_tsv_into_distance_map, with the developer_scripts/ placement to signal that this is not part of the regular workflow.

trvrb avatar Jun 06 '24 00:06 trvrb

I've dropped 6b6cd2b5cfb96b5a4eaa3a7fdfc9585a6a367e8c and updated f038890d04863b92d0da2db44fc798b512510e8f to use "the world factbook" as source of population data. Also moved the script and data files into version control (great suggestion).

I'm going to run this through the trial builds to check if the updated data source contains the right country names.

victorlin avatar Jun 06 '24 22:06 victorlin

Cleaned up the commits and updated trial build links in the PR description. Everything is still draft with a few lingering FIXMEs in the changes. Next week I plan to shift back to https://github.com/nextstrain/augur/pull/1454 and tweak the implementation logic before coming back here.

victorlin avatar Jun 07 '24 23:06 victorlin

I really like where this is going! However, I ran into an error when trying to run this locally. In this filter step:

augur filter --metadata nextstrain-ncov-private/100k/metadata.tsv.xz
  --include defaults/include.txt
  --exclude defaults/exclude.txt
  --max-date 6M
  --exclude-where 'region!=Asia'
  --group-by country year month
  --group-by-weights defaults/population_weights.tsv
  --subsample-max-sequences 700
  --output-strains results/asia_6m/sample-asia_early.txt
WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'year', 'month'. Using equal weights across values in those columns.
ERROR: 27 groups appear in the metadata but are missing from the weights file. Re-run with --output-group-by-missing-weights to continue.

If I then run with adding --output-group-by-missing-weights missing-weights.txt it completes successfully with standard out:

WARNING: Asked to provide at most 700 sequences, but there are 1601 groups.
Sampling with weights defined by defaults/population_weights.tsv.
NOTE: Skipping 101242 groups due to lack of entries in metadata.
NOTE: Weights were not provided for the columns 'month', 'year'. Using equal weights across values in those columns.
WARNING: 27 groups appear in the metadata but are missing from the weights file. Sequences from these groups will be dropped.
All missing groups added to 'missing-weights.txt'.
91656 strains were dropped during filtering
	69341 were dropped because of 'region!=Asia'
	4791 were dropped because they were later than 2024.09 or missing a date
	1 was added back because it was in defaults/include.txt
	17525 were dropped because of subsampling criteria
695 strains passed all filters

The missing weights file looks like:

country	year	month	weight
Myanmar	2020	(2020, 4)	
Myanmar	2020	(2020, 8)	
Myanmar	2021	(2021, 1)	

I can see that Myanmar is missing from population_weights.tsv. Two things:

  1. I think that ncov needs to have --output-group-by-missing-weights
  2. Double check entries in population_weights.tsv to make sure all countries in our metadata are present.

I think the behavior off erroring out is maybe not the best, but I can discuss this over in https://github.com/nextstrain/augur/pull/1454.

trvrb avatar Aug 02 '24 23:08 trvrb

@trvrb thanks for taking it for a spin, sorry that you ran into another error 😞

the behavior off erroring out is maybe not the best, but I can discuss this over in https://github.com/nextstrain/augur/pull/1454.

I think appropriate to discuss here since your scenario is a good example.

The erroring behavior is what I intended. The weights file should be comprehensive, otherwise sequences get dropped due to lack of weights.

For this PR, that means population_weights.tsv should have weights for everything under region=Asia. I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

The proper fix here is to update the country name mapping in get_population_weights with "Burma": "Myanmar" (your 2nd point).

I think that ncov needs to have --output-group-by-missing-weights

I think differently. --output-group-by-missing-weights is meant more as an immediate workaround rather than something to use by default. If the ncov workflow uses this option by default, it's less likely that we'd catch any discrepancies since WARNINGs in output are not something we go back and check regularly.

Example scenario: right now we know population_weights.tsv is comprehensive, but sometime in the future new sequences are added under a country name that differs from what's in population_weights.tsv. This would still produce "successful" automated builds and one would only notice that those new sequences are being dropped if the Snakemake logs are inspected.

victorlin avatar Aug 05 '24 21:08 victorlin

I assumed running rebuild-gisaid.yml would be sufficient to test this, but that didn't include sequences from Myanmar (not sure why, I'll investigate separately).

I re-ran rebuild-gisaid.yml and it failed with the same error. I suspect that I ran the last trial build before pushing error handling updates so it silently succeeded by dropping all Myanmar sequences.

The error message has been improved:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']
Re-run with --output-group-by-missing-weights to continue.

This confirms that Myanmar is the only country missing from population_weights.tsv. I'll push a fix to add Myanmar weight, re-run trial build, and update preview links in the PR description.

victorlin avatar Aug 06 '24 18:08 victorlin

Thanks for the feedback @victorlin. I have a couple thoughts:

  1. In decent fraction of scenarios, I believe I'm going to want a "default" value for population_weights.tsv. As a simple example, I might want to specify --group-by country and --group-by-weights weights.tsv with entries of
country weight
USA     100
default 1

where I don't want to have to semi-manually make sure to include every possible country as 1, just to be able to specify that I want sampling more intensively from the USA than other countries. In many situations you'll have a category like host where you won't be sure what exactly will come through and you can't enumerate ahead of time. If we were to add weighted sampling here https://next.nextstrain.org/avian-flu/h5n1-cattle-outbreak/genome we'd run into random new hosts continually breaking the build. I'd want a simple default to avoid this erroring.

However, I do totally see why you'd want to require an explicit default value rather than having an implicit default weight for entries in metadata that aren't present in weights.tsv.

  1. It feels a little strange to tie "don't error out on missing values" to --output-group-by-missing-weights. They seem semantically distinct to me. By giving me the option of not erroring out by including --output-group-by-missing-weights missing-weights.txt I wanted to always include this option so that the build wouldn't error out (rather than as a one-off strategy to find and fix missing values).

My suggestion would be to just drop --output-group-by-missing-weights as an option. If there are missing values they should always be printed. I really like:

ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
- 'country': ['Myanmar']

This also slightly slims down the very long list of augur filter command line arguments. There's already a bunch of output- options. This also stops the user from having to run the command again with an additional argument.

So, scenarios would be

  • Missing values in weights.tsv with no default set – Error out, print to standard out the missing values that need to be specified
  • Missing values in weights.tsv with default set – Proceed, but still print to standard out the missing values

trvrb avatar Aug 08 '24 00:08 trvrb

@trvrb thanks for the example use case and suggestion – both were insightful. I've implemented in the Augur PR as https://github.com/nextstrain/augur/commit/a00d3b52037887da49fe41d15de4430c507ed267.

For the example here, the scenarios now look like:

  • ERROR: The input metadata contains these values under the following columns that are not covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    To fix this, either:
    (1) specify weights explicitly - add entries to 'defaults/population_weights.tsv' for the values above, or
    (2) specify a default weight - add an entry to 'defaults/population_weights.tsv' with the value 'default' for all columns
    
  • WARNING: The input metadata contains these values under the following columns that are not directly covered by 'defaults/population_weights.tsv':
    - 'country': ['Myanmar']
    The default weight of 1 will be used for all groups defined by those values.
    ...
    

victorlin avatar Aug 13 '24 22:08 victorlin

I'd suggest a trial build just to make sure nothing is broken.

Rebuilding the docker image to run trial builds now. Once that's done, I'll update the links in the PR description. I'll also take a look and note if anything seems off compared to the previous run.

And then maybe doing a second PR to extend weighted sampling to other regions?

I've created an issue to track rollout of weighted sampling in this workflow: #1141

victorlin avatar Aug 14 '24 18:08 victorlin