varCA

a smaller test dataset

Open aryarm opened this issue 3 years ago • 0 comments

Our current test dataset comprises all of chr1 in two different samples: the Jurkat sample and the MOLT4 cell line. It takes about an hour to run the entire pipeline with this dataset.

Ideally, we would have a dataset that runs in under 10 minutes or so. This could then be incorporated into a GitHub CI pipeline that runs automatically upon each major and minor version increment, so that we know when a change we've made to the code leads to a change in the results.
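For illustration, here is a minimal sketch of the kind of regression check such a CI job could run: compare the files from a fresh pipeline run against a stored set of expected outputs and report which ones differ. The function name `changed_outputs` and the directory layout are hypothetical, not part of varCA, and a byte-for-byte comparison is an assumption that only holds for deterministic outputs.

```python
import filecmp
from pathlib import Path


def changed_outputs(expected_dir, actual_dir):
    """Return relative paths of expected files that are missing or differ.

    Hypothetical helper: assumes the pipeline's outputs are deterministic,
    so a byte-for-byte comparison is meaningful.
    """
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    changed = []
    for expected in sorted(p for p in expected_dir.rglob("*") if p.is_file()):
        rel = expected.relative_to(expected_dir)
        actual = actual_dir / rel
        # shallow=False forces a comparison of file contents, not just metadata
        if not actual.is_file() or not filecmp.cmp(expected, actual, shallow=False):
            changed.append(rel.as_posix())
    return changed
```

An empty list would mean the results are unchanged; anything else would be a signal to inspect the listed files before tagging the release.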

  • [x] find SNVs and indels supported by all callers
  • [x] choose just one or two peaks that overlap those variants from each of the two samples
  • [x] subset the example dataset to reads that only overlap those peaks
  • [x] also try to subset the reference genome that is packaged with the example data, since the ref genome currently appears to be the largest file
  • [x] rerun the pipeline with the smaller dataset and tweak the dataset as necessary to make it run quickly
  • [ ] use snakemake --generate-unit-tests to create a bunch of tests that can be executed using pytest
    • I'm running into issues with this. It doesn't work for outputs marked as pipe and there are some problems with other directories (see snakemake/snakemake#1104)
    • [ ] fix issues and ensure test coverage is appropriate
    • [ ] remove any unnecessary tests to ensure the test directory is small and can be properly included in version history (edit: this won't be possible, after all - b/c the test directory has to include the outputs of each rule ugh)
  • [ ] (optionally) create a GitHub action like this one to execute pytest upon each major or minor version increment and confirm the tests pass successfully
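The peak-selection step in the checklist above could be sketched roughly as follows. This is not the code that was actually used to build the dataset; the function names, the `max_peaks` parameter, and the input shapes are assumptions. Peaks are taken to be BED-style intervals (0-based, half-open) and variants to be VCF-style positions (1-based).

```python
def peaks_overlapping_variants(peaks, variants, max_peaks=2):
    """Pick up to `max_peaks` peaks that contain at least one variant.

    peaks:    iterable of (chrom, start, end), BED-style (0-based, half-open)
    variants: iterable of (chrom, pos), VCF-style (1-based)
    All names here are hypothetical, for illustration only.
    """
    variants = list(variants)
    selected = []
    for chrom, start, end in peaks:
        # A 1-based position `pos` lies in [start, end) when start < pos <= end
        if any(v_chrom == chrom and start < pos <= end for v_chrom, pos in variants):
            selected.append((chrom, start, end))
            if len(selected) == max_peaks:
                break
    return selected


def to_bed(peaks):
    """Format the selected peaks as tab-separated BED lines."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in peaks)
```

The resulting BED text could then be written to a file and passed to something like `samtools view -L` to keep only reads overlapping those peaks. Subsetting the reference genome to the same regions would shift coordinates, so the reads and variants would need to be handled consistently with whatever reference is kept.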

aryarm, Jul 04 '21 17:07