
[pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files

Open edawson opened this issue 3 years ago • 0 comments

Despite the recent update to how the evaluate_paf script handles queries, the script is still too slow for large-scale CI jobs.

One solution is to ditch the interval tree data structure and rely on sorted PAF input instead. This may still take a significant amount of time for large PAF files, but it should significantly reduce memory usage: only two PAF records would need to be kept in memory at a time, whereas currently all truth set records are held in memory.
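A rough sketch of what the sorted-input approach could look like, to make the idea concrete. This is not the current evaluate_paf code: the names `PafRecord`, `read_paf`, `sort_key` and `count_matches` are made up for illustration, and it assumes both files are already sorted by (query name, target name, query start) in plain byte order, for example with `LC_ALL=C sort -k1,1 -k6,6 -k3,3n`. It also only counts exact key matches. Any positional tolerance when matching overlaps is left out and would need a small look-ahead window on top of this.

```python
from typing import Iterator, NamedTuple, Optional, TextIO, Tuple


class PafRecord(NamedTuple):
    """Hypothetical minimal view of a PAF line (only the fields used here)."""
    query_name: str
    query_start: int
    query_end: int
    target_name: str
    target_start: int
    target_end: int


def read_paf(handle: TextIO) -> Iterator[PafRecord]:
    """Lazily yield records from a tab-delimited PAF stream, one line at a time."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        # PAF columns (1-based): 1 qname, 3 qstart, 4 qend, 6 tname, 8 tstart, 9 tend
        yield PafRecord(f[0], int(f[2]), int(f[3]), f[5], int(f[7]), int(f[8]))


def sort_key(rec: PafRecord) -> Tuple[str, str, int]:
    """Assumed sort order of both inputs."""
    return (rec.query_name, rec.target_name, rec.query_start)


def count_matches(truth: TextIO, test: TextIO) -> Tuple[int, int, int]:
    """Merge-walk two sorted PAF streams.

    Returns (true_positives, false_positives, false_negatives).
    Only one truth record and one test record are held at any time.
    """
    truth_iter = read_paf(truth)
    test_iter = read_paf(test)
    t: Optional[PafRecord] = next(truth_iter, None)
    q: Optional[PafRecord] = next(test_iter, None)
    tp = fp = fn = 0

    while t is not None and q is not None:
        kt, kq = sort_key(t), sort_key(q)
        if kt == kq:
            # Same overlap present in both files.
            tp += 1
            t = next(truth_iter, None)
            q = next(test_iter, None)
        elif kt < kq:
            # Truth record has no counterpart in the test set.
            fn += 1
            t = next(truth_iter, None)
        else:
            # Test record has no counterpart in the truth set.
            fp += 1
            q = next(test_iter, None)

    # Whatever is left in either stream is unmatched.
    while t is not None:
        fn += 1
        t = next(truth_iter, None)
    while q is not None:
        fp += 1
        q = next(test_iter, None)

    return tp, fp, fn
```

Both arguments are ordinary text file handles, so something like `count_matches(open("truth.sorted.paf"), open("test.sorted.paf"))` would stream both files in a single pass (file names are placeholders). Runtime is linear in the total number of records, with the sort itself pushed out to an external tool.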

Another option would be to provide random access to bgzipped PAF files, either through TABIX or some other API.

edawson • Sep 23 '20 17:09