GenomeWorks
[pygenomeworks] evaluate_paf script is too slow to be practical for very large PAF files
Although the evaluate_paf script has been updated to handle queries better, it is still too slow for large-scale CI jobs.
One solution is to drop the interval tree data structure and instead rely on sorted PAF input. For large PAF files this may still take a significant amount of time, but it should greatly reduce memory usage: only two PAF records would need to be kept in memory at a time, whereas currently all truth set records are held in memory.
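As a rough illustration of the sorted-input idea, here is a minimal sketch of a merge-style comparison that holds one truth record and one test record at a time. It is not the current evaluate_paf logic. The names `PafKey`, `read_keys` and `compare_sorted` are hypothetical, and it makes two simplifying assumptions: both files are sorted by (query name, target name) in plain byte order, and each (query, target) pair appears at most once per file. A real version would also need to compare overlap coordinates within some tolerance.

```python
from typing import Dict, Iterator, NamedTuple, Optional, TextIO


class PafKey(NamedTuple):
    """Hypothetical sort key: PAF column 1 (query name) and column 6 (target name)."""
    query_name: str
    target_name: str


def read_keys(handle: TextIO) -> Iterator[PafKey]:
    """Lazily yield one key per PAF line; lines with fewer than 12 columns are skipped."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        yield PafKey(fields[0], fields[5])


def compare_sorted(truth: TextIO, test: TextIO) -> Dict[str, int]:
    """Merge two sorted PAF streams, holding one record from each at a time.

    Assumes both inputs are sorted by (query name, target name) and that
    each pair occurs at most once per file.
    """
    truth_iter, test_iter = read_keys(truth), read_keys(test)
    t: Optional[PafKey] = next(truth_iter, None)
    q: Optional[PafKey] = next(test_iter, None)
    counts = {"tp": 0, "fp": 0, "fn": 0}

    while t is not None and q is not None:
        if t == q:
            # Same pair in both files: true positive, advance both streams.
            counts["tp"] += 1
            t = next(truth_iter, None)
            q = next(test_iter, None)
        elif t < q:
            # Truth pair sorts first, so the test file cannot contain it.
            counts["fn"] += 1
            t = next(truth_iter, None)
        else:
            # Test pair sorts first, so the truth file cannot contain it.
            counts["fp"] += 1
            q = next(test_iter, None)

    # Whatever remains in either stream is unmatched.
    while t is not None:
        counts["fn"] += 1
        t = next(truth_iter, None)
    while q is not None:
        counts["fp"] += 1
        q = next(test_iter, None)
    return counts
```

The inputs could be pre-sorted with something like `LC_ALL=C sort -t$'\t' -k1,1 -k6,6`, which should match Python's string ordering for ASCII read names. The sort itself is not free, so this trades run time in the sort step for a constant memory footprint during comparison.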
Another option would be to provide random access to bgzipped PAF files, either through TABIX or some other API.