forte
forte copied to clipboard
`DataPack.get()` has bad performance on large packs
The current logic for .get()
(which internally calls .get_entries()
can be summarized as:
- Create sets of valid entry IDs for the given constraints:
- Entry must be of type
entry_type
; - Entry must be generated by
component
; - Entry must be within span of
range_annotation
(if coverage index exists).
- Entry must be of type
- Take the intersection of these sets, denote it as the "valid set".
- If
entry_type
is Annotation, then a faster code path exists:- Find the lower and upper bounds of entry IDs given
range_annotation
. This is done using binary search on the Annotation index. - Iterate over all entries within range, further check its span, and yield those that satisfy the check.
- Find the lower and upper bounds of entry IDs given
- Otherwise, simply consider every entry in the valid set, and yield those that satisfy the span check.
Corresponding code is as follows: https://github.com/asyml/forte/blob/beae4e923c9a6873b582588972e6ec9919079271/forte/data/data_pack.py#L676-L735
A couple of performance optimizations:
- It is not necessary to create new sets for
entry_type
each time. For most of the type sets forentry_type
andcomponent
constraints will be very large (containing 30% of all IDs, or even more), and creating such indices may not be faster than iterating over everything. - When there is
range_annotation
(the common case), if coverage index exists, further span checks are redundant. - The optimization for Annotations can be similarly applied to Links and Groups, since they can also be represented as a single interval in the span check.