forte icon indicating copy to clipboard operation
forte copied to clipboard

`DataPack.get()` has bad performance on large packs

Open huzecong opened this issue 5 years ago • 0 comments

The current logic for .get() (which internally calls .get_entries() can be summarized as:

  1. Create sets of valid entry IDs for the given constraints:
    • Entry must be of type entry_type;
    • Entry must be generated by component;
    • Entry must be within span of range_annotation (if coverage index exists).
  2. Take the intersection of these sets, denote it as the "valid set".
  3. If entry_type is Annotation, then a faster code path exists:
    1. Find the lower and upper bounds of entry IDs given range_annotation. This is done using binary search on the Annotation index.
    2. Iterate over all entries within range, further check its span, and yield those that satisfy the check.
  4. Otherwise, simply consider every entry in the valid set, and yield those that satisfy the span check.

Corresponding code is as follows: https://github.com/asyml/forte/blob/beae4e923c9a6873b582588972e6ec9919079271/forte/data/data_pack.py#L676-L735

A couple of performance optimizations:

  1. It is not necessary to create new sets for entry_type each time. For most of the type sets for entry_type and component constraints will be very large (containing 30% of all IDs, or even more), and creating such indices may not be faster than iterating over everything.
  2. When there is range_annotation (the common case), if coverage index exists, further span checks are redundant.
  3. The optimization for Annotations can be similarly applied to Links and Groups, since they can also be represented as a single interval in the span check.

huzecong avatar Nov 02 '19 02:11 huzecong