indra Optimizing evidence representation

This PR implements two optimizations to the representation of Evidence objects that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of the memory used by INDRA Statements is attributable to the Evidence objects (including evidence text) attached to them. One approach is to define the __slots__ attribute of Evidence so that its set of attributes is fixed in advance (rather than variable via a __dict__ attribute). This seemed to make only a minor difference in memory usage. Much larger savings can be achieved if the list of Evidences attached to a Statement is stored in a serialized, compressed form, and only decompressed and deserialized when accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory with this PR. On some large assembled corpora that I tried, in which Statements carry varying numbers of Evidences, 80% lower memory usage is typical.
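As a rough illustration of the two ideas (not the code in this PR), the sketch below combines a __slots__-based class with a compressed, serialized list. The Evidence stand-in, its to_dict method, and the pack_evidence/unpack_evidence helpers are all hypothetical names, and JSON plus zlib is an assumed choice of serializer and compressor.

```python
import json
import zlib


class Evidence:
    """Hypothetical, simplified stand-in for INDRA's Evidence class."""
    # Fixed attribute set: instances carry no per-object __dict__.
    __slots__ = ('source_api', 'pmid', 'text')

    def __init__(self, source_api=None, pmid=None, text=None):
        self.source_api = source_api
        self.pmid = pmid
        self.text = text

    def to_dict(self):
        # Collect the slot values into a JSON-serializable dict.
        return {name: getattr(self, name) for name in self.__slots__}


def pack_evidence(evs):
    """Serialize a list of Evidence objects to compressed bytes."""
    payload = json.dumps([ev.to_dict() for ev in evs])
    return zlib.compress(payload.encode('utf-8'))


def unpack_evidence(blob):
    """Rebuild a new list of Evidence objects from compressed bytes."""
    records = json.loads(zlib.decompress(blob).decode('utf-8'))
    return [Evidence(**record) for record in records]


# Made-up example data: 100 evidences with repetitive text.
evs = [Evidence('reach', str(i), 'MEK phosphorylates ERK. ' * 5)
       for i in range(100)]
blob = pack_evidence(evs)
restored = unpack_evidence(blob)
```

Only blob needs to stay in memory between accesses; unpack_evidence builds new objects each time it is called, which is what makes in-place edits to the returned list non-persistent.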

Not much of this affects the way INDRA Statements are used; however, there is one important difference: accessing a Statement's evidence (i.e., stmt.evidence) returns a view of the list of Evidences (deserialized on access) rather than a reference to the stored list. Directly manipulating stmt.evidence will therefore not result in persistent changes to the Statement. Rather, one has to do something like:

evs = stmt.evidence
for ev in evs:
    pass  # Make some changes to each ev object
stmt.evidence = evs

to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests, needed to be updated. I am still ambivalent about whether this change will cause confusion later, and am therefore not yet sure whether this PR should be merged.

Nov 05 '19 16:11 bgyori

From my point of view, this new API is pretty confusing. It's unclear why saving in a variable solves this problem.

Nov 12 '19 17:11 cthoyt

Well, users of INDRA would never really notice any change. It's only during internal development (e.g., of pre-assembly algorithms or input processors) that one could make a mistake by attempting to change a view of a list of Evidences rather than the actual evidence attribute of a Statement. Saving into a variable is not really necessary; the key is to always set evidences as stmt.evidence = [...] so that the actual evidence list attribute is updated, rather than to iterate over and manipulate stmt.evidence[idx] directly, which with this change would only modify a view of the evidences. I agree it is somewhat confusing, hence my ambivalence about the change.
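To make the difference concrete, here is a minimal sketch assuming the getter returns a fresh copy on every access. The Statement class below is a hypothetical stand-in (with plain dicts in place of Evidence objects and copy.deepcopy in place of the real decompress-and-deserialize step), not INDRA's implementation.

```python
import copy


class Statement:
    """Hypothetical stand-in: the getter hands out a copy of the stored list."""

    def __init__(self, evidence):
        self.evidence = evidence

    @property
    def evidence(self):
        # Each access rebuilds the list, as decompress-and-deserialize would.
        return copy.deepcopy(self._stored)

    @evidence.setter
    def evidence(self, evs):
        # Assignment replaces the stored representation.
        self._stored = copy.deepcopy(list(evs))


stmt = Statement([{'text': 'original'}])

# Mutating through the getter edits a temporary copy, so the change is lost.
stmt.evidence[0]['text'] = 'changed'
lost = stmt.evidence[0]['text']

# Modifying a list and assigning it back goes through the setter and persists.
evs = stmt.evidence
for ev in evs:
    ev['text'] = 'changed'
stmt.evidence = evs
kept = stmt.evidence[0]['text']
```

In this sketch the local variable matters only because it holds the modified list until it is assigned back; any expression that produces the new list and is assigned to stmt.evidence has the same effect.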

Nov 12 '19 18:11 bgyori
