
Refactoring Data Store Structure

Open Pushkar-Bhuse opened this issue 3 years ago • 3 comments

This PR fixes #874

Description of changes

This PR changes how data is stored in a DataStore entry. The main idea behind the new format is that every attribute of an entry other than its tid and type_name is treated as a dataclass attribute. As a result, the DataStore format of any entry can be visualized as: [tid, type_name, ...(dataclass attributes)...].

For example, the type_attributes of Sentence can be seen as

{
    "attributes": {
        "begin": 2,
        "end": 3,
        "payload_idx": 4,
        "speaker": 5,
        "part_id": 6,
        "sentiment": 7,
        "classification": 8,
        "classifications": 9,
    },
    "parent_class": set(),
}
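To make the layout concrete, here is a minimal sketch of how a raw entry list could be allocated and filled from the attribute map above. The helper `new_raw_entry` and the constants `TID_INDEX` and `TYPE_NAME_INDEX` are hypothetical names used only for illustration; they are not taken from the PR.

```python
from typing import Any, Dict, List, Optional

# Assumption: positions 0 and 1 hold tid and type_name, as the PR text describes.
TID_INDEX = 0
TYPE_NAME_INDEX = 1

# Attribute-to-index map copied from the Sentence example above.
SENTENCE_ATTRIBUTES: Dict[str, int] = {
    "begin": 2,
    "end": 3,
    "payload_idx": 4,
    "speaker": 5,
    "part_id": 6,
    "sentiment": 7,
    "classification": 8,
    "classifications": 9,
}


def new_raw_entry(
    tid: int,
    type_name: str,
    attributes: Dict[str, int],
    values: Optional[Dict[str, Any]] = None,
) -> List[Any]:
    """Build a list shaped like [tid, type_name, ...dataclass attributes...]."""
    # The list must be long enough to hold the highest attribute index.
    size = max(attributes.values(), default=TYPE_NAME_INDEX) + 1
    raw: List[Any] = [None] * size
    raw[TID_INDEX] = tid
    raw[TYPE_NAME_INDEX] = type_name
    # Write each supplied value into the slot assigned to its attribute.
    for name, value in (values or {}).items():
        raw[attributes[name]] = value
    return raw


entry = new_raw_entry(
    1234,
    "ft.onto.base_ontology.Sentence",
    SENTENCE_ATTRIBUTES,
    {"begin": 0, "end": 5, "speaker": "A"},
)
```

Attributes that receive no value stay `None`, so every entry of a given type has the same length regardless of which attributes were set.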

This is implemented by adding a class variable to Entry called cached_attribute_data. It is a dict that stores the initial values of dataclass attributes. The implementation makes sure that, before a data store entry is initialized for a given entry object, cached_attribute_data holds all dataclass attributes and their initial values. A dataclass attribute can be added to cached_attribute_data in one of two ways:

  1. The attribute has an entry_setter property: in this case, the attribute is added to cached_attribute_data automatically.
  2. The attribute does not have an entry_setter property: in this case, the attribute's value is stored in the entry object. These values are fetched before the data store entry is created and used to populate cached_attribute_data.
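The two routes above can be sketched with a toy dataclass. This is an illustration only, not the code in this PR: `ToySentence` and `collect_cached_attribute_data` are hypothetical names, and the `setter_cache` argument stands in for whatever values the setter properties have already written.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ToySentence:
    """Hypothetical stand-in for an entry class with two dataclass attributes."""

    speaker: Optional[str] = None
    sentiment: Optional[str] = None


def collect_cached_attribute_data(
    entry: Any, setter_cache: Dict[str, Any]
) -> Dict[str, Any]:
    """Merge both routes into one dict of initial attribute values."""
    # Route 1: values that setter properties have already cached.
    cached = dict(setter_cache)
    # Route 2: values still held on the entry object are read from it
    # just before the data store entry would be created.
    for field in fields(entry):
        if field.name not in cached:
            cached[field.name] = getattr(entry, field.name)
    return cached


sentence = ToySentence(sentiment="positive")
cached = collect_cached_attribute_data(sentence, {"speaker": "Alice"})
```

In this sketch a value already present in the setter cache takes precedence over the one on the object; whether the PR resolves that case the same way is an assumption.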

Possible impact of this PR

  1. Since all attributes are now dataclass attributes, we no longer need to rely on constants. Instead, we use the function get_datastore_attr_idx to fetch the position in the datastore where a given attribute is stored.
  2. The new format makes DataStore more scalable, since any new attribute can be added to the entry, as well as to its datastore representation, by declaring it as a dataclass attribute.
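Both points can be illustrated with a small sketch. The real signature of get_datastore_attr_idx is not shown in this thread, so `lookup_attr_idx` below is a hypothetical stand-in, and `register_attribute` is likewise an assumed helper that shows how a newly declared attribute could receive the next free index.

```python
from typing import Any, Dict

TypeAttributes = Dict[str, Dict[str, Any]]


def lookup_attr_idx(
    type_attributes: TypeAttributes, type_name: str, attr: str
) -> int:
    """Return the datastore position of `attr` instead of using a constant."""
    return type_attributes[type_name]["attributes"][attr]


def register_attribute(
    type_attributes: TypeAttributes, type_name: str, attr: str
) -> int:
    """Assign the next free index to a new attribute and return it."""
    attrs = type_attributes[type_name]["attributes"]
    if attr in attrs:
        return attrs[attr]
    # Indices 0 and 1 are taken by tid and type_name, so the first
    # attribute slot is 2 when no attributes are registered yet.
    next_idx = max(attrs.values(), default=1) + 1
    attrs[attr] = next_idx
    return next_idx


type_attributes: TypeAttributes = {
    "Sentence": {
        "attributes": {"begin": 2, "end": 3, "payload_idx": 4},
        "parent_class": set(),
    }
}
end_idx = lookup_attr_idx(type_attributes, "Sentence", "end")
speaker_idx = register_attribute(type_attributes, "Sentence", "speaker")
```

Because callers resolve positions by name, adding an attribute changes only the map and not any code that reads existing attributes.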

Test Conducted

The main aim of this PR was to keep the outermost interface unchanged while still passing data_store_test, data_pack_test and multi_pack_test.

Pushkar-Bhuse avatar Jul 18 '22 21:07 Pushkar-Bhuse

This PR does apply changes to CV Ontologies since it is currently getting updated itself.

Pushkar-Bhuse avatar Jul 18 '22 21:07 Pushkar-Bhuse

Codecov Report

Merging #882 (c5b3af8) into master (d6b137d) will increase coverage by 0.04%. The diff coverage is 93.02%.

@@            Coverage Diff             @@
##           master     #882      +/-   ##
==========================================
+ Coverage   80.91%   80.95%   +0.04%     
==========================================
  Files         254      254              
  Lines       19551    19569      +18     
==========================================
+ Hits        15819    15843      +24     
+ Misses       3732     3726       -6     
| Impacted Files | Coverage Δ |
|---|---|
| forte/data/extractors/relation_extractor.py | 24.32% <0.00%> (ø) |
| tests/forte/data/data_pack_test.py | 98.85% <ø> (ø) |
| forte/data/ontology/top.py | 76.43% <90.78%> (-1.74%) :arrow_down: |
| forte/data/base_pack.py | 76.17% <93.33%> (+0.51%) :arrow_up: |
| forte/data/data_store.py | 93.24% <93.50%> (-0.07%) :arrow_down: |
| forte/common/constants.py | 100.00% <100.00%> (ø) |
| forte/data/base_store.py | 75.34% <100.00%> (-0.34%) :arrow_down: |
| forte/data/data_pack.py | 85.90% <100.00%> (+0.02%) :arrow_up: |
| forte/data/entry_converter.py | 88.31% <100.00%> (+5.41%) :arrow_up: |
| forte/data/ontology/core.py | 76.95% <100.00%> (+0.08%) :arrow_up: |
... and 6 more

codecov[bot] avatar Aug 01 '22 17:08 codecov[bot]

This is implemented by adding a class variable to Entry called cached_attribute_data. It is a dict that stores the initial values of dataclass attributes. The implementation makes sure that, before a data store entry is initialized for a given entry object, cached_attribute_data holds all dataclass attributes and their initial values. A dataclass attribute can be added to cached_attribute_data in one of two ways:

  1. The attribute has an entry_setter property: in this case, the attribute is added to cached_attribute_data automatically.
  2. The attribute does not have an entry_setter property: in this case, the attribute's value is stored in the entry object. These values are fetched before the data store entry is created and used to populate cached_attribute_data.

I'm thinking we might want to make this behavior consistent. Right now we maintain two distinct approaches for setting dataclass attributes, one used before the property function is registered and one used after. That said, it is out of scope for this PR and not a high priority.

mylibrar avatar Aug 03 '22 03:08 mylibrar