Refactoring Data Store Structure
This PR fixes #874
Description of changes
This PR changes how data is stored in a DataStore entry. The main idea behind the new format is that all attributes of an entry other than its tid and type_name are treated as dataclass attributes. As a result, the DataStore format of any entry can be visualized as: [tid, type_name, ...(dataclass attributes)...].
For example, the type_attributes of Sentence can be seen as:
{
"attributes": {
"begin": 2,
"end": 3,
"payload_idx": 4,
"speaker": 5,
"part_id": 6,
"sentiment": 7,
"classification": 8,
"classifications": 9,
},
"parent_class": set(),
}
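To make the layout concrete, here is a minimal sketch of how a flat entry list could be filled from such a mapping. The helper `build_entry`, the constants `TID_INDEX` and `TYPE_NAME_INDEX`, and the sample values are hypothetical and only illustrate the idea; they are not taken from this PR. The sketch also assumes that attribute indices are contiguous and start at 2.

```python
# Hypothetical illustration of the [tid, type_name, ...attributes] layout.
# The index values are copied from the Sentence example above; the entry
# values themselves are made up for demonstration.
TID_INDEX = 0        # assumed position of tid
TYPE_NAME_INDEX = 1  # assumed position of type_name

sentence_type_attributes = {
    "attributes": {
        "begin": 2,
        "end": 3,
        "payload_idx": 4,
        "speaker": 5,
        "part_id": 6,
        "sentiment": 7,
        "classification": 8,
        "classifications": 9,
    },
    "parent_class": set(),
}


def build_entry(tid, type_name, type_attributes, values):
    """Build a flat entry list; attributes without a value stay None."""
    attrs = type_attributes["attributes"]
    # Two reserved slots (tid, type_name) plus one slot per attribute.
    entry = [None] * (2 + len(attrs))
    entry[TID_INDEX] = tid
    entry[TYPE_NAME_INDEX] = type_name
    for name, idx in attrs.items():
        entry[idx] = values.get(name)
    return entry


entry = build_entry(
    tid=1234,
    type_name="ft.onto.base_ontology.Sentence",
    type_attributes=sentence_type_attributes,
    values={"begin": 0, "end": 12, "payload_idx": 0, "speaker": "teacher"},
)
```

With this layout, reading an attribute is a dictionary lookup followed by a list index, for example `entry[sentence_type_attributes["attributes"]["speaker"]]`.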
This is implemented by adding a class variable to Entry called cached_attribute_data. This is a dict that stores the initial values of dataclass attributes. The implementation makes sure that, before a data store entry is initialized for a given entry object, the cached_attribute_data dict stores all dataclass attributes and their initial values. There are two ways in which a dataclass attribute can be added to cached_attribute_data:
- Attribute has an entry_setter property: In this case, the entry will automatically be added to cached_attribute_data.
- Attribute does not have an entry_setter property: In this case, attribute values are stored in the entry object. These are fetched before the creation of the data store entry to populate cached_attribute_data.
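The two paths can be sketched with a toy class. This is only an illustration under stated assumptions: `ToyEntry`, `attribute_names`, and `collect_uncached_attributes` are hypothetical names, and keying the cache by tid is an assumption made for the sketch rather than something this description specifies.

```python
from typing import Any, Dict


class ToyEntry:
    """Hypothetical stand-in for Entry; not Forte's actual class."""

    # Class-level cache: tid -> {attribute name -> initial value}.
    # Keying by tid is an assumption made for this sketch.
    cached_attribute_data: Dict[int, Dict[str, Any]] = {}

    # Attribute names that should end up in the data store (toy values).
    attribute_names = ("speaker", "sentiment")

    def __init__(self, tid: int):
        self.tid = tid
        ToyEntry.cached_attribute_data[tid] = {}
        # Plain attribute: stored on the object until it is collected.
        self.sentiment = {}

    # Path 1: an attribute with a setter property writes to the cache directly.
    @property
    def speaker(self):
        return ToyEntry.cached_attribute_data[self.tid].get("speaker")

    @speaker.setter
    def speaker(self, value):
        ToyEntry.cached_attribute_data[self.tid]["speaker"] = value

    def collect_uncached_attributes(self) -> None:
        """Path 2: copy attributes still held on the object into the cache."""
        cache = ToyEntry.cached_attribute_data[self.tid]
        for name in self.attribute_names:
            if name not in cache:
                cache[name] = getattr(self, name, None)


toy = ToyEntry(tid=1)
toy.speaker = "teacher"            # goes straight into the cache
toy.sentiment["positive"] = 0.9    # stays on the object for now
toy.collect_uncached_attributes()  # run before creating the data store entry
initial_values = ToyEntry.cached_attribute_data[1]
```

After the collection step, the cache holds every attribute regardless of which path it took, which is the state the description requires before a data store entry is created.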
Possible influences of this PR.
- Since all attributes are now dataclass attributes, we do not need to rely on constants. Instead, we use the function get_datastore_attr_idx to fetch the position in the datastore where a given attribute is stored.
- The new format makes DataStore more scalable, since any new attribute can be added to the entry as well as its datastore just by declaring it as a dataclass attribute.
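As a rough sketch of why a name-based lookup scales better than fixed constants, the snippet below derives attribute positions from dataclass fields. The names `ToySentence`, `ToySentenceWithLang`, `attribute_positions`, and `lookup_attr_idx` are hypothetical; in particular, `lookup_attr_idx` only mimics the idea behind get_datastore_attr_idx, whose actual signature is not shown here.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional

RESERVED_FIELDS = 2  # assumed: slots 0 and 1 hold tid and type_name


@dataclass
class ToySentence:
    begin: int = 0
    end: int = 0
    speaker: Optional[str] = None


@dataclass
class ToySentenceWithLang(ToySentence):
    # Declaring a new dataclass attribute is all that is needed to extend
    # the layout; no index constant has to be added by hand.
    language: Optional[str] = None


def attribute_positions(entry_cls) -> Dict[str, int]:
    """Map each dataclass field name to its slot in the flat entry list."""
    return {
        f.name: RESERVED_FIELDS + i for i, f in enumerate(fields(entry_cls))
    }


def lookup_attr_idx(entry_cls, attr: str) -> int:
    """Return the slot of `attr`, raising KeyError for unknown attributes."""
    positions = attribute_positions(entry_cls)
    if attr not in positions:
        raise KeyError(f"{entry_cls.__name__} has no attribute {attr!r}")
    return positions[attr]


speaker_idx = lookup_attr_idx(ToySentence, "speaker")
language_idx = lookup_attr_idx(ToySentenceWithLang, "language")
```

Because positions are computed from the declared fields, the subclass picks up a slot for `language` automatically, directly after the inherited attributes.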
Test Conducted
The main aim of this PR was to keep the outermost interface unchanged while still passing data_store_test, data_pack_test and multi_pack_test.
This PR does apply changes to CV Ontologies since it is currently getting updated itself.
Codecov Report
Merging #882 (c5b3af8) into master (d6b137d) will increase coverage by 0.04%. The diff coverage is 93.02%.
@@ Coverage Diff @@
## master #882 +/- ##
==========================================
+ Coverage 80.91% 80.95% +0.04%
==========================================
Files 254 254
Lines 19551 19569 +18
==========================================
+ Hits 15819 15843 +24
+ Misses 3732 3726 -6
| Impacted Files | Coverage Δ | |
|---|---|---|
| forte/data/extractors/relation_extractor.py | 24.32% <0.00%> (ø) | |
| tests/forte/data/data_pack_test.py | 98.85% <ø> (ø) | |
| forte/data/ontology/top.py | 76.43% <90.78%> (-1.74%) | :arrow_down: |
| forte/data/base_pack.py | 76.17% <93.33%> (+0.51%) | :arrow_up: |
| forte/data/data_store.py | 93.24% <93.50%> (-0.07%) | :arrow_down: |
| forte/common/constants.py | 100.00% <100.00%> (ø) | |
| forte/data/base_store.py | 75.34% <100.00%> (-0.34%) | :arrow_down: |
| forte/data/data_pack.py | 85.90% <100.00%> (+0.02%) | :arrow_up: |
| forte/data/entry_converter.py | 88.31% <100.00%> (+5.41%) | :arrow_up: |
| forte/data/ontology/core.py | 76.95% <100.00%> (+0.08%) | :arrow_up: |
| ... and 6 more | | |
> The way this is implemented is by creating a class variable for Entry called cached_attribute_data. This is a dict that stores the initial values of dataclass attributes. The implementation makes sure that before initializing a data store entry for a given entry object, the cached_attribute_data dict stores all dataclass attributes and their initial values. There are two ways in which a dataclass attribute can be added to cached_attribute_data:
> - Attribute has an entry_setter property: In this case, the entry will automatically be added to cached_attribute_data.
> - Attribute does not have an entry_setter property: In this case, attribute values are stored in the entry object. These are fetched before the creation of the data store entry to populate cached_attribute_data.
I'm thinking we might want to make the behavior consistent. Right now we are maintaining two distinct approaches to setting dataclass attributes, one before and one after the property function is registered. But that is out of the scope of this PR and not a high priority.