marklogic-data-hub
RFE: provide a simple config for mastering / SMT with one final collection
While working with the Smart Mastering Toolkit (SMT), we found it useful to have one output collection that holds all the mastered data: both the "merged" records and the "noMatch" records.
So for Person mastering we plan to:
- put all Person records into a "person-unprocessed" collection and a "person-content" collection during harmonization
- run an SMT flow which
  - looks for records in the "person-unprocessed" collection (that is, the collector gathers the URIs of all Person documents in that collection)
  - has an onMerge collection configuration in the merge config that adds the "person-mastered" collection and removes "person-unprocessed"
  - has an onNoMatch collection configuration in the merge config that does the same (adds "person-mastered" and removes "person-unprocessed")
The end effects are:
- maintain the overall "person-content" collection for all records, so mastering works over the full universe of data
- have all master documents in "person-mastered", regardless of whether they got there via a merge or a no-match
- provide a single collection representing content that is yet to be processed
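The collection changes described above can be sketched in a few lines of Python. This is only a model of the intended behavior using plain sets, not the Smart Mastering API; the `COLLECTION_RULES` structure and the `apply_event` helper are hypothetical names introduced for illustration.

```python
# Hypothetical model of the collection changes described in this issue.
# Plain Python sets only; this does not call Smart Mastering or MarkLogic.
COLLECTION_RULES = {
    "onMerge": {"add": ["person-mastered"], "remove": ["person-unprocessed"]},
    "onNoMatch": {"add": ["person-mastered"], "remove": ["person-unprocessed"]},
}


def apply_event(collections, event, rules=COLLECTION_RULES):
    """Return the collections a document would have after a mastering event."""
    rule = rules[event]
    return (set(collections) | set(rule["add"])) - set(rule["remove"])


# A record as it leaves harmonization, before the SMT flow has run.
harmonized = {"person-content", "person-unprocessed"}

# A record with no match keeps "person-content", gains "person-mastered",
# and drops out of "person-unprocessed".
after_no_match = apply_event(harmonized, "onNoMatch")
```

Because both events apply the same rule, "person-mastered" ends up describing every mastered record, and "person-unprocessed" shrinks to whatever the flow has not yet seen.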
A simple configuration, and documentation of this as a standard pattern, would be helpful.
Default configurations that only specify this information would be even better. E.g. specify the "masterData" collection name as a default, and maintain it without having to add/remove collections to make that happen.
Even better would be to get the default behavior above from a convention based on the Entity name:
- `<Entity>-content`
- `<Entity>-mastered`
- `<Entity>-unprocessed`

requiring no configuration of collections at all.
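As a rough illustration of the naming convention requested here, the default collection names could be derived from the entity name alone. The `default_collections` function is hypothetical, and lowercasing the entity name (so that "Person" yields "person-content") is an assumption based on the Person example earlier in this issue.

```python
def default_collections(entity_name):
    """Derive the three conventional collection names for an entity.

    Hypothetical helper: lowercasing the entity name is an assumption,
    chosen so that "Person" maps to the "person-*" names used above.
    """
    prefix = entity_name.lower()
    return {
        "content": f"{prefix}-content",
        "mastered": f"{prefix}-mastered",
        "unprocessed": f"{prefix}-unprocessed",
    }


person_defaults = default_collections("Person")
```

With a convention like this, a mastering flow would need only the entity name to know which collection to read from and which ones to add or remove.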