jesterj
Thoughts on data provenance
A few thoughts on this:
- Terminology: "data provenance" rather than "FTI / fault tolerant indexing"
- Use a domain-driven development methodology and push 3rd-party dependencies to the edges of the framework's architecture. To that end, persistence would be created through a factory rather than tightly coupled to the core. The advantage, of course, is being able to plug in a better-suited 3rd-party provider if one comes along.
- Additionally, some deployments may have operational requirements that demand tight control over how and where data is persisted. With an interface for persistence, a custom implementation can easily be plugged in (e.g. one backed by an RDBMS, however inefficient that may be).
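To make the factory idea above concrete, here is a minimal sketch of what a pluggable persistence boundary might look like. Every name in it (`DocumentStatusStore`, `InMemoryStatusStore`, `StatusStoreFactory`, the `"memory"` key) is hypothetical and chosen only for illustration; none of it is existing JesterJ API, and a real design would need to settle what is actually persisted.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical persistence interface: the framework core would depend only on this.
interface DocumentStatusStore {
    void record(String docId, String status);
    Optional<String> lookup(String docId);
}

// Illustrative in-memory implementation. An RDBMS-backed (or any other)
// implementation would satisfy the same interface.
final class InMemoryStatusStore implements DocumentStatusStore {
    private final Map<String, String> statuses = new ConcurrentHashMap<>();

    @Override
    public void record(String docId, String status) {
        statuses.put(docId, status);
    }

    @Override
    public Optional<String> lookup(String docId) {
        return Optional.ofNullable(statuses.get(docId));
    }
}

// Hypothetical factory: providers are registered by name, so a deployment
// can swap in its own persistence without touching the core.
final class StatusStoreFactory {
    private static final Map<String, Supplier<DocumentStatusStore>> PROVIDERS =
            new ConcurrentHashMap<>();

    static {
        register("memory", InMemoryStatusStore::new);
    }

    private StatusStoreFactory() {
    }

    static void register(String name, Supplier<DocumentStatusStore> provider) {
        PROVIDERS.put(name, provider);
    }

    static DocumentStatusStore create(String name) {
        Supplier<DocumentStatusStore> provider = PROVIDERS.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("No persistence provider registered as: " + name);
        }
        return provider.get();
    }
}
```

A deployment with special requirements would call something like `StatusStoreFactory.register("rdbms", MyJdbcStore::new)` at startup (where `MyJdbcStore` is again hypothetical) and select it by name in configuration.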
FTI and DP are not quite the same. DP is an auditing-related use case in which we would want to record things for later processing in a write-only manner. FTI is happy to lose previous information as long as it can tell, when the application restarts, whether a document was fully processed or needs to be reprocessed. I think a DP scheme will most likely involve a dedicated DP log file, but that can be added down the road. Converting this to an enhancement and help wanted, with the intent that the first step (before code, please) is to describe the intended design.
Data provenance is a major undertaking, and there are many other things to do first. I'm going to close this ticket; when the topic comes up again, I or the contributor who wishes to implement it can write a detailed specification in a new ticket.
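As a rough illustration of the distinction drawn above, the sketch below contrasts the two storage shapes. It is not a proposed design: both class names are hypothetical, and both keep their data in memory, whereas a real FTI tracker would have to persist its state to survive a restart and a real DP log would presumably be a dedicated file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// FTI-style (hypothetical): only the latest state per document matters,
// so each update overwrites the previous one.
final class LatestStatusTracker {
    private final Map<String, Boolean> fullyProcessed = new HashMap<>();

    void markStarted(String docId) {
        fullyProcessed.put(docId, false);
    }

    void markDone(String docId) {
        fullyProcessed.put(docId, true);
    }

    // Unknown or unfinished documents need (re)processing.
    boolean needsReprocessing(String docId) {
        return !fullyProcessed.getOrDefault(docId, false);
    }
}

// DP-style (hypothetical): every event is appended and nothing is
// overwritten, so the full history is available for later auditing.
final class ProvenanceLog {
    private final List<String> entries = new ArrayList<>();

    void append(String docId, String event) {
        entries.add(docId + ":" + event);
    }

    // Read-only view of the recorded history.
    List<String> entries() {
        return Collections.unmodifiableList(entries);
    }
}
```

The tracker forgets that a document was ever in the "started" state once it is marked done, which is fine for FTI; the log keeps both events, which is what an audit would need.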