jesterj

Thoughts on data provenance

Open dgoldenberg1234 opened this issue 8 years ago • 1 comments

A few thoughts on this:

  • Terminology: "data provenance" rather than "FTI / fault tolerant indexing"
  • Use a domain-driven development methodology and push 3rd party dependencies to the edges of the framework's architecture. To that end, persistence would be obtained through a factory rather than tightly coupled to one provider. The advantage, of course, is that a better-suited 3rd party provider can be plugged in if one comes along.
  • Additionally, some deployments may have operational requirements that demand tight control over how and where data is persisted. If we have an interface for persistence, a custom persistence implementation can easily be plugged in (e.g. one backed by an RDBMS, however inefficient that may be).
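
To make the interface-plus-factory idea concrete, here is a minimal sketch. Every name in it (`DocumentStatusStore`, `InMemoryStatusStore`, `StatusStoreFactory`, the `"memory"` key) is hypothetical and chosen for illustration; none of it is taken from the JesterJ codebase.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical names throughout; these types do not exist in JesterJ.

// The only persistence contract the framework core would depend on.
interface DocumentStatusStore {
    void record(String docId, String status);
    Optional<String> lookup(String docId);
}

// Trivial default implementation that keeps everything in memory.
final class InMemoryStatusStore implements DocumentStatusStore {
    private final Map<String, String> statuses = new ConcurrentHashMap<>();

    @Override
    public void record(String docId, String status) {
        statuses.put(docId, status);
    }

    @Override
    public Optional<String> lookup(String docId) {
        return Optional.ofNullable(statuses.get(docId));
    }
}

// Factory living at the edge of the architecture. A deployment could
// register its own provider (for example one backed by an RDBMS).
final class StatusStoreFactory {
    private static final Map<String, Supplier<DocumentStatusStore>> PROVIDERS =
            new ConcurrentHashMap<>();

    static {
        register("memory", InMemoryStatusStore::new);
    }

    private StatusStoreFactory() {
    }

    static void register(String name, Supplier<DocumentStatusStore> provider) {
        PROVIDERS.put(name, provider);
    }

    static DocumentStatusStore create(String name) {
        Supplier<DocumentStatusStore> provider = PROVIDERS.get(name);
        if (provider == null) {
            throw new IllegalArgumentException(
                    "No persistence provider registered as: " + name);
        }
        return provider.get();
    }
}
```

With something like this, core code would call `StatusStoreFactory.create(...)` and never name a concrete storage class, so swapping providers becomes a registration or configuration change rather than a code change.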

dgoldenberg1234 Mar 22 '16 13:03

FTI and DP are not quite the same. DP is an auditing-related use case in which we would want to record things for later processing in a write-only manner. FTI is happy to lose previous information as long as it can tell, when the application restarts, whether a document was fully processed or needs to be reprocessed. I think a DP scheme will most likely involve a dedicated DP log file, but I think this can be added down the road. I'm converting this to an enhancement and marking it help wanted, with the intent that the first step (before code, please) is to describe the intended design.
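
One way to picture the distinction is the sketch below. It is purely illustrative and assumes nothing about how JesterJ implements FTI; `FtiTracker` and `ProvenanceLog` are hypothetical names, and the in-memory list merely stands in for the dedicated log file mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; not JesterJ classes.

// FTI-style state: only the latest answer per document matters,
// so each update overwrites the previous one.
final class FtiTracker {
    private final Map<String, Boolean> fullyProcessed = new HashMap<>();

    void mark(String docId, boolean done) {
        fullyProcessed.put(docId, done); // earlier value is discarded
    }

    // Unknown or unfinished documents must be processed again after a restart.
    boolean needsReprocessing(String docId) {
        return !fullyProcessed.getOrDefault(docId, false);
    }
}

// DP-style record: entries are only ever appended, never updated or
// removed, so the full history stays available for later auditing.
final class ProvenanceLog {
    private final List<String> entries = new ArrayList<>();

    void append(String docId, String event) {
        entries.add(docId + "\t" + event);
    }

    int size() {
        return entries.size();
    }
}
```

The tracker answers a single question at restart time, while the log grows without bound and answers questions nobody has asked yet, which is part of why the two are likely to want different storage.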

nsoft Jun 12 '17 21:06

Data provenance is a major undertaking and there are many other things to do first. I'm going to close this ticket, and when the topic comes up again, I or the contributor who wishes to implement it can write a detailed specification in a new ticket.

nsoft Feb 21 '23 18:02