[dvc][doc] Create MVP for DaVinciRecordTransformer
Summary
- Added
storeRecordsInDaVinciboolean to the constructor to detect if a user wants to store their transformed records in DaVinci or another storage of their choice. - Created
DaVinciRecordTransformerConfigto be used byDaVinciConfig, so we have access to Output Value Class and Schema without needing to create a dummy record transformer. - Added a recovery method to bootstrap from VT or from local state on startup. If blob transfer is enabled, an error will be thrown since it's not currently supported with the record transformer.
- Created
AbstractStorageIteratorand a RocksDB implementation to be used by the recovery method. - Added logic to detect if transformer logic has changed on startup, utilizing the hash of the class bytecode. If transform logic has changed, that means local state is stale and we need to bootstrap directly from the VT.
- Added logic to create a DVRT specific Avro deserializer.
- Split
putintotransformandprocessPut, so that when we're trying to bootstrap from local state we don't try to transform the data again as it already has been transformed. Also created a wrapper methodtransformAndProcessPutwhen both need to be called. - Added delete logic in
StoreIngestionTask. - Updated record transformer doc with new logic.
- Made Venice repo a link in the workspace setup doc.
How was this PR tested?
Added more unit tests.
Does this PR introduce any user-facing changes?
- [X] No. You can skip the rest of this section.
- [ ] Yes. Make sure to explain your proposed changes and call out the behavior change.
The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...
Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.
Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.
The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...
Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.
Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.
From my testing, the hash would change under these circumstances:
- Change in JDK version.
- Any changes made within the user's implementation (I found this to be very reliable).
Circumstances where it didn't change:
- Change in version numbers of imported packages that's used by the user's implementation.
- Code changes in functions imported.
- Changes in the abstract class.
I think having it off by default and having the user enable it makes sense.
We could have a global variable that's initially set to false, and add another constructor where a user can pass in a boolean of whether or not to enable it. We could call the variable detectChangesInLogic. If it's set to false, then always bootstrap from VT. Thoughts @ZacAttack?
I have finished my first two passes over this. Good progress, but there are some significant gotchas that have to be addressed. The first and most primary one being that I do not believe this implementation will work with version pushes.