venice icon indicating copy to clipboard operation
venice copied to clipboard

[dvc][doc] Create MVP for DaVinciRecordTransformer

Open kvargha opened this issue 1 year ago • 3 comments

Summary

  1. Added storeRecordsInDaVinci boolean to the constructor to detect if a user wants to store their transformed records in DaVinci or another storage of their choice.
  2. Created DaVinciRecordTransformerConfig to be used by DaVinciConfig, so we have access to Output Value Class and Schema without needing to create a dummy record transformer.
  3. Added a recovery method to bootstrap from VT or from local state on startup. If blob transfer is enabled, an error will be thrown since it's not currently supported with the record transformer.
  4. Created AbstractStorageIterator and a RocksDB implementation to be used by the recovery method.
  5. Added logic to detect if transformer logic has changed on startup, utilizing the hash of the class bytecode. If transform logic has changed, that means local state is stale and we need to bootstrap directly from the VT.
  6. Added logic to create a DVRT specific Avro deserializer.
  7. Split put into transform and processPut, so that when we're trying to bootstrap from local state we don't try to transform the data again as it already has been transformed. Also created a wrapper method transformAndProcessPut when both need to be called.
  8. Added delete logic in StoreIngestionTask.
  9. Updated record transformer doc with new logic.
  10. Made Venice repo a link in the workspace setup doc.

How was this PR tested?

Added more unit tests.

Does this PR introduce any user-facing changes?

  • [X] No. You can skip the rest of this section.
  • [ ] Yes. Make sure to explain your proposed changes and call out the behavior change.

kvargha avatar Jul 30 '24 16:07 kvargha

The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...

Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.

Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.

FelixGV avatar Jul 30 '24 19:07 FelixGV

The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK...

Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change.

Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations.

From my testing, the hash would change under these circumstances:

  1. Change in JDK version.
  2. Any changes made within the user's implementation (I found this to be very reliable).

Circumstances where it didn't change:

  1. Change in version numbers of imported packages that's used by the user's implementation.
  2. Code changes in functions imported.
  3. Changes in the abstract class.

I think having it off by default and having the user enable it makes sense.

We could have a global variable that's initially set to false, and add another constructor where a user can pass in a boolean of whether or not to enable it. We could call the variable detectChangesInLogic. If it's set to false, then always bootstrap from VT. Thoughts @ZacAttack?

kvargha avatar Jul 30 '24 21:07 kvargha

I have finished my first two passes over this. Good progress, but there are some significant gotchas that have to be addressed. The first and most primary one being that I do not believe this implementation will work with version pushes.

ZacAttack avatar Aug 12 '24 21:08 ZacAttack