ozone icon indicating copy to clipboard operation
ozone copied to clipboard

HDDS-7281. [Snapshot] Handle RocksDB compaction DAG persistence and reconstruction

Open smengcl opened this issue 2 years ago • 3 comments

This PR rebased #3786, changed the first commit title (so there is less chance of merging this patch with incorrect commit message), switched branch name (as the dev branch cannot be changed in the original PR), and added more commits.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7281

What changes were proposed in this pull request?

  • Register custom onCompactionBegin and onCompactionCompleted event listeners that is attached on OM startup RDB open that tracks RocksDB compaction. (partially done in HDDS-7224)
  • The onCompactionBegin listener saves the copies of the compaction input files (by hardlinking them in a separate directory) that are about to be deleted by RDB compaction logic. (initially done in HDDS-7224)
  • The onCompactionCompleted listener persists the RDB compaction input SST files and output SST file names to a plain text file (compaction log) on each compaction. The compaction log file name itself takes the RDB latest sequence number (with padded zeros) in hope of achieving monotonically increasing order when the files are listed and sorted alphabetically.
    • This leaves room for optimization on compaction log read logic if necessary.
    • This implies that on a OM metadata reset, the compaction log dir must be purged as well, otherwise it would confuse the differ.
    • A new compaction log file is created every time the differ is initialized (e.g. during OM startup / re-initialization), or when a new OM snapshot is created. The reason of not using a single large file to log the compaction is that it makes future garbage collection implementation harder (and more inefficient) because there isn't a general way to truncate the beginning chunk of a file.
  • RocksDBCheckpointDiffer#printSnapdiffSSTFiles can now load the compaction log in order when some SST files are not found as a node in the DAG. Then it could return the list of different SST files to be further processed by RocksDiff (to speed up OM snapshot diff operation). If the compaction log is missing or corrupted, it should report such error and expect RocksDiff to fall back to a slow full diff as the last resort.

TODO

  • [x] Refactor RocksDBCheckpointDiffer to reduce duplication between two setRocksDBForCompactionTracking methods (one for standalone testing and one for OM since they are using different RDB options right now)
  • [ ] ~~Use https://github.com/apache/ozone/pull/3658 to locate the from (src) and to (dest) snapshots in the snapshot chain. It should be much more efficient to locate the snapshot in the chain this way~~. No longer applicable with the latest implementation where it doesn't care about the snapshot chain anymore. -- It simply reads from the first compaction log that exists in the directory until it loads and finds all the SST files.
  • [ ] Address all TODOs added in this PR.

How was this patch tested?

  • [x] Standalone UT TestRocksDBCheckpointDiffer (that tests with its own RDB) works as expected.
  • [ ] Integration test TestOMSnapshotDAG. testZeroSizeKey can trigger the event listeners correctly.

smengcl avatar Oct 12 '22 02:10 smengcl

These lines are confusing to me: https://github.com/apache/ozone/blob/c7e922fb0c4d4c6c312bd717246c9fd8fde27e6f/hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java#L734-L738

I think something like this would be clearer:

      Set<CompactionNode> currentLevel = new HashSet<>();
      currentLevel.add(infileNode);
      Set<CompactionNode> nextLevel = new HashSet<>();

GeorgeJahad avatar Oct 12 '22 21:10 GeorgeJahad

This is in response to the discussion here: https://github.com/apache/ozone/pull/3786#discussion_r992935645

To me there is value in the conceptual simplicity of having all the state in rocksdb.

And I don't think it would have to be a new RDB instance. It could just be a separate CF in the existing one. I know that that CF wouldn't be shared through ratis transactions, but it would be shared with followers when they get bootstrapped (which happens to be what we want.)

Finally, with respect to the ordering problem mentioned above, couldn't the db lastSeqNum be used as the rocksdb key, (as it is used in the filename currently)?

I don't feel too strongly about it, and will understand if you think it is not worth the effort. But it does seem like a cleaner approach to me.

GeorgeJahad avatar Oct 12 '22 22:10 GeorgeJahad

This is in response to the discussion here: #3786 (comment)

To me there is value in the conceptual simplicity of having all the state in rocksdb.

And I don't think it would have to be a new RDB instance. It could just be a separate CF in the existing one. I know that that CF wouldn't be shared through ratis transactions, but it would be shared with followers when they get bootstrapped (which happens to be what we want.)

Doesn't bootstrapping pack the entire ozone-metadata directory? If this is the case then the new compaction-log directory would have been included automatically. (I need to double check.)

I understand it would be easier (and cleaner) if we just use a new CF in the existing OM RDB to store compaction history. The implementation would be trivial. But there are some aspects I dislike about it:

  1. The info we are storing to the DB (input / output SST file pairs) tightly relates to the file structure of the database itself, which isn't really related to main functions of OM, and can differ from OM to OM. Plus the fact that the compaction log isn't crucial for the RocksDiff feature to work (as it is meant for acceleration by returning only the relevant set of SST files). If we do want to persist the log in RDB I would prefer starting a new RDB instance just for this (a metadata DB of the existing DB).
  2. We are putting extra pressure on RDB compaction by writing those extra K-V pairs to the DB itself after each compaction. Though the pressure should be small provided that compaction doesn't happen too frequently (depends on DB write and DB options).

Finally, with respect to the ordering problem mentioned above, couldn't the db lastSeqNum be used as the rocksdb key, (as it is used in the filename currently)?

Good point.

smengcl avatar Oct 12 '22 23:10 smengcl

Current compaction log format is input files newline output files

Its a very risky format. Lets change it to something like

C(compaction log entry) $ input-file1, i2, i3 : outputfile-1, o2 o3
S(Snapshot creation entry) $ Snapshot number
C(compaction log entry) $ input-file1, i2, i3 : outputfile-1, o2 o3

But this can be done in a separate PR also and a separate Jira. Lets get this one in as it is working. All improvements can happen incrementally.

prashantpogde avatar Oct 25 '22 21:10 prashantpogde

This PR is ready for another round of review.

UT TestRocksDBCheckpointDiffer output for reference:

2022-11-01 22:22:44,258 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-1': [000013, 000022, 000011, 000018, 000015, 000020]
2022-11-01 22:22:44,261 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-50000': [000013, 000022, 000018, 000015, 000020]
2022-11-01 22:22:44,263 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-99999': [000022, 000018, 000015, 000020]
2022-11-01 22:22:44,265 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-149998': [000022, 000018, 000020]
2022-11-01 22:22:44,267 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-199997': [000022, 000020]
2022-11-01 22:22:44,269 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-249996': [000022]
2022-11-01 22:22:44,272 [main] INFO  rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-250000': []

smengcl avatar Nov 02 '22 05:11 smengcl

LGTM for this PR. Additional set of changes that are highlighted in this PR can also go in the next PR.

prashantpogde avatar Nov 03 '22 01:11 prashantpogde

I will merge this in a minute.

smengcl avatar Nov 03 '22 16:11 smengcl

Thanks @GeorgeJahad @sadanand48 @prashantpogde @hemantk-12 for reviewing this.

smengcl avatar Nov 03 '22 16:11 smengcl