HDDS-7281. [Snapshot] Handle RocksDB compaction DAG persistence and reconstruction
This PR rebased #3786, changed the first commit title (so there is less chance of merging this patch with an incorrect commit message), switched the branch name (as the dev branch cannot be changed in the original PR), and added more commits.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7281
## What changes were proposed in this pull request?
- Register custom `onCompactionBegin` and `onCompactionCompleted` event listeners, attached when the OM opens RocksDB at startup, to track RocksDB compactions. (Partially done in HDDS-7224.)
- The `onCompactionBegin` listener saves copies of the compaction input files (by hard-linking them into a separate directory) that are about to be deleted by the RocksDB compaction logic. (Initially done in HDDS-7224.)
- The `onCompactionCompleted` listener persists the names of the compaction input SST files and output SST files to a plain-text file (the compaction log) on each compaction. Each compaction log file is named after the latest RocksDB sequence number (with padded zeros), so that the files sort into monotonically increasing order when listed alphabetically.
  - This leaves room for optimization of the compaction log read logic if necessary.
- This implies that on an OM metadata reset, the compaction log directory must be purged as well; otherwise it would confuse the differ.
- A new compaction log file is created every time the differ is initialized (e.g. during OM startup / re-initialization), or when a new OM snapshot is created. The reason for not using a single large file to log the compactions is that it would make a future garbage collection implementation harder (and less efficient), because there is no general way to truncate the beginning chunk of a file.
- `RocksDBCheckpointDiffer#printSnapdiffSSTFiles` can now load the compaction logs in order when some SST files are not found as nodes in the DAG. It can then return the list of differing SST files to be further processed by `RocksDiff` (to speed up the OM snapshot diff operation). If the compaction log is missing or corrupted, it should report the error and expect `RocksDiff` to fall back to a slow full diff as the last resort.
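As an illustration of the padded-zero naming scheme described above, here is a minimal sketch (not the PR's actual code, and the `.log` suffix is an assumption): zero-padding the sequence number to a fixed width makes plain alphabetical sorting of the file names agree with numeric order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CompactionLogName {
  // Width 20 covers the full decimal range of an unsigned 64-bit
  // sequence number, so padding never breaks lexicographic ordering.
  static String logFileName(long latestSequenceNumber) {
    return String.format("%020d", latestSequenceNumber) + ".log";
  }

  public static void main(String[] args) {
    List<String> names = new ArrayList<>(Arrays.asList(
        logFileName(100), logFileName(9), logFileName(25)));
    Collections.sort(names);  // alphabetical order == numeric order
    System.out.println(names);
  }
}
```

Without the padding, `"100.log"` would sort before `"9.log"`, which is exactly the ordering hazard the naming scheme avoids.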
## TODO
- [x] Refactor `RocksDBCheckpointDiffer` to reduce duplication between the two `setRocksDBForCompactionTracking` methods (one for standalone testing and one for the OM, since they currently use different RocksDB options).
- [ ] ~~Use https://github.com/apache/ozone/pull/3658 to locate the `from` (src) and `to` (dest) snapshots in the snapshot chain. It should be much more efficient to locate the snapshots in the chain this way.~~ No longer applicable with the latest implementation, which does not care about the snapshot chain anymore: it simply reads from the first compaction log that exists in the directory until it has loaded and found all the SST files.
- [ ] Address all TODOs added in this PR.
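The "read logs until all the SST files are found" idea above can be sketched as follows. This is a simplified illustration, not the PR's code: each parsed log is modeled as a list of SST file names, the logs are assumed to be pre-sorted by file name, and the method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompactionLogScan {
  // Walk compaction logs in order, stopping as soon as every SST file
  // we are looking for has appeared. Returning false signals the
  // caller to fall back to a slow full diff.
  static boolean foundAll(List<List<String>> sortedLogs, Set<String> wanted) {
    Set<String> remaining = new HashSet<>(wanted);
    for (List<String> log : sortedLogs) {
      remaining.removeAll(log);
      if (remaining.isEmpty()) {
        return true;  // no need to read any further logs
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<List<String>> logs = Arrays.asList(
        Arrays.asList("000011", "000013"),
        Arrays.asList("000015"));
    System.out.println(foundAll(logs, new HashSet<>(Arrays.asList("000013", "000015"))));
  }
}
```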
## How was this patch tested?
- [x] Standalone UT `TestRocksDBCheckpointDiffer` (which tests with its own RocksDB) works as expected.
- [ ] Integration test `TestOMSnapshotDAG.testZeroSizeKey` can trigger the event listeners correctly.
These lines are confusing to me: https://github.com/apache/ozone/blob/c7e922fb0c4d4c6c312bd717246c9fd8fde27e6f/hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java#L734-L738
I think something like this would be clearer:

```java
Set<CompactionNode> currentLevel = new HashSet<>();
currentLevel.add(infileNode);
Set<CompactionNode> nextLevel = new HashSet<>();
```
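For context, the `currentLevel`/`nextLevel` pattern suggested above generalizes to a full level-by-level DAG walk. Here is a self-contained sketch using plain strings for nodes; the successor map stands in for the DAG edges, and none of this is the differ's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LevelTraversal {
  // Expand currentLevel into nextLevel via each node's successors
  // until no unvisited nodes remain; returns nodes in visit order.
  static List<String> reachable(Map<String, List<String>> successors, String start) {
    Set<String> visited = new HashSet<>();
    Set<String> currentLevel = new HashSet<>();
    currentLevel.add(start);
    List<String> order = new ArrayList<>();
    while (!currentLevel.isEmpty()) {
      Set<String> nextLevel = new HashSet<>();
      for (String node : currentLevel) {
        if (visited.add(node)) {
          order.add(node);
          nextLevel.addAll(successors.getOrDefault(node, Collections.emptyList()));
        }
      }
      nextLevel.removeAll(visited);  // never revisit a node
      currentLevel = nextLevel;
    }
    return order;
  }
}
```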
This is in response to the discussion here: https://github.com/apache/ozone/pull/3786#discussion_r992935645
To me there is value in the conceptual simplicity of having all the state in rocksdb.
And I don't think it would have to be a new RDB instance. It could just be a separate CF in the existing one. I know that that CF wouldn't be shared through ratis transactions, but it would be shared with followers when they get bootstrapped (which happens to be what we want.)
Finally, with respect to the ordering problem mentioned above, couldn't the db lastSeqNum be used as the rocksdb key (as it is used in the filename currently)?
I don't feel too strongly about it, and will understand if you think it is not worth the effort. But it does seem like a cleaner approach to me.
> This is in response to the discussion here: #3786 (comment)
>
> To me there is value in the conceptual simplicity of having all the state in rocksdb.
>
> And I don't think it would have to be a new RDB instance. It could just be a separate CF in the existing one. I know that that CF wouldn't be shared through ratis transactions, but it would be shared with followers when they get bootstrapped (which happens to be what we want.)
Doesn't bootstrapping pack the entire `ozone-metadata` directory? If that is the case, then the new `compaction-log` directory would be included automatically. (I need to double-check.)
I understand it would be easier (and cleaner) if we just used a new CF in the existing OM RDB to store the compaction history. The implementation would be trivial. But there are some aspects I dislike about it:
- The info we would be storing in the DB (input / output SST file pairs) relates tightly to the file structure of the database itself, which isn't really related to the main functions of the OM and can differ from OM to OM. Also, the compaction log isn't crucial for the RocksDiff feature to work (it is meant only as an acceleration, returning just the relevant set of SST files). If we do want to persist the log in RDB, I would prefer starting a new RDB instance just for this (a metadata DB for the existing DB).
- We would be putting extra pressure on RDB compaction by writing those extra K-V pairs to the DB itself after each compaction. That said, the pressure should be small provided that compaction doesn't happen too frequently (which depends on the DB write load and DB options).
> Finally, with respect to the ordering problem mentioned above, couldn't the db lastSeqNum be used as the rocksdb key (as it is used in the filename currently)?
Good point.
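If the lastSeqNum were used as a key, encoding it as a fixed-width big-endian byte array would make RocksDB's default bytewise comparator order entries by sequence number, mirroring the padded file-name trick. A minimal sketch of just the encoding (an illustration, not part of this PR):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SeqNumKey {
  // Fixed 8-byte big-endian encoding: unsigned lexicographic byte
  // comparison (what RocksDB's default comparator does) then matches
  // numeric ordering of the sequence numbers.
  static byte[] toKey(long seqNum) {
    return ByteBuffer.allocate(Long.BYTES).putLong(seqNum).array();
  }

  public static void main(String[] args) {
    // 25 sorts before 100 bytewise, unlike the strings "25" and "100"
    System.out.println(Arrays.compareUnsigned(toKey(25), toKey(100)) < 0);
  }
}
```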
The current compaction log format is: input files, newline, output files.

It's a very risky format. Let's change it to something like:

```
C (compaction log entry) $ input-file1, i2, i3 : outputfile-1, o2, o3
S (snapshot creation entry) $ snapshot number
C (compaction log entry) $ input-file1, i2, i3 : outputfile-1, o2, o3
```
But this can be done in a separate PR and a separate Jira as well. Let's get this one in as it is working; all improvements can happen incrementally.
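A parser for a compaction entry in the proposed format could look like the sketch below. The `$`, `:`, and `,` delimiters are taken from the comment above; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionLogParser {
  // Parses a "C ... $ inputs : outputs" line into its input and output
  // SST file name lists.
  static Map<String, List<String>> parseCompactionLine(String line) {
    String payload = line.substring(line.indexOf('$') + 1).trim();
    String[] sides = payload.split(":");
    Map<String, List<String>> result = new LinkedHashMap<>();
    result.put("inputs", splitFiles(sides[0]));
    result.put("outputs", splitFiles(sides[1]));
    return result;
  }

  static List<String> splitFiles(String s) {
    List<String> files = new ArrayList<>();
    for (String f : s.split(",")) {
      if (!f.trim().isEmpty()) {
        files.add(f.trim());
      }
    }
    return files;
  }

  public static void main(String[] args) {
    System.out.println(parseCompactionLine("C $ in1, in2 : out1, out2"));
  }
}
```

One advantage of the leading `C`/`S` entry-type marker is that the reader can skip or validate unknown entry types instead of mis-parsing them.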
This PR is ready for another round of review.
UT `TestRocksDBCheckpointDiffer` output for reference:

```
2022-11-01 22:22:44,258 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-1': [000013, 000022, 000011, 000018, 000015, 000020]
2022-11-01 22:22:44,261 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-50000': [000013, 000022, 000018, 000015, 000020]
2022-11-01 22:22:44,263 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-99999': [000022, 000018, 000015, 000020]
2022-11-01 22:22:44,265 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-149998': [000022, 000018, 000020]
2022-11-01 22:22:44,267 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-199997': [000022, 000020]
2022-11-01 22:22:44,269 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-249996': [000022]
2022-11-01 22:22:44,272 [main] INFO rocksdiff.TestRocksDBCheckpointDiffer (TestRocksDBCheckpointDiffer.java:diffAllSnapshots(170)) - SST diff from 'rocksdb-cp-250000' to 'rocksdb-cp-250000': []
```
LGTM for this PR. The additional changes highlighted in this PR can go into the next PR.
I will merge this in a minute.
Thanks @GeorgeJahad @sadanand48 @prashantpogde @hemantk-12 for reviewing this.