HDDS-10295. Provide an "ozone repair" subcommand to update the snapshot info in transactionInfoTable
What changes were proposed in this pull request?
The issue found in HDDS-9342 prevented the snapshot info in the OM transactionInfoTable from being updated in time, so OM restart failed at the update ID check while reapplying the Raft log.
The recovery procedure is to find the largest update ID and update the snapshot info in transactionInfoTable with it.
This task provides a CLI to update the table. Note that the largest update ID and its term currently still have to be found manually.
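As a rough illustration of the failure mode described above, here is a simplified model, not the actual OM code. The class `UpdateIdCheckSketch`, its `Record` type, and the methods `apply` and `replay` are all hypothetical names. The assumption is that each record remembers the index of the last transaction that modified it, and that replay starts right after the index stored in the transaction info.

```java
/** Simplified model of the update ID check; all names here are hypothetical. */
final class UpdateIdCheckSketch {

  /** Stand-in for an OM record that stores the index of its last update. */
  static final class Record {
    long updateId;

    Record(long updateId) {
      this.updateId = updateId;
    }
  }

  /** Applies one transaction; rejects an index older than the record's update ID. */
  static void apply(Record record, long txIndex) {
    if (txIndex < record.updateId) {
      throw new IllegalStateException("Transaction index " + txIndex
          + " is older than the record's update ID " + record.updateId);
    }
    record.updateId = txIndex;
  }

  /** Replays every log entry whose index is greater than lastAppliedIndex. */
  static void replay(Record record, long lastAppliedIndex, long[] logIndexes) {
    for (long index : logIndexes) {
      if (index > lastAppliedIndex) {
        apply(record, index);
      }
    }
  }
}
```

In this model, a record already at update ID 10 with a stale last-applied index of 7 fails as soon as entry 8 is replayed. Once the stored index is raised to 10, entries 8 to 10 are skipped and replay succeeds.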
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10295
How was this patch tested?
Integration test
cc. @ChenSammi @szetszwo @errose28 @adoroszlai
This task provides a CLI to update the table. Note that the largest update ID and its term currently still have to be found manually.
Since this is an offline CLI, I think it should also support finding the largest updateID (even if it's slow) and doing the update, maybe as two steps (one to find the largest ID, and another to update to that). Doing the repair incorrectly can result in some bad states, and we should try to make the repair commands as safe as possible. @ChenSammi or @fapifta can probably confirm what the correct steps to do the repair are, since I haven't actually manually repaired a DB from this bug myself. I think scanning the DB for the largest update ID will give the correct number to set the transaction index to.
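To make the two-step idea concrete, here is a minimal sketch under stated assumptions. It does not use Ozone or RocksDB APIs: the DB is modelled as an in-memory map from table name to the update IDs of its records, and `TransactionInfoRepairSketch`, `TermIndex`, `findLargestUpdateId`, and `planUpdate` are hypothetical names. The term is still passed in by the operator, and never moving the index backwards is an extra safety assumption of this sketch.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are Ozone APIs. */
final class TransactionInfoRepairSketch {

  /** Assumed shape of the value to write back: a Raft term plus an index. */
  static final class TermIndex {
    final long term;
    final long index;

    TermIndex(long term, long index) {
      this.term = term;
      this.index = index;
    }
  }

  /**
   * Step 1: scan every record of every table and return the largest
   * update ID seen, or -1 if there are no records.
   * The map stands in for the offline DB: table name -> update IDs.
   */
  static long findLargestUpdateId(Map<String, List<Long>> updateIdsByTable) {
    long largest = -1L;
    for (List<Long> updateIds : updateIdsByTable.values()) {
      for (long updateId : updateIds) {
        largest = Math.max(largest, updateId);
      }
    }
    return largest;
  }

  /**
   * Step 2: decide what to write. The term is supplied by the operator.
   * Safety assumption of this sketch: the index is never moved backwards.
   */
  static TermIndex planUpdate(TermIndex current, long largestUpdateId, long term) {
    if (largestUpdateId <= current.index) {
      return current; // already up to date, nothing to repair
    }
    return new TermIndex(term, largestUpdateId);
  }
}
```

Keeping the scan separate from the write lets an operator inspect the result of step 1 before anything in the DB is changed.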
Hey @ChenSammi @szetszwo, I looked at the previous JIRA https://issues.apache.org/jira/browse/HDDS-9342, but I'm still not sure what the best way to retrieve the highest TermIndex is, except checking om's log. I see that two maps of 'applyTransactionMap' and 'ratisTransactionMap' have been removed from om, which might have contained that information. So do you know where we could retrieve that TermIndex information, other than from om's log? Thanks!
... two maps of 'applyTransactionMap' and 'ratisTransactionMap' have been removed from om ...
@DaveTeng0, since this is an offline CLI, there is no OM running, so these two maps would not be available even if they had not been removed.
... except checking om's log. ...
I guess you mean the OM Raft log? It cannot be used either, since its log entries may or may not have been applied.
The correct way is to find the highest index from RocksDB. This should be what @errose28 has suggested.
... two maps of 'applyTransactionMap' and 'ratisTransactionMap' have been removed from om ...
@DaveTeng0, since this is an offline CLI, there is no OM running, so these two maps would not be available even if they had not been removed.
Oh, that's right!
... except checking om's log. ...
I guess you mean the OM Raft log? It cannot be used either, since its log entries may or may not have been applied.
The correct way is to find the highest index from RocksDB. This should be what @errose28 has suggested.
I created a JIRA, HDDS-10730, to investigate how to parse all RocksDB files to get the highest TermIndex of OM.
Hello! If there are no further comments, please feel free to merge. Thanks!
@DaveTeng0 The TestTransactionInfoRepair tests are failing due to an NPE. Can you please fix that?
@errose28 can you please take a look at the final PR?
Thanks, @DaveTeng0 for the change and @errose28 for the review.
Thanks @hemantk-12 , @errose28 !