HDDS-10295. Provide an "ozone repair" subcommand to update the snapshot info in transactionInfoTable

Open DaveTeng0 opened this issue 10 months ago • 7 comments

What changes were proposed in this pull request?

The issue found in HDDS-9342 caused the snapshot info in the OM transactionInfoTable not to be updated in a timely manner, so OM restart failed at the update ID check during Raft log reapply.

The recovery solution is to find the largest update ID and update the snapshot info in transactionInfoTable with it.

This task aims to provide a CLI to update the table. Note that the largest update ID and its term currently still need to be found manually.
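As a rough illustration of the update step described above, here is a minimal sketch that models transactionInfoTable as an in-memory map. The class name, the key, and the `term#index` encoding are assumptions made for this sketch only; they are not taken from this PR or from the Ozone code base.

```java
import java.util.HashMap;
import java.util.Map;

public class TransactionInfoUpdateSketch {
  // Hypothetical key and separator, chosen for this sketch only.
  static final String INFO_KEY = "#TRANSACTIONINFO";
  static final String SEPARATOR = "#";

  /** Overwrites the stored (term, index) pair and returns the previous value, or null. */
  static String update(Map<String, String> table, long term, long index) {
    if (term < 0 || index < 0) {
      throw new IllegalArgumentException("term and index must be non-negative");
    }
    return table.put(INFO_KEY, term + SEPARATOR + index);
  }

  public static void main(String[] args) {
    // Stand-in for the table; a real repair would open the offline OM DB instead.
    Map<String, String> table = new HashMap<>();
    table.put(INFO_KEY, "3" + SEPARATOR + "100"); // stale snapshot info

    // Term and index are assumed to have been found manually beforehand.
    String previous = update(table, 3, 250);
    System.out.println("previous=" + previous + " current=" + table.get(INFO_KEY));
  }
}
```

Returning the previous value lets the caller print what was overwritten, which is useful when a repair has to be audited or reverted.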

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10295

How was this patch tested?

Integration test

DaveTeng0 avatar Apr 15 '24 22:04 DaveTeng0

cc. @ChenSammi @szetszwo @errose28 @adoroszlai

DaveTeng0 avatar Apr 15 '24 22:04 DaveTeng0

This task aims to provide a CLI to update the table. Note that the largest update ID and its term currently still need to be found manually.

Since this is an offline CLI I think it should also support finding the largest updateID (even if it's slow) and doing the update. Maybe as two steps (one to find the largest ID, and another to update to that). Doing the repair incorrectly can result in some bad states and we should try to make the repair commands as safe as possible. @ChenSammi or @fapifta can probably confirm what the correct steps to do the repair are since I haven't actually manually repaired a DB from this bug myself. I think scanning the DB for largest update ID will give the correct number to set the transaction index to.

errose28 avatar Apr 16 '24 17:04 errose28
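The two-step repair suggested in the comment above (first find the largest update ID, then update to it) could be sketched roughly as follows. This is an in-memory model, not the PR's implementation: the class and method names are hypothetical, the tables are plain maps of update IDs, and the rule that the stored index is never moved backwards is an extra safety assumption of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

public class TwoStepRepairSketch {
  /** Step 1: scan the update IDs of every table and return the largest, if any. */
  static OptionalLong findLargestUpdateId(Map<String, List<Long>> updateIdsByTable) {
    return updateIdsByTable.values().stream()
        .flatMap(List::stream)
        .mapToLong(Long::longValue)
        .max();
  }

  /** Step 2: pick the index to store; the stored index is never lowered (sketch assumption). */
  static long repairedIndex(long storedIndex, OptionalLong largestUpdateId) {
    if (largestUpdateId.isPresent() && largestUpdateId.getAsLong() > storedIndex) {
      return largestUpdateId.getAsLong();
    }
    return storedIndex;
  }

  public static void main(String[] args) {
    // Hypothetical table contents: table name -> update IDs of its records.
    Map<String, List<Long>> db = Map.of(
        "keyTable", List.of(90L, 240L),
        "bucketTable", List.of(12L, 250L));

    OptionalLong largest = findLargestUpdateId(db);
    long storedIndex = 100L; // stale index currently recorded
    System.out.println("largest=" + largest + " repaired=" + repairedIndex(storedIndex, largest));
  }
}
```

Keeping the scan and the update as separate steps lets an operator inspect the value found in step 1 before anything is written.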

Hey @ChenSammi @szetszwo, I looked at the previous Jira https://issues.apache.org/jira/browse/HDDS-9342, but I'm still not sure what the best way to retrieve the highest TermIndex is, other than checking OM's log. I see that the two maps 'applyTransactionMap' and 'ratisTransactionMap', which might have contained that information, have been removed from OM. Do you know where else we could retrieve the TermIndex, other than from OM's log? Thanks!

DaveTeng0 avatar Apr 17 '24 17:04 DaveTeng0

... two maps of 'applyTransactionMap' and 'ratisTransactionMap' have been removed from om ...

@DaveTeng0, since this is an offline CLI, there is no OM running, so these two maps would not be available even if they had not been removed.

szetszwo avatar Apr 17 '24 20:04 szetszwo

... except checking om's log. ...

I guess you mean the OM Raft log? It cannot be used either, since its log entries may or may not have been applied.

The correct way is to find the highest index from RocksDB. This should be what @errose28 has suggested.

szetszwo avatar Apr 17 '24 20:04 szetszwo

... two maps of 'applyTransactionMap' and 'ratisTransactionMap' have been removed from om ...

@DaveTeng0, since this is an offline CLI, there is no OM running, so these two maps would not be available even if they had not been removed.

Oh! That's right!

DaveTeng0 avatar Apr 17 '24 20:04 DaveTeng0

... except checking om's log. ...

I guess you mean the OM Raft log? It cannot be used either, since its log entries may or may not have been applied.

The correct way is to find the highest index from RocksDB. This should be what @errose28 has suggested.

I created a Jira, HDDS-10730, to investigate how to parse all RocksDB files to get the highest TermIndex of OM.

DaveTeng0 avatar Apr 21 '24 22:04 DaveTeng0

Hello! If there are no further comments, please feel free to merge. Thanks!

DaveTeng0 avatar May 14 '24 18:05 DaveTeng0

@DaveTeng0 The TestTransactionInfoRepair tests are failing due to an NPE. Can you please fix that?

hemantk-12 avatar May 14 '24 20:05 hemantk-12

@errose28 can you please take a look at the final PR?

hemantk-12 avatar May 28 '24 22:05 hemantk-12

Thanks, @DaveTeng0 for the change and @errose28 for the review.

hemantk-12 avatar Jun 12 '24 02:06 hemantk-12

Thanks @hemantk-12 , @errose28 !

DaveTeng0 avatar Jun 12 '24 13:06 DaveTeng0