nearcore icon indicating copy to clipboard operation
nearcore copied to clipboard

fix: Handling of persisting errors in on_chunk_completed.

Open tayfunelmas opened this issue 11 months ago • 3 comments

Currently on_chunk_completed is called when a chunk is to completed. It persists the chunk and this may fail. This change is addressing a todo for more graceful handling of the error (#10569).

ATTENTION TO REVIEWERS: Sending this PR to start a discussion around the right handling of the error. Is this change that simple or am I missing something? for example, if the persisting the chunk fails, what else we need to do (or revert)?

tayfunelmas avatar Mar 21 '24 22:03 tayfunelmas

Codecov Report

Attention: Patch coverage is 46.15385% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 71.64%. Comparing base (92e5938) to head (d30d2c7). Report is 4 commits behind head on master.

Files Patch % Lines
chain/client/src/client.rs 46.15% 6 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10854      +/-   ##
==========================================
+ Coverage   71.61%   71.64%   +0.02%     
==========================================
  Files         758      758              
  Lines      151765   151924     +159     
  Branches   151765   151924     +159     
==========================================
+ Hits       108692   108840     +148     
- Misses      38537    38544       +7     
- Partials     4536     4540       +4     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) :arrow_down:
db-migration 0.24% <0.00%> (-0.01%) :arrow_down:
genesis-check 1.42% <0.00%> (-0.01%) :arrow_down:
integration-tests 37.37% <46.15%> (-0.02%) :arrow_down:
linux 70.26% <46.15%> (+0.02%) :arrow_up:
linux-nightly 71.15% <46.15%> (+0.06%) :arrow_up:
macos 54.74% <46.15%> (+0.02%) :arrow_up:
pytests 1.65% <0.00%> (-0.01%) :arrow_down:
sanity-checks 1.43% <0.00%> (-0.01%) :arrow_down:
unittests 67.30% <46.15%> (+0.03%) :arrow_up:
upgradability 0.29% <0.00%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 21 '24 22:03 codecov[bot]

I no longer remember the context of what we should do if this case errors. But thinking about it now, (1) I don't think the chain can continue if this persist fails, because we rely on the existence of the partial chunks in order to execute any chunk; (2) there's no reason this would fail because it is simply a store write, and the only way a store write would fail is if there's something just wrong with the db (IO error, disk full, etc.) in which case there is likely a catastrophic failure.

So, I think .expect is actually OK. @staffik do you have additional context here since you filed the issue?

I do not have additional context, just filled this issue because of this suggestion: https://github.com/near/nearcore/pull/10568#discussion_r1478384630 If the only way a store write would fail is if there's something wrong with the db then I would leave .expect here too.

staffik avatar Mar 28 '24 09:03 staffik

Based on Robin's comment, I am now also more comfortable with leaving the .expect call as is since handling of IO failure correctly may be more challenging or impossible. Adding @wacban to see if we miss anything.

tayfunelmas avatar Mar 29 '24 17:03 tayfunelmas