dagu icon indicating copy to clipboard operation
dagu copied to clipboard

solving #970 and updating the doc regarding wrong mention of exitCode

Open kriyanshii opened this issue 6 months ago • 7 comments

Fix: Improve Log File Handling During Retries

Problem

When a DAG step is retried, the log files were not being properly handled, leading to "file already closed" errors. This occurred because:

  1. The same log file was being reused across retries
  2. File handles were being closed after the first attempt
  3. Subsequent retries couldn't write to the closed file

Solution

Modified the log file handling in internal/digraph/scheduler/node.go to:

  1. Create unique log files for each retry attempt
  2. Append retry count to log filenames (e.g., .retry1, .retry2)
  3. Ensure proper file handle management during retries

Changes

  • Updated Setup method to include retry count in log filenames
  • Added retry suffix format: .retryN where N is the retry count
  • Maintained backward compatibility for non-retry cases

Testing

  • Verified that retries now create separate log files
  • Confirmed no "file already closed" errors during retries
  • Tested with multiple retry attempts
  • Ensured backward compatibility for non-retry cases

Impact

  • Improved reliability of retry mechanism
  • Better debugging capabilities with separate logs for each retry
  • Clearer history of retry attempts
  • No changes to existing functionality for non-retry cases

Related Issues

Closes #[970]

kriyanshii avatar May 31 '25 19:05 kriyanshii

Retry should be happening only after last retry is complete. I am not sure if it is a race condition. If each retry step is closing log file, we should try to reopen it in append mode if it exists (in subsequent retries) or create it (in first retry). @yottahmd What was original behavior before this bug got in?

You can also attempt this solution:

if n.GetRetryCount() > 0 and exitCode is in the exitCodesList { //don't close the log file at following line: https://github.com/dagu-org/dagu/blob/9c2c1633a935133dd172fcacc8696fcca862b53f/internal/digraph/scheduler/node.go#L441 }

We need to pass on the exitCode returned from last retry to tearDown function. And this bug exists in 1.16.x

ghansham avatar Jun 01 '25 00:06 ghansham

Hi @ghansham, @kriyanshii — thanks for investigating this! 🙏

I think @ghansham is right. It seems like a bug that the log file gets closed on each retry. We should ensure that it stays open and doesn’t get closed with each retry (It was the original behavior).

yottahmd avatar Jun 01 '25 01:06 yottahmd

I think we can just remove the code below. @kriyanshii Can you please update the PR? https://github.com/dagu-org/dagu/blob/4dfbec53d0093333bce71cb3033a3541ef15dda5/internal/digraph/scheduler/scheduler.go#L252-L255

yottahmd avatar Jun 01 '25 01:06 yottahmd

Very cool

ghansham avatar Jun 01 '25 01:06 ghansham

@yottahmd I have updated the PR. Please review.

kriyanshii avatar Jun 01 '25 14:06 kriyanshii

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 61.58%. Comparing base (73f500c) to head (f422f69).

Files with missing lines Patch % Lines
internal/digraph/scheduler/node.go 78.57% 2 Missing and 1 partial :warning:
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #971      +/-   ##
==========================================
- Coverage   61.59%   61.58%   -0.01%     
==========================================
  Files          85       85              
  Lines       10854    10861       +7     
==========================================
+ Hits         6685     6689       +4     
- Misses       3485     3487       +2     
- Partials      684      685       +1     
Files with missing lines Coverage Δ
internal/digraph/scheduler/scheduler.go 76.64% <ø> (-0.14%) :arrow_down:
internal/digraph/scheduler/node.go 64.64% <78.57%> (+0.09%) :arrow_up:

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 73f500c...f422f69. Read the comment docs.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jun 01 '25 14:06 codecov[bot]

It's better to keep a single log file but at the start of every retry we should print some message mentioning retry no. X (Start Time: ......)

ghansham avatar Jun 02 '25 02:06 ghansham

Apologies for leaving the PR open. Thank you for all your hard work on this issue —I really appreciate it, even though it hasn’t been merged. If any other issues catch your eye, please feel free to pick them up!

yottahmd avatar Jun 27 '25 14:06 yottahmd