oxc icon indicating copy to clipboard operation
oxc copied to clipboard

Coworkers report that lint --fix will sometimes delete / empty / truncate a file.

Open Raynos opened this issue 1 year ago • 21 comments
trafficstars

One of the typescript files gets replaced with the empty file and all its content get deleted. This is confusing and it's unclear which oxlint rule is doing this.

Raynos avatar Sep 25 '24 22:09 Raynos

Hi Raynos, thanks for the report. Do you think you could provide me some more details to reproduce?

  1. What rules do you have enabled?
  2. What exact command are you running?
  3. Can you provide a minimal reproduction of the file?

Thanks!

DonIsaac avatar Sep 26 '24 01:09 DonIsaac

oxlint -c=./.oxlintrc.json --tsconfig=./tsconfig.json . -D correctness -D perf -D suspicious --promise-plugin --import-plugin --fix
{
  "settings": {},
  "rules": {
    "no-unused-vars": "allow",
    "no-new-array": "allow",
    "no-empty-file": "allow",
    "no-document-cookie": "allow",
    "no-this-alias": [
      "deny",
      {
        "allow_names": ["self"]
      }
    ],
    "no-await-in-loop": "allow",
    "react-in-jsx-scope": "allow",
    "consistent-function-scoping": "allow",
    "no-async-endpoint-handlers": "allow",
    "no-new": "allow",
    "no-extraneous-class": "allow"
  }
}

I have not being able to reproduce this locally but I've asked my coworkers for a reproduction uasecase.

Raynos avatar Sep 26 '24 01:09 Raynos

@Raynos thanks for the details. I will start debugging this once I have a file I can reproduce on.

DonIsaac avatar Sep 26 '24 16:09 DonIsaac

one file that disappeared is workspaces/metrics-exporter/src/metadataQueue.ts when I hadn't touched anything in that workspace

coworker mentioned a file was truncated in a directory he has not touched on his branch.

Raynos avatar Sep 26 '24 17:09 Raynos

and it's flaky hard to reproduce.

Raynos avatar Sep 26 '24 17:09 Raynos

Happened to me too, it randomly deleted

workspaces/pipeline/src/memo/ram-memo.ts
```

Raynos avatar Sep 26 '24 18:09 Raynos

I'm running it in a loop and it deleted 5/6 random files :(

Raynos avatar Sep 26 '24 18:09 Raynos

https://github.com/PabloSzx/graphql-ez/tree/jake/reproduction

And

$ while true ; do npm run lint:fix ; done

Will reproduce it ; you should run pnpm install in this repo as its pnpm; not npm.

Raynos avatar Sep 26 '24 18:09 Raynos

@Raynos what OS are you and your colleagues using?

DonIsaac avatar Sep 26 '24 23:09 DonIsaac

...and is this how you guys run oxlint??

DonIsaac avatar Sep 26 '24 23:09 DonIsaac

Tl;Dr

This was weird. I think I found a way to make this stop happening, but I could not figure out or solve the underlying issue. Either way, I'll have a PR out after I finish writing this out.

Longer Version

So this bug was weird. My initial thought was "oh, maybe a deletion fix is getting applied to the whole file, or there's a bug in the fix merging or something". So I added an assertion after the new source file has been built.

// oxc_linter/src/service.rs line 268
let fix_result = Fixer::new(source_text, messages).fix();
assert(!fix_result.fixed_code.is_empty();

The assertion never panicked, and files kept getting deleted. What gives?

My next thought is that maybe something isn't getting flushed somewhere. So I replaced the line that writes the fixed source

// before
fs::write(path, fix_result.fixed_code.as_bytes()).unwrap()

With something that forces a blocking write-through to disk, using man 2 fsync

let mut file = File::create(path).unwrap();
file.write_all(fix_result.fixed_code.as_bytes()).unwrap();
file.flush().unwrap(); // for good measure
file.sync_all().unwrap();

When I tried that, oxlint went from 20m to 200ms, but still files kept getting deleted. And here is where I got confused.

This lead me down a rabbit hole of the weirdness and brokenness of close, especially in Rust, where close will silently fail (for example, if a system interrupt is sent while you're trying to close).

So what if we closed it ourselves?

unsafe {
  let exit = libc::close(file.into_raw_fd());
  assert_eq!(exit, 0);
}

And you know what? THAT DIDN'T WORK EITHER! can you tell this is driving me a little mad?

Finally, thankfully, I found a workable solution. It does not address any of the above-mentioned issues or weirdness, but it should work. Basically, the lint Runtime is always writing fixed content, _even if no fixes have been applied. This basically means that if you have --fix set, all files will have their content re-written whether or not any changes were actually made. This seems to solve the problem.

Sorry for the long reply, that was 2 hours of my life and I had to get it off my chest.

DonIsaac avatar Sep 27 '24 01:09 DonIsaac

@Raynos what OS are you and your colleagues using?

Mostly macintosh / macbooks, maybe one ubuntu linux user.

We don't run oxlint in a loop but yes the npm run lint:fix command is what we are running.

Raynos avatar Sep 27 '24 01:09 Raynos

I used the loop example to eventually reproduce the flaky behavior of file truncation.

Raynos avatar Sep 27 '24 01:09 Raynos

So i'm curious what happens if you do apply the fix to all files, but at the end of the program, instead of exiting the cli, you sleep for 10milliseconds.

Whether that will stop the behavior of empty buffer being written to the files.

Like is the problem here that writing the same content to the file plus flaky close behavior plus an efficient binary that at the end of the program writes to all the files, writes to stdout and exits the process cleanly leaves some file descriptors in a nonflushed / bad state.

Do you think your fix of not writing to the file if no fixes were applied will cause this bug frequency to go from "1 in 10 runs" to "1 in 10,000 runs" ?

I've never seen file writing operations get corrupted like this before, it seems like such a trustworthy battle hardened piece of the OS...

Raynos avatar Sep 27 '24 01:09 Raynos

Have you looked whether the issue is the non-atomic nature of fs::write ? https://github.com/rust-lang/rust/issues/82590

Could it be that multiple rules are trying to concurrently fix a file and write the original content back to the file and that causes the truncation ?

Would replacing the fs::write with fs::write to temp file and then an atomic file rename cause the problem to go away ? |

Edit: oh ffs, file renames are only atomic on linux, not on macos,

Raynos avatar Sep 27 '24 01:09 Raynos

So i'm curious what happens if you do apply the fix to all files, but at the end of the program, instead of exiting the cli, you sleep for 10milliseconds.

Whether that will stop the behavior of empty buffer being written to the files.

Like is the problem here that writing the same content to the file plus flaky close behavior plus an efficient binary that at the end of the program writes to all the files, writes to stdout and exits the process cleanly leaves some file descriptors in a nonflushed / bad state.

Do you think your fix of not writing to the file if no fixes were applied will cause this bug frequency to go from "1 in 10 runs" to "1 in 10,000 runs" ?

I've never seen file writing operations get corrupted like this before, it seems like such a trustworthy battle hardened piece of the OS...

I added a sleep 1 command inside the loop and saw this:

  1. With sync_all, no corruption occurred.
  2. Without sync_all some corruption still occurred, but less

Unfortunately the cost of syncing, at around 10x the runtime performance, is simply too high. Note that syncing without a sleep still causes corruption.

DonIsaac avatar Sep 27 '24 03:09 DonIsaac

Despite being fixed, we need an integration test to catch future regressions.

Boshen avatar Sep 27 '24 06:09 Boshen

I have no idea how to reproduce it consistently in an integration test

DonIsaac avatar Sep 27 '24 21:09 DonIsaac

if you want a slow integration test you can spawn oxlint as a child process in a loop with a timeout of 5s and assert no files were truncated.

You can try to revert the fix locally and see if 5s is enough looping to reproduce it ±50% of the time.

Raynos avatar Sep 30 '24 19:09 Raynos

For flaky behavior its basically impossible to write a consistent integration test, your just writing a test that will hopefully catch regressions on every second commit on average.

Raynos avatar Sep 30 '24 19:09 Raynos

@Boshen, do you think writing a 5s loop regression test is worth it?

DonIsaac avatar Oct 03 '24 18:10 DonIsaac

@Boshen, do you think writing a 5s loop regression test is worth it?

Probably, leave this to me, I'll think about this issue.

Boshen avatar Oct 04 '24 15:10 Boshen

Added a test in https://github.com/oxc-project/oxc/pull/6747, the test checks the modified timestamp to ensure file is not written when no fix is applied.

Boshen avatar Oct 21 '24 15:10 Boshen