firecracker icon indicating copy to clipboard operation
firecracker copied to clipboard

test(jailer): enable multi-threaded test

Open cm-iwata opened this issue 1 year ago • 9 comments

Changes

Previously, all tests shared same temporary file/directory, causing concurrency conflicts when running tests in multi-threaded. Resolved test concurrency issues by incorporating random strings into file/directory names.

Reason

Fixes #4412

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

  • [x] If a specific issue led to this PR, this PR closes the issue.
  • [x] The description of changes is clear and encompassing.
  • [ ] Any required documentation changes (code and docs) are included in this PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • [ ] User-facing changes are mentioned in CHANGELOG.md.
  • [ ] All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • [x] Commits meet contribution quality standards.

  • [ ] This functionality cannot be added in rust-vmm.

cm-iwata avatar Feb 09 '24 03:02 cm-iwata

Codecov Report

Attention: Patch coverage is 96.29630% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.09%. Comparing base (6d6caf4) to head (7bd4d35).

Files Patch % Lines
src/jailer/src/env.rs 66.66% 1 Missing :warning:
src/jailer/src/main.rs 0.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4438      +/-   ##
==========================================
+ Coverage   82.08%   82.09%   +0.01%     
==========================================
  Files         255      255              
  Lines       31257    31296      +39     
==========================================
+ Hits        25656    25694      +38     
- Misses       5601     5602       +1     
Flag Coverage Δ
4.14-c5n.metal 79.59% <96.29%> (+0.02%) :arrow_up:
4.14-m5n.metal 79.57% <96.29%> (+0.02%) :arrow_up:
4.14-m6a.metal 78.80% <96.29%> (+0.02%) :arrow_up:
4.14-m6g.metal 76.63% <96.29%> (+0.03%) :arrow_up:
4.14-m6i.metal 79.57% <96.29%> (+0.01%) :arrow_up:
4.14-m7g.metal 76.63% <96.29%> (+0.03%) :arrow_up:
5.10-c5n.metal 82.11% <96.29%> (+0.02%) :arrow_up:
5.10-m5n.metal 82.09% <96.29%> (+0.01%) :arrow_up:
5.10-m6a.metal 81.41% <96.29%> (+0.02%) :arrow_up:
5.10-m6g.metal 79.40% <96.29%> (+0.02%) :arrow_up:
5.10-m6i.metal 82.09% <96.29%> (+0.02%) :arrow_up:
5.10-m7g.metal 79.40% <96.29%> (+0.02%) :arrow_up:
6.1-c5n.metal 82.11% <96.29%> (+0.02%) :arrow_up:
6.1-m5n.metal 82.10% <96.29%> (+0.02%) :arrow_up:
6.1-m6a.metal 81.41% <96.29%> (+0.02%) :arrow_up:
6.1-m6g.metal 79.39% <96.29%> (+0.01%) :arrow_up:
6.1-m6i.metal 82.09% <96.29%> (+0.02%) :arrow_up:
6.1-m7g.metal 79.40% <96.29%> (+0.02%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 09 '24 09:02 codecov[bot]

Hello @cm-iwata ,

Thank you for your contribution! Could you please fix clippy test errors? https://buildkite.com/firecracker/firecracker-pr/builds/8932#018d8d15-d42f-4863-969e-4df250266426/55-402

zulinx86 avatar Feb 09 '24 09:02 zulinx86

@zulinx86 Sorry. fixed it.

cm-iwata avatar Feb 09 '24 10:02 cm-iwata

@cm-iwata

Could you please do the following things?:

  • Add a commit message explaining the context like why this commit is needed ( https://buildkite.com/firecracker/firecracker-pr/builds/8976#018d9cd9-1a22-4508-b4bb-f1da0279335a/55-182 )
  • Fix all the unit test failures ( https://buildkite.com/firecracker/firecracker-pr/builds/8976#018d9cd9-1a2e-407f-b159-6eca360663da )
  • Post the result here when running it in parallel as instructed in #4412 since we cannot still run all the rust unit tests in parallel at the moment.

Thanks!

zulinx86 avatar Feb 12 '24 13:02 zulinx86

It's a matter of taste, but it might be better to use "test(jailer)" rather than "fix(jailer)". "fix" may look like fixing a bug of jailer. I believe it's like improvement of jailer unit tests.

zulinx86 avatar Feb 12 '24 17:02 zulinx86

@zulinx86 Thank you for the review!

Add a commit message explaining the context like why this commit is needed

fixed.

Fix all the unit test failures

I believe I've probably fixed the relevant parts. Since the tests were passing in my dev environment before the fix, I haven't been able to confirm correctly that the fix made the tests pass. Also, since some unit tests are failing in my dev environment, I haven't been able to confirm that all unit tests are passing. Could you please run CI?

※I use GCEfor dev environment.

Post the result here when running it in parallel as instructed in Fix running cargo test in parallel for vmm #4412 since we cannot still run all the rust unit tests in parallel at the moment.

here is result.

root@dbaa0c959c45:/firecracker# cargo test --package jailer
warning: patch for `kvm-bindings` uses the features mechanism. default-features and features will not take effect because the patch dependency does not support this mechanism
    Finished test [unoptimized + debuginfo] target(s) in 0.10s
     Running unittests src/main.rs (build/cargo_target/debug/deps/jailer-2ccd13efbd56ca77)

running 32 tests
test cgroup::tests::test_cgroup_builder_no_mounts ... ok
test cgroup::tests::test_get_controller ... ok
test env::tests::test_dup2 ... ok
test env::tests::test_join_netns ... ok
test env::tests::test_parse_resource_limits ... ok
test cgroup::tests::test_inherit_from_parent ... ok
test env::tests::test_copy_exec_to_chroot ... ok
test resource_limits::tests::test_default_resource_limits ... ok
test env::tests::test_validate_exec_file ... ok
test cgroup::tests::test_cgroup_builder_v2 ... ok
test resource_limits::tests::test_display_resource ... ok
test cgroup::tests::test_cgroup_builder_v1_with_v2_mounts ... ok
test resource_limits::tests::test_from_resource ... ok
test resource_limits::tests::test_install ... ok
test resource_limits::tests::test_set_resource_limits ... ok
test tests::test_clean_env_vars ... ok
test env::tests::test_save_exec_file_pid ... ok
test cgroup::tests::test_cgroup_builder_v1 ... ok
test cgroup::tests::test_cgroup_builder_v2_with_v1_mounts ... ok
test tests::test_to_cstring ... ok
test env::tests::test_copy_cache_info ... ok
test env::tests::test_cgroups_parsing ... ok
test env::tests::test_setup_jailed_folder ... ok
test env::tests::test_userfaultfd_dev ... ok
test env::tests::test_mknod_and_own_dev ... ok
test cgroup::tests::test_cgroup_build ... ok
test cgroup::tests::test_cgroup_build_invalid ... ok
test cgroup::tests::test_cgroup_v2_write_value ... ok
test env::tests::test_new_env ... ok
test tests::test_fds_close_range ... ok
test tests::test_sanitize_process ... ok
test tests::test_fds_proc ... ok

test result: ok. 32 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s

cm-iwata avatar Feb 13 '24 02:02 cm-iwata

@cm-iwata Could you please squash 3134792ac78cd256b9b3b0eeebc4dfb6b44d5f87 and 33ced7453eb12edc785cb4c3a9e5d58a0cf1ed04 into one commit? Two identical commits are shown in https://github.com/firecracker-microvm/firecracker/pull/4438/commits.

zulinx86 avatar Mar 13 '24 13:03 zulinx86

Could you please squash 3134792 and 33ced74 into one commit?

Sorry. I made a mistake while rebasing. I squashed tow commits.

cm-iwata avatar Mar 14 '24 04:03 cm-iwata

I thought the code was functioning correctly after the fix, but I've noticed occasional test failures. Could you please advise on the following issues?

1.Too many open files

Sometimes tests fail as follows.

---- tests::test_fds_proc stdout ----
thread 'tests::test_fds_proc' panicked at src/jailer/src/main.rs:409:33:
called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

---- tests::test_fds_close_range stdout ----
thread 'tests::test_fds_close_range' panicked at src/jailer/src/main.rs:409:33:
called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- tests::test_sanitize_process stdout ----
thread 'tests::test_sanitize_process' panicked at src/jailer/src/main.rs:409:33:
called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

The code causing this issue can be found here.

https://github.com/firecracker-microvm/firecracker/blob/b2eee1924637d659f551bb397cc4f427598d71b4/src/jailer/src/main.rs#L405-L408

When running the tests, 100 files are create per thread, it causing the file descriptor limit to be exceeded when running tests in multithread. Changing the loop condition n from 100 to 10 resolved the issue. Is there any specific reason for using the number 100? If not, would it be acceptable to change n to 10?

2. Bad file descriptor

Even after fixing 1, tests may still fail.

---- env::tests::test_new_env stdout ----
thread 'env::tests::test_new_env' panicked at src/jailer/src/env.rs:911:14:
This another new environment should be created successfully.: ReadLine("/tmp/firecracker/test/MCzQ/jailer/proc/mounts", Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I believe the cause is the following code.

https://github.com/firecracker-microvm/firecracker/blob/b2eee1924637d659f551bb397cc4f427598d71b4/src/jailer/src/main.rs#L442-L445

The error should occur in the following sequence:

  • thread1 opens fd4.
  • thread2 opens fd5.
  • thread1 closes fd4 and fd5.
  • thread 2 access to fd5, then Bad file descriptor error will occur

Commenting out the above test cases allows the tests to run successfully. Do you have any suggestions for how to fix the code? I couldn't think of any good ideas, so I would appreciate any advice.

cm-iwata avatar Mar 14 '24 07:03 cm-iwata

Hi @cm-iwata ,

I really apologize for my very late response :( I was overwhelmed with other prioritized tasks...

I'll definitely come back to this PR soon (at least during this week). I appreciate your understanding and patience.

Please let me know if you came up with something for the bad file descriptor issue.

Thanks,

zulinx86 avatar Aug 12 '24 09:08 zulinx86

Hi @cm-iwata,

The change itself looks good to me. Could you squash all the commits into one, because the later 4 commits are refactoring for the first one?

Regarding the "Too many open files" issue, I found a pretty weird behavior of Rust test. I inserted a code running ulimit -n at the beginning of run_close_fds_test() like this.

        let output = std::process::Command::new("sh")
            .arg("-c")
            .arg(format!("ulimit -n"))
            .output()
            .expect("Failed to run shell command");
        let stdout = String::from_utf8_lossy(&output.stdout);
        println!("ulimit -n: {stdout}");

When it succeeds, the value is 2047. On the other hand, when failing, it is 100.

# cargo test -p jailer -- --nocapture
...
ulimit -n: 2047

ulimit -n: 2047
...
test result: ok. 30 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.27s
# cargo test -p jailer -- --nocapture
...
ulimit -n: 100

ulimit -n: 100
...
thread 'thread 'tests::test_sanitize_processtests::test_fds_close_range' panicked at ' panicked at src/jailer/src/main.rssrc/jailer/src/main.rs::418418::3333:
:
called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test tests::test_sanitize_process ... FAILED
test tests::test_fds_close_range ... FAILED
...

I'm investigating the root cause of it...

Regarding the second issue (Bad file descriptor), I'm not able to reproduce the issue on my side. Could you provide me reproduction steps?

Thanks,

zulinx86 avatar Aug 14 '24 17:08 zulinx86

Now I found the root cause of different open file limit numbers between runs. Jailer has another test (test_install()) that imposes a new limit (100). If the test is concurrently running with them, those tests will fail because they open 100 files (+ 3 (stdin / stdout / stderr)). We can increase the limit of test_install() to 1000 since its default value is 2048 - 1.

https://github.com/firecracker-microvm/firecracker/blob/a364da806f8093e8d8ab1a8287be4a0efd4e4658/src/jailer/src/resource_limits.rs#L196

https://github.com/firecracker-microvm/firecracker/blob/a364da806f8093e8d8ab1a8287be4a0efd4e4658/src/jailer/src/resource_limits.rs#L12

zulinx86 avatar Aug 15 '24 09:08 zulinx86