Gvisor CI test on ARM64

Open zhlhahaha opened this issue 4 years ago • 5 comments

Hi @avagin @amscanne, I saw that many ARM64-related patches have been merged, and that cross-building of multi-arch binaries and images is now enabled. I am wondering if there is any plan to expand the current CI tests for ARM. We are happy to help.

Besides this, I am confused about why gVisor uses both Buildkite and Kokoro for its CI tests. Will the gVisor CI lanes keep using both Buildkite and Kokoro, or will one of them be discarded?

The following image summarizes the current gVisor test status on an ARM64 server.

image

Thanks

zhlhahaha avatar Jan 12 '21 06:01 zhlhahaha

cc: @lubinszARM

zhlhahaha avatar Jan 12 '21 06:01 zhlhahaha

Hey @zhlhahaha, yes, we have been working to get the ARM pieces up to par with the x86_64 infrastructure. @avagin has already added a virtualized ARM smoke test (example).

To provide some context on the Kokoro/buildkite question: Kokoro is a Google-internal tool for test orchestration. I've been migrating all the tests to BuildKite in order to improve:

  • Openness of the infrastructure (less internal tooling)
  • Flexibility of the workflows (you can just edit .buildkite/pipeline.yaml)
  • Some scalability for presubmits (we now have full runtime tests for each commit)

This migration is nearly complete, and the Kokoro tests should no longer be running as part of presubmits.

There is one more reason to use BuildKite -- it does enable proper, physical ARM servers! We definitely want to do this and add it to the pipeline. However, it is critical that the pipeline is reliable, which typically means that the server deployments should be automated in some way (e.g. "auto-healing").

If you have a good way to get a good number of ARM instances, then we'd love to add them to the pool and add ARM-based tests to the pipeline. (We did already have the ARM instance registered with the pipeline, but are just thinking about how to make this more reliable and scalable.) Let me know your thoughts!

amscanne avatar Jan 12 '21 06:01 amscanne

Thanks for your quick reply. How many ARM instances are needed?

zhlhahaha avatar Jan 12 '21 06:01 zhlhahaha

Since https://github.com/google/gvisor/pull/5275 has been merged, most ptrace test failures should be fixed on ARM64.

zhlhahaha avatar Jan 20 '21 05:01 zhlhahaha

Hi @amscanne @avagin @lubinszARM, here is an update on the CI tests on ARM64. I ran into the following two issues in the ptrace syscall tests on an ARM server. One is solved and the other is not.

1. ptrace syscall test fails with error number 524

On the ARM server, the ptrace syscall tests always failed partway through the run with error number 524. I spent some time debugging it. Conclusion first: this issue is resolved by the patch https://www.spinics.net/lists/kernel/msg3768903.html

The cause is that fqdir_work_fn blocks system_wq (a workqueue), so all work queued in system_wq is blocked. When the gVisor tests create a new runsc instance and install a seccomp filter, the BPF JIT region fills up, because bpf_prog_free_deferred, which is also queued in system_wq, is blocked and no memory gets released.
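To make the failure mode in issue 1 easier to picture, here is a rough toy model in Python. It is only a sketch under strong assumptions: a single-worker FIFO queue and a fixed-size pool stand in for system_wq and the BPF JIT region, and names such as `ToyPool` and `simulate` are invented for illustration. It is not kernel code and does not reflect how the real workqueue schedules its workers.

```python
from collections import deque


class ToyPool:
    """Hypothetical fixed-size pool standing in for a limited memory region."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self):
        # Allocation fails once the pool is full.
        if self.used >= self.capacity:
            return False
        self.used += 1
        return True

    def free(self):
        self.used -= 1


def simulate(capacity, rounds, blocked):
    """Count failed allocations when frees are deferred through a shared queue.

    If `blocked` is True, a work item that never completes sits at the head
    of the queue, so the deferred frees queued behind it never run.
    """
    pool = ToyPool(capacity)
    queue = deque()
    if blocked:
        queue.append("blocker")  # stand-in for a work item that never finishes

    failures = 0
    for _ in range(rounds):
        # Each round allocates directly, then defers the release to the queue.
        if pool.alloc():
            queue.append("free")
        else:
            failures += 1

        # The single worker only ever looks at the head of the queue.
        if queue and queue[0] != "blocker":
            queue.popleft()
            pool.free()
    return failures


if __name__ == "__main__":
    print("failures without blocker:", simulate(capacity=4, rounds=10, blocked=False))
    print("failures with blocker:   ", simulate(capacity=4, rounds=10, blocked=True))
```

In this model the pool never fills while the queue drains normally, but with a stuck item at the head, every allocation after the first four fails. That matches the shape of the problem described above, where releases are queued but never executed.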

2. ptrace syscall test TIMEOUT

When I run all the ptrace syscall tests together on the ARM64 server with the following command, some test cases time out randomly:

    bazel test --nocache_test_results --test_tag_filters=runsc_ptrace //test/syscalls/...

I compared the log files from running all syscall tests together and from running one specific test alone, and found that the main reason for the TIMEOUT is that the creation time of each test case differs between the two scenarios. The time spent on the actual test is almost the same in both scenarios. For example, the creation time of each test case in socket_ipv4_udp_unbound_loopback_test is 0.63s vs 3s in the together run and the solo run.
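As a back-of-the-envelope check on issue 2, the sketch below shows how per-test-case creation time alone can push a test target over its timeout while the actual test time stays the same. Only the 0.63s and 3s creation times come from the logs above. The number of test cases, the per-case run time, and the timeout are hypothetical values picked for illustration, and the helper names are made up.

```python
def total_time(n_cases, setup_s, run_s):
    """Total wall time if every test case pays a setup cost plus its run time."""
    return n_cases * (setup_s + run_s)


def exceeds_timeout(n_cases, setup_s, run_s, timeout_s):
    """True if the summed setup and run time is above the timeout."""
    return total_time(n_cases, setup_s, run_s) > timeout_s


if __name__ == "__main__":
    N_CASES = 100      # hypothetical number of test cases in one target
    RUN_S = 0.5        # hypothetical per-case run time, same in both scenarios
    TIMEOUT_S = 300.0  # hypothetical timeout for the target

    # 0.63 and 3.0 are the two per-case creation times reported above.
    for setup_s in (0.63, 3.0):
        over = exceeds_timeout(N_CASES, setup_s, RUN_S, TIMEOUT_S)
        print(f"setup={setup_s}s per case -> exceeds timeout: {over}")
```

With these assumed numbers, the 0.63s creation time stays under the timeout and the 3s creation time does not, even though the per-case run time is identical.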

zhlhahaha avatar Mar 12 '21 08:03 zhlhahaha

Please let me know if there is any update on this issue.

odidev avatar Dec 20 '22 11:12 odidev

We have ARM build, smoke, unit and syscall tests running on CI: https://buildkite.com/gvisor/pipeline/builds/18213#_

zkoopmans avatar Dec 20 '22 17:12 zkoopmans