apisix-ingress-controller

request help: How to debug failing e2e-test cases efficiently

Open stillfox-lee opened this issue 3 years ago • 8 comments

Issue description

Hi, guys. Currently, e2e-test still fails randomly. I tried setting a breakpoint to debug, but the error did not reproduce. Is there an efficient way to solve this problem? What's the best practice?

Environment

  • your apisix-ingress-controller version (output of apisix-ingress-controller version --long):
  • your Kubernetes cluster version (output of kubectl version):
  • if you run apisix-ingress-controller in Bare-metal environment, also show your OS version (uname -a):

stillfox-lee avatar Jul 07 '22 10:07 stillfox-lee

You can add focus to only run a specific test case

tao12345666333 avatar Jul 07 '22 10:07 tao12345666333

https://onsi.github.io/ginkgo/

tao12345666333 avatar Jul 07 '22 10:07 tao12345666333

You can add focus to only run a specific test case

Yes, focus can do that. But ginkgo tears down after every test case. I want to keep the environment that the test case set up, so that I can debug and find out what causes the test case to fail.

stillfox-lee avatar Jul 07 '22 12:07 stillfox-lee

Just add a sleep? 😜

tao12345666333 avatar Jul 07 '22 13:07 tao12345666333

Just add a sleep? 😜

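Sorry, my English is not good enough to express exactly what I mean, so I replied in Chinese. Of course, a sleep can preserve the environment and leave plenty of time to debug. But I don't think it is an elegant approach, mainly for the following reasons: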

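  1. We are discussing random errors that are specific to e2e-test and may not occur on a second run. So adding a sleep after a test case has already failed does not help locate the problem in this scenario.
  2. I wonder whether we need a general solution for random errors, one that can be built into the e2e-test code. For example, when a test case fails, we could sample the k8s cluster and collect enough information to help us locate the problem. Then we would not need to add a sleep after a failure, rerun the test, and inspect the actual state of the cluster to identify the problem.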

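As for the random errors, I suspect there are two main causes. The first is an objective problem with the cluster environment, in which case perhaps all we can do is run the test again. The second is that our test cases are not constructed carefully enough, so some edge cases trigger errors. I think the second kind still needs to be fixed. From my limited experience with PRs, these random errors appear often in CI. For both contributors and reviewers, this costs a lot of time. So I think we should spend some time improving test stability. I don't know what the industry best practice is here; if you can give me some guidance, I can make some improvements to the project in this area.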

stillfox-lee avatar Jul 07 '22 15:07 stillfox-lee

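I wonder whether we need a general solution for random errors, one that can be built into the e2e-test code. For example, when a test case fails, we could sample the k8s cluster and collect enough information to help us locate the problem. Then we would not need to add a sleep after a failure, rerun the test, and inspect the actual state of the cluster to identify the problem.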

Currently we have the relevant processing logic. But maybe we can make some optimizations.

For example, we can add an environment variable. If it is run locally by the developer, when a failure occurs, the relevant resources will be kept and not deleted.

https://github.com/apache/apisix-ingress-controller/blob/f0217ae5b022d6086bab2155dd3053567b3fc3aa/test/e2e/scaffold/scaffold.go#L440-L475
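
A rough Go sketch of the environment-variable idea is shown here for illustration. It does not reflect the linked scaffold code: the variable name `E2E_KEEP_ON_FAILURE` and the functions `shouldCleanup` and `teardown` are hypothetical, and the namespace deletion is passed in as a callback so the example stays self-contained.

```go
package main

import (
	"fmt"
	"os"
)

// keepEnvVar is a hypothetical variable name, not one the project defines.
const keepEnvVar = "E2E_KEEP_ON_FAILURE"

// shouldCleanup reports whether test resources should be deleted.
// Resources are kept only when the spec failed AND the developer opted in.
func shouldCleanup(specFailed bool, getenv func(string) string) bool {
	keep := getenv(keepEnvVar) == "true"
	return !(specFailed && keep)
}

// teardown deletes the namespace through the supplied callback unless the
// resources should be preserved for debugging.
func teardown(namespace string, specFailed bool, deleteNamespace func(string) error) error {
	if !shouldCleanup(specFailed, os.Getenv) {
		fmt.Printf("spec failed; keeping namespace %q for debugging\n", namespace)
		return nil
	}
	return deleteNamespace(namespace)
}

func main() {
	del := func(ns string) error {
		fmt.Println("deleting", ns)
		return nil
	}
	// With the variable unset the namespace is deleted even after a failure.
	_ = teardown("e2e-example", true, del)
}
```

Leaving the variable unset in CI would keep the current behavior there, so only local runs that opt in retain the failed environment.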

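As for the random errors, I suspect there are two main causes. The first is an objective problem with the cluster environment, in which case perhaps all we can do is run the test again. The second is that our test cases are not constructed carefully enough, so some edge cases trigger errors. I think the second kind still needs to be fixed.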

You are right!

If there is a job failure, we usually create an issue to track until the problem goes away.

tao12345666333 avatar Jul 08 '22 07:07 tao12345666333

For example, we can add an environment variable. If it is run locally by the developer, when a failure occurs, the relevant resources will be kept and not deleted.

That's a good idea. Maybe I can create PR for this?

stillfox-lee avatar Jul 12 '22 13:07 stillfox-lee

Sure! Pleasessss!

tao12345666333 avatar Jul 12 '22 15:07 tao12345666333

This issue has been marked as stale due to 90 days of inactivity. It will be closed in 30 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Oct 11 '22 01:10 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

github-actions[bot] avatar Nov 11 '22 01:11 github-actions[bot]