
E2E integration tests are flaky

Open hiltontj opened this issue 1 year ago • 5 comments

From time to time, some of the integration tests fail for strange reasons. It may be due to how a port is being selected for the running influxdb3 serve binary that is spun up in the test harness.

There is a function used to select a random available port: https://github.com/influxdata/influxdb/blob/7d37bbbce7ecd4f6c95445de2b22ee189dd93b85/influxdb3/tests/server/main.rs#L305-L319

However, the bound address is dropped before it is passed in to spawn the server (it has to be; otherwise the server would not be able to bind that address and would fail to start). That leaves a window in which another process or integration test could take over that port before the binary is started here: https://github.com/influxdata/influxdb/blob/7d37bbbce7ecd4f6c95445de2b22ee189dd93b85/influxdb3/tests/server/main.rs#L136
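For readers without the linked code open, the pattern looks roughly like the sketch below. It uses only the standard library and is not the actual helper from `main.rs`; the names `pick_free_addr` and `can_be_taken` are made up for illustration.

```rust
use std::net::{SocketAddr, TcpListener};

/// Ask the OS for a free port by binding to port 0, then release it again.
/// (Hypothetical helper, shown only to illustrate the pattern.)
fn pick_free_addr() -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    // `listener` is dropped when this function returns, so the port is free
    // again before the server process has had a chance to bind it.
    Ok(addr)
}

/// Anything that runs between `pick_free_addr` and the server's own bind can
/// do this and take the port; the server would then fail to start.
fn can_be_taken(addr: SocketAddr) -> bool {
    TcpListener::bind(addr).is_ok()
}
```

The gap between the drop and the spawned server's own bind is where a parallel test could end up holding the same port.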

Here are some examples of failures that seem rather odd:

  • https://app.circleci.com/pipelines/github/influxdata/influxdb/41549/workflows/0ec53a3d-8dab-44d4-923a-d40d9cbe2dc7/jobs/388404?invite=true#step-103-88325_47
  • https://app.circleci.com/pipelines/github/influxdata/influxdb/41547/workflows/4ee91a57-f417-4ae6-a7fd-74ef3a243311/jobs/388391?invite=true#step-103-88037_16

hiltontj avatar Oct 03 '24 00:10 hiltontj

One option would be to forgo spawning the influxdb3 serve command as a separate binary, and instead call the code that runs the service directly, as is done in this function: https://github.com/influxdata/influxdb/blob/7d37bbbce7ecd4f6c95445de2b22ee189dd93b85/influxdb3/src/commands/serve.rs#L320

This would require some refactoring to make sure that the test harness is starting things exactly as is done for the actual running binary, but would allow us to pass in a bound TcpListener/SocketAddr directly, and not have the issue described above.
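A rough sketch of that idea, using only the standard library: the harness binds the listener itself, keeps it alive, and hands it to the in-process server, so the port is never released in between. `serve_one` is a hypothetical stand-in for the real service entry point, which would take whatever listener type the server stack actually uses.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the service: accepts one connection on an
/// already-bound listener and replies with a fixed body.
fn serve_one(listener: TcpListener) -> std::io::Result<()> {
    let (mut stream, _peer) = listener.accept()?;
    stream.write_all(b"ok")
}

/// Harness side: bind first, then move the live listener into the server.
fn start_in_process() -> std::io::Result<(SocketAddr, JoinHandle<std::io::Result<()>>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let handle = thread::spawn(move || serve_one(listener));
    Ok((addr, handle))
}

/// Minimal client used by the test to talk to the server.
fn fetch(addr: SocketAddr) -> std::io::Result<String> {
    let mut reply = String::new();
    TcpStream::connect(addr)?.read_to_string(&mut reply)?;
    Ok(reply)
}
```

Because the listener is moved rather than dropped and re-bound, no other process can bind the port between selection and startup.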

One problem I see with this is the way we generate IDs for, e.g., databases and tables, using static atomics: if we were to have multiple test harnesses running in a single test, they could clash over IDs.

hiltontj avatar Oct 03 '24 12:10 hiltontj

Another option would be to have the binary log the port it is listening on, and scrape it from STDOUT in the test harness code.
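On the harness side, that could look something like the sketch below. The log format (`listening on <addr>`) is an assumption, not what `influxdb3 serve` prints today, and the function names are hypothetical. The server would be started on port 0 so the OS picks the port, and the reader would be a `BufReader` around the child's piped stdout.

```rust
use std::io::BufRead;
use std::net::SocketAddr;

/// Assumed marker in the server's startup log line; the real wording may differ.
const MARKER: &str = "listening on ";

/// Pull the socket address out of a single log line, if it has one.
fn parse_listen_addr(line: &str) -> Option<SocketAddr> {
    let start = line.find(MARKER)? + MARKER.len();
    let token = line[start..].split_whitespace().next()?;
    token.parse().ok()
}

/// Read lines until one announces the listen address.
/// Returns None if the stream ends (or errors) before that happens.
fn wait_for_listen_addr<R: BufRead>(reader: R) -> Option<SocketAddr> {
    reader
        .lines()
        .map_while(Result::ok)
        .find_map(|line| parse_listen_addr(&line))
}
```

In a real harness this read would probably need a timeout, since a server that never logs the line would otherwise block the test.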

hiltontj avatar Oct 03 '24 12:10 hiltontj

Another option would be to have an option in the influxdb3 serve command to write its port to a file, or notify some other service of the port it is listening on, and then gather that info from the test harness code after it has spawned the command.
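A minimal sketch of the file-based variant, assuming a hypothetical option that makes the server write its port to a given path. No such flag exists in the linked code; both function names are illustrative. The writer goes through a temporary file and a rename so the harness never reads a half-written port.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Server side (hypothetical): publish the port once the listener is bound.
fn write_port_file(path: &Path, port: u16) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, port.to_string())?;
    // Rename so readers see either no file or a complete one.
    fs::rename(&tmp, path)
}

/// Harness side: poll for the file until it holds a valid port or we time out.
fn wait_for_port_file(path: &Path, timeout: Duration) -> Option<u16> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(contents) = fs::read_to_string(path) {
            if let Ok(port) = contents.trim().parse() {
                return Some(port);
            }
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

Each test would need its own file path (for example inside a per-test temp directory) so parallel harnesses do not read each other's ports.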

hiltontj avatar Oct 03 '24 13:10 hiltontj

I think I like the log option

pauldix avatar Oct 03 '24 13:10 pauldix

There might be one more option: use Unix sockets (AF_UNIX). It looks like there's a crate that provides the hyper bridge. It would work on Linux/macOS, and we could perhaps fall back to AF_INET sockets on Windows. Given it's mainly for e2e tests, we can create the directory and the socket file inside it (under tmp) before the test starts.
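To illustrate the socket-file idea without the hyper bridge, here is a small sketch using only the standard library's `std::os::unix::net` types (Unix-only). The path scheme and the function names are assumptions made for the example; wiring this into the real server would go through whichever bridge crate is chosen.

```rust
use std::io::{Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical naming scheme: one socket file per test, under the temp dir.
fn socket_path(test_name: &str) -> PathBuf {
    std::env::temp_dir().join(format!("e2e-{}-{}.sock", test_name, std::process::id()))
}

/// Stand-in for the server: accept one connection and reply.
fn serve_one(listener: UnixListener) -> std::io::Result<()> {
    let (mut stream, _peer) = listener.accept()?;
    stream.write_all(b"ok")
}

/// Bind the socket file, serve one request on it, and clean up afterwards.
fn round_trip(path: &Path) -> std::io::Result<String> {
    let _ = std::fs::remove_file(path); // clear a stale socket from an earlier run
    let listener = UnixListener::bind(path)?;
    let server = thread::spawn(move || serve_one(listener));
    let mut reply = String::new();
    UnixStream::connect(path)?.read_to_string(&mut reply)?;
    server.join().expect("server thread panicked")?;
    std::fs::remove_file(path)?;
    Ok(reply)
}
```

Since each test owns its own path, there is no shared port namespace for tests to collide in.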

praveen-influx avatar Oct 08 '24 13:10 praveen-influx