
Failing test: test_eventdata_daily_volume[in-memory-it]

b-deam opened this issue 3 years ago · 3 comments

The test_eventdata_daily_volume[in-memory-it] test fails intermittently with the puzzling exit code 16384:

https://github.com/elastic/rally/blob/7dac7cb00c9d9199c198d8bf1be25d268766eaaf/it/distribution_test.py#L134-L138

We should probably look into why.

b-deam · Feb 03 '22 04:02

So, why exit code 16384? It's because the integration tests use os.system(), whose return value on POSIX is the raw wait status, not the exit code. os.waitstatus_to_exitcode() can decode it into the actual exit code, which is 64 in this case. That is what Rally returns when the result is ExitStatus.ERROR:

https://github.com/elastic/rally/blob/846b8e404b4b4c324cc95e6621f4992d0023da81/esrally/rally.py#L1071-L1074
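To illustrate the decoding (os.waitstatus_to_exitcode requires Python 3.9+; the bit-shift fallback works on any POSIX Python):

```python
import os

# os.system() returns the raw 16-bit wait status on POSIX, not the exit code.
status = 16384

# Python 3.9+: decode the wait status into the process exit code.
print(os.waitstatus_to_exitcode(status))  # 64, i.e. Rally's ExitStatus.ERROR

# Equivalent manual decoding on POSIX: the exit code lives in the high byte.
print((status >> 8) & 0xFF)  # 64
```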

So that's not really helpful. The only logs we currently get are:

07:22:50 it/distribution_test.py::test_eventdata_daily_volume[in-memory-it] 
07:22:50     ____        ____
07:22:50    / __ \____ _/ / /_  __
07:22:50   / /_/ / __ `/ / / / / /
07:22:50  / _, _/ /_/ / / / /_/ /
07:22:50 /_/ |_|\__,_/_/_/\__, /
07:22:50                 /____/
07:22:50 
07:22:56 [INFO] Race id is [ca1c4c24-8a14-487a-a6ce-c618aa1704dd]
07:22:56 [ERROR] Cannot race. Worker [0] has exited prematurely.
07:22:56 
07:22:56 Getting further help:
07:22:56 *********************
07:22:56 * Check the log files in /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/logs for errors.
07:22:56 * Read the documentation at https://esrally.readthedocs.io/en/latest/.
07:22:56 * Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
07:22:56 * Raise an issue at https://github.com/elastic/rally/issues and include the log files in /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/logs.
07:22:56 
07:22:56 -------------------------------
07:22:56 [INFO] FAILURE (took 6 seconds)
07:22:56 -------------------------------

Also not helpful! We definitely need to show the Rally logs in case of failure. (And note that this isn't specific to test_eventdata_daily_volume; I've seen this with other tests.)
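One way the test harness could surface those logs on failure (a sketch, not existing Rally code; the dump_rally_logs helper and the default path are hypothetical) is to tail every file in the logs directory:

```python
import pathlib


def dump_rally_logs(log_dir="~/.rally/logs", tail=50):
    """Print the last `tail` lines of each Rally log file.

    Sketch of what the integration tests could do when a race fails,
    so the actual error surfaces in the CI console output.
    """
    for log_file in sorted(pathlib.Path(log_dir).expanduser().glob("*.log")):
        lines = log_file.read_text(errors="replace").splitlines()
        print(f"==== {log_file} (last {tail} lines) ====")
        for line in lines[-tail:]:
            print(line)
```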

pquentin · Feb 03 '22 07:02

So @DJRickyB taught me how to get the logs from the Jenkins run, and for #1452 the build at https://elasticsearch-ci.elastic.co/job/elastic+rally+pull-request/127/consoleFull failed. The logs say:

2022-03-07 12:09:18,138 -not-actor-/PID:30535 esrally.rally INFO OS [uname_result(system='Linux', node='elasticsearch-ci-immutable-ubuntu-1804-1646653107324785781', release='5.4.0-1065-gcp', version='#69~18.04.1-Ubuntu SMP Sat Feb 12 01:39:35 UTC 2022', machine='x86_64', processor='x86_64')]
2022-03-07 12:09:18,139 -not-actor-/PID:30535 esrally.rally INFO Python [namespace(_multiarch='x86_64-linux-gnu', cache_tag='cpython-38', hexversion=50858736, name='cpython', version=sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0))]
2022-03-07 12:09:18,139 -not-actor-/PID:30535 esrally.rally INFO Rally version [2.4.0.dev0]
2022-03-07 12:09:18,139 -not-actor-/PID:30535 esrally.utils.net INFO Connecting directly to the Internet (no proxy support).
2022-03-07 12:09:18,336 -not-actor-/PID:30535 esrally.rally INFO Detected a working Internet connection.
2022-03-07 12:09:18,336 -not-actor-/PID:30535 esrally.metrics INFO Creating file race store
2022-03-07 12:09:18,337 -not-actor-/PID:30535 esrally.mechanic.mechanic INFO Could not find race [metrics-store] and will thus not persist system metrics.
2022-03-07 12:09:18,337 -not-actor-/PID:30535 esrally.mechanic.launcher INFO Shutting down [1] nodes on this host.
2022-03-07 12:09:18,510 -not-actor-/PID:30535 esrally.mechanic.launcher INFO Done shutting down node [metrics-store] in [0.2] s.
2022-03-07 12:09:18,512 -not-actor-/PID:30535 esrally.mechanic.provisioner INFO Wiping benchmark candidate installation at [/var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/7c7f4602-26cb-4fb5-8727-7da59f0fe94b/metrics-store/install/elasticsearch-7.6.0].

Which does not help either; everything here looks normal.

pquentin · Mar 07 '22 12:03

@DJRickyB pointed out that the timestamp I was looking at was wrong: I was reading the summary, so the log above is just from the last successful test. Scrolling up to the timestamps of the actual failure, we can see the real error in the logs:

2022-03-07 11:48:41,882 ActorAddr-(T|:40813)/PID:15936 esrally.mechanic.provisioner INFO Starting Docker container with configuration:
version: '2.2'
services:
  elasticsearch1:
    cap_add:
      - IPC_LOCK
    image: "docker.elastic.co/elasticsearch/elasticsearch:7.6.0"
    labels:
      io.rally.description: "elasticsearch-rally"
    ports:
      - 19200:19200
      - 9300
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/data/948c8933-4c6b-4f6a-9fdf-353440f1a668:/usr/share/elasticsearch/data
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/logs/server:/var/log/elasticsearch
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/heapdump:/usr/share/elasticsearch/heapdump
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/install/config/log4j2.properties:/usr/share/elasticsearch/config/log4j2.properties
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/install/config/jvm.options:/usr/share/elasticsearch/config/jvm.options
      - /var/lib/jenkins/workspace/elastic+rally+pull-request/.rally/benchmarks/races/1fcb8d32-dc97-4a8e-9e17-bd78b8da01c1/rally-node-0/install/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    healthcheck:
      test: nc -z 127.0.0.1 19200
      interval: 5s
      timeout: 2s
      retries: 10
    networks:
      - rally-es
networks:
  rally-es:
    driver_opts:
      com.docker.network.bridge.host_binding_ipv4: "127.0.0.1"
2022-03-07 11:49:12,722 ActorAddr-(T|:32781)/PID:15805 esrally.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:34771)] (ignoring).
2022-03-07 11:48:41,882 ActorAddr-(T|:40813)/PID:15936 esrally.mechanic.launcher INFO Starting node [rally-node-0] in Docker.
2022-03-07 11:49:12,672 ActorAddr-(T|:40813)/PID:15936 esrally.utils.process INFO Creating network "install_rally-es" with the default driver
Pulling elasticsearch1 (docker.elastic.co/elasticsearch/elasticsearch:7.6.0)...
Head "https://docker.elastic.co/v2/elasticsearch/elasticsearch/manifests/7.6.0": dial tcp 34.68.230.202:443: i/o timeout

2022-03-07 11:49:12,673 ActorAddr-(T|:40813)/PID:15936 root ERROR Docker daemon startup failed with exit code [1]

So the undocumented DockerProvisioner calls docker-compose, and here the image pull failed with an I/O timeout. The fix is to run docker-compose pull first, with a few retries, and only then run docker-compose up.
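The retry logic could look roughly like this (a sketch under the assumption that we shell out to docker-compose as today; run_with_retries is a hypothetical helper, not part of DockerProvisioner):

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, delay=1.0):
    """Run a command, retrying on non-zero exit.

    Returns the number of attempts used; raises RuntimeError when all
    attempts fail. Meant for transient failures such as the registry
    i/o timeout seen above.
    """
    for attempt in range(1, retries + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError(f"{cmd[0]} failed after {retries} attempts")


# Usage sketch for the fix described above (compose_file is illustrative):
# run_with_retries(["docker-compose", "-f", compose_file, "pull"])
# subprocess.run(["docker-compose", "-f", compose_file, "up", "-d"], check=True)
```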

But do we need DockerProvisioner at all?

pquentin · Mar 07 '22 13:03