
Fix flakiness in the Raiden CI

Open Dominik1999 opened this issue 4 years ago • 5 comments

Make builds more reliable

Almost every build fails on the first attempt. There is always a flaky integration test or a flaky process of some sort. We need to make this more stable.

Sit down and see how we can tackle this problem.

Goal

We need to get better at finding out whether a test is (still) flaky, and we need a process for that. We should not close issues for flaky tests just because we THINK they are no longer flaky.
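
One way to replace "we think it is no longer flaky" with a number is to ask how many consecutive green runs are needed before an issue can be closed. The following is a minimal sketch, assuming runs are independent and the test used to fail at a roughly known rate. `runs_needed` and its parameters are hypothetical names for illustration, not part of Raiden's tooling.

```python
import math


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Consecutive passes needed before a still-flaky test would have
    failed at least once with probability `confidence`.

    Assumption: runs are independent with a constant per-run failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(n passes in a row | still flaky) = (1 - failure_rate) ** n.
    # Find the smallest n with (1 - failure_rate) ** n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (0.5, 0.1, 0.02):
        print(f"failure rate {rate:.0%}: {runs_needed(rate)} green runs")
```

Under these assumptions, the rarer the failure, the more green runs are needed before closing the issue is justified.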

Todos:

  • [ ] Find a good way to get a flakiness report. Can we get one from CI or from its analytics? Ask on their support board
  • [ ] Create issues for the flaky tests or test setups
  • [ ] Create a health check that blocks are being generated (latest block numbers are advancing) for Geth and Parity, both in our test setup and in the private chain, or find another way to deal with unreliable Geth and Parity nodes

Timeboxed for 3 days
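
The health check from the last todo could look roughly like this. It is a sketch under stated assumptions: the node exposes the standard `eth_blockNumber` JSON-RPC method over HTTP, and the function names, the default attempt count, and the endpoint `http://localhost:8545` are placeholders chosen for this example, not taken from the Raiden test setup.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_block_number(rpc_url: str, timeout: float = 5.0) -> int:
    """Ask an Ethereum node for its latest block number via JSON-RPC."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The result is a hex string such as "0x4b7".
        return int(json.load(response)["result"], 16)


def blocks_advancing(
    get_block_number: Callable[[], int],
    attempts: int = 30,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as the block number increases, else False."""
    start_block = get_block_number()
    for _ in range(attempts):
        sleep(poll_interval)
        if get_block_number() > start_block:
            return True
    return False


if __name__ == "__main__":
    # Assumed endpoint; adjust to wherever the test node listens.
    url = "http://localhost:8545"
    healthy = blocks_advancing(lambda: fetch_block_number(url))
    print("blocks advancing" if healthy else "chain appears stuck")
```

Passing the block-number getter in as a callable keeps the check independent of the client, so the same loop can be pointed at Geth, Parity, or a fake in unit tests.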

Dominik1999 avatar Apr 24 '20 08:04 Dominik1999

@ulope can you add any findings from when you looked at the issue?

GataKamsky avatar Jun 11 '20 15:06 GataKamsky

I've looked at the 15 latest failures due to flakiness and found the following errors:

  • 4x TimeoutExpired
  • 10x Setup and Call timeout >540.0s
  • 1x matrix_client.errors.MatrixRequestError: 403: {"errcode":"M_FORBIDDEN","error":"No create event in auth events"}

14 of the 15 errors are timeouts.
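
A tally like the one above can be reproduced mechanically once the failure messages are collected. This is only a sketch: the substrings come from the three error kinds listed above, while the category labels and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the error kinds listed above; the category labels
# ("timeout", "matrix", "other") are made up for this sketch.
PATTERNS = [
    ("TimeoutExpired", "timeout"),
    ("Setup and Call timeout", "timeout"),
    ("MatrixRequestError", "matrix"),
]


def classify(message: str) -> str:
    """Map a failure message to a coarse category."""
    for needle, category in PATTERNS:
        if needle in message:
            return category
    return "other"


def tally(messages: Iterable[str]) -> Counter:
    """Count failures per category."""
    return Counter(classify(message) for message in messages)


if __name__ == "__main__":
    failures = (
        ["TimeoutExpired"] * 4
        + ["Setup and Call timeout >540.0s"] * 10
        + ["matrix_client.errors.MatrixRequestError: 403"]
    )
    print(tally(failures))
```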

karlb avatar Jun 15 '20 15:06 karlb

Here's a Makefile that helps with getting logs for failed tests in bulk. It's still quite rough (needs multiple make runs to actually finish all steps), but might be helpful in future investigations. It clearly was helpful for this issue.

# Configuration
CIRCLE_TOKEN = your_token_here
# number of builds/job (not workflows), will be rounded to the next 100
LIMIT = 1000

# Constants
BUILDS_URL = https://circleci.com/api/v1.1/project/github/raiden-network/raiden?limit=100&shallow=true&filter=failed&circle-token=$(CIRCLE_TOKEN)

all: download_artifacts untar_all

download_artifacts: json/builds.json $(shell jq '.[] | select(.workflows.job_name | endswith("integration")) | .build_num' -r json/builds.json | sed 's|.*|json/artifacts_&.done|')

json/artifacts_%.json:
	curl -s "https://circleci.com/api/v1.1/project/github/raiden-network/raiden/$*/artifacts?circle-token=$(CIRCLE_TOKEN)" > json/artifacts_$*.json

json/artifacts_%.done: json/artifacts_%.json
	jq -r '.[].url' $< | wget --no-verbose --no-clobber --force-directories --directory-prefix=artifacts -i -
	touch $@

json/builds.json: SHELL:=/bin/bash
json/builds.json:
	mkdir -p json untarred
	echo > $@
	@echo Get list of recent builds
	for offset in {0..$(LIMIT)..100}; do \
		echo curl -s "$(BUILDS_URL)&offset=$$offset" ;\
		curl -s "$(BUILDS_URL)&offset=$$offset" >> $@ && echo >> $@ ;\
	done

remove_build_list:
	rm -f json/builds.json

refresh_build_list: remove_build_list json/builds.json

untar_all: $(shell find . -name "*.tar.gz" | sed 's/$$/.done/')

%.tar.gz.done: %.tar.gz
	DIR=untarred/`dirname $@| cut -d/ -f 2-` && mkdir -p $$DIR && tar xf $< -C $$DIR
	touch $@

.PHONY: all download_artifacts untar_all remove_build_list refresh_build_list
.SECONDARY:
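
For readers less familiar with Make, the build-list step can be sketched in Python. This is an illustration, not a replacement for the Makefile: it mirrors the offset loop and the `jq` filter on `.workflows.job_name`, leaves out the `circle-token` parameter and the actual HTTP requests, and uses hypothetical function names. The multiple `make` runs appear to be needed because the `$(shell ...)` prerequisites are expanded when the Makefile is read, before `json/builds.json` or any downloaded archives exist.

```python
import json
from typing import Iterable, Iterator, List

# Same endpoint as BUILDS_URL in the Makefile, without the token parameter.
BUILDS_URL = (
    "https://circleci.com/api/v1.1/project/github/raiden-network/raiden"
    "?limit=100&shallow=true&filter=failed"
)


def page_urls(limit: int, page_size: int = 100) -> List[str]:
    """One URL per page, mirroring the bash loop over {0..LIMIT..100}."""
    return [
        f"{BUILDS_URL}&offset={offset}"
        for offset in range(0, limit + 1, page_size)
    ]


def integration_build_numbers(pages: Iterable[str]) -> Iterator[int]:
    """Yield build numbers of jobs whose name ends in "integration".

    Each element of `pages` is the JSON text of one API response,
    that is, a list of build objects.
    """
    for page in pages:
        for build in json.loads(page):
            # Unlike the jq filter, tolerate builds without workflow data.
            job_name = (build.get("workflows") or {}).get("job_name") or ""
            if job_name.endswith("integration"):
                yield build["build_num"]


if __name__ == "__main__":
    for url in page_urls(200):
        print(url)
```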

karlb avatar Jun 17 '20 09:06 karlb

I'll monitor failures on the develop branch. If anyone sees flaky tests in other branches, let me know. Until proven otherwise, I'll assume the changes in https://github.com/raiden-network/raiden/pull/6313 and https://github.com/raiden-network/raiden/pull/6312 are sufficient to remove most of our flakiness.

karlb avatar Jun 18 '20 06:06 karlb

Parity is still failing:

https://app.circleci.com/pipelines/github/raiden-network/raiden/9352/workflows/c8a45717-3b50-47fd-b0ae-cab555388a9b/jobs/130304/steps

hackaugusto avatar Jun 29 '20 11:06 hackaugusto