Fix flakiness in the Raiden CI
Get builds more reliable
Almost every build fails on the first attempt. There is always a flaky integration test or process of some sort. We need to make CI more stable.
Sit down and see how we can tackle this problem.
Goal
We need to get better at finding out whether a test is (still) flaky, and we need a process for that. We should not close flaky-test issues just because we THINK the test is no longer flaky.
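One possible way to make "is this test still flaky?" a measurable question is to flag any test that both passed and failed on the same commit. The sketch below assumes run results are available as `(test_name, commit, passed)` tuples; that record shape and the function name `find_flaky_tests` are hypothetical and not part of the Raiden code base.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_flaky_tests(runs: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Return the tests that both passed and failed on the same commit.

    Each run is a hypothetical (test_name, commit, passed) record.
    """
    outcomes = defaultdict(set)
    for test_name, commit, passed in runs:
        outcomes[(test_name, commit)].add(passed)
    # Two distinct outcomes for one (test, commit) pair means the result
    # changed without a code change, which is the working definition of flaky here.
    flaky = {test for (test, _commit), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)
```

A test that no longer shows up in this list over a reasonable number of reruns would be a candidate for closing its issue, instead of closing it on a hunch.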
Todos:
- [ ] Find a good way to get a flakiness report. Can we get one from CI itself, or from analytics? Ask on their support board.
- [ ] Create issues for the flaky tests or test setups
- [ ] Create a health check that verifies blocks are being generated (the latest block number is advancing) for Geth and Parity in our test setup and in the private chain, or find another way to deal with unreliable Geth and Parity nodes
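The block-generation health check from the todo list could look roughly like the following. This is a minimal sketch, not the check used in the Raiden test setup: it polls the standard `eth_blockNumber` JSON-RPC method, which both Geth and Parity expose, and reports whether the number rises within a time budget. The function names, the default timings, and the injectable `sleep` parameter are assumptions made for illustration.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_block_number(rpc_url: str, timeout: float = 5.0) -> int:
    """Ask an Ethereum JSON-RPC endpoint for its latest block number."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The result is a hex-encoded quantity, so parse it with base 16.
        return int(json.load(response)["result"], 16)


def blocks_are_advancing(
    get_block_number: Callable[[], int],
    wait_seconds: float = 30.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as the block number rises above its first reading."""
    start = get_block_number()
    waited = 0.0
    while waited < wait_seconds:
        sleep(poll_interval)
        waited += poll_interval
        if get_block_number() > start:
            return True
    # The chain did not produce a block within the time budget.
    return False
```

Assuming a node listens on a local HTTP RPC port (the URL here is a placeholder), the check could be called as `blocks_are_advancing(lambda: fetch_block_number("http://127.0.0.1:8545"))` before the tests start, so that a stalled node fails fast instead of surfacing as a timeout much later.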
Timeboxed for 3 days
@ulope, can you add any findings from when you looked at this issue?
I've looked at the 15 latest failures due to flakiness and found the following errors:
- 4x TimeoutExpired
- 10x Setup and Call timeout >540.0s
- 1x matrix_client.errors.MatrixRequestError: 403: {"errcode":"M_FORBIDDEN","error":"No create event in auth events"}
14 of the 15 errors are timeouts.
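A tally like the one above can be reproduced from log lines with a few lines of Python. This is only a sketch: the patterns are taken from the error messages listed above, while the category labels and the function names `classify` and `tally` are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Patterns copied from the error messages in the list above.
PATTERNS = {
    "TimeoutExpired": re.compile(r"TimeoutExpired"),
    "Setup and Call timeout": re.compile(r"Setup and Call timeout"),
    "MatrixRequestError": re.compile(r"MatrixRequestError"),
}


def classify(line: str) -> Optional[str]:
    """Return the first category whose pattern matches the line, if any."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each known failure category."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Feeding it one representative error line per failed build gives a per-category count, which makes it easy to see whether timeouts still dominate after a fix lands.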
Here's a Makefile that helps with getting logs for failed tests in bulk. It's still quite rough (needs multiple make runs to actually finish all steps), but might be helpful in future investigations. It clearly was helpful for this issue.
# Configuration
CIRCLE_TOKEN = your_token_here
# number of builds/job (not workflows), will be rounded to the next 100
LIMIT = 1000

# Constants
BUILDS_URL = https://circleci.com/api/v1.1/project/github/raiden-network/raiden?limit=100&shallow=true&filter=failed&circle-token=$(CIRCLE_TOKEN)

all: download_artifacts untar_all

download_artifacts: json/builds.json $(shell jq '.[] | select(.workflows.job_name | endswith("integration")) | .build_num' -r json/builds.json | sed 's|.*|json/artifacts_&.done|')

json/artifacts_%.json:
	curl -s "https://circleci.com/api/v1.1/project/github/raiden-network/raiden/$*/artifacts?circle-token=$(CIRCLE_TOKEN)" > json/artifacts_$*.json

json/artifacts_%.done: json/artifacts_%.json
	jq -r '.[].url' $< | wget --no-verbose --no-clobber --force-directories --directory-prefix=artifacts -i -
	touch $@

json/builds.json: SHELL:=/bin/bash
json/builds.json:
	mkdir -p json untarred
	echo > $@
	@echo Get list of recent builds
	for offset in {0..$(LIMIT)..100}; do \
		echo curl -s "$(BUILDS_URL)&offset=$$offset" ;\
		curl -s "$(BUILDS_URL)&offset=$$offset" >> $@ && echo >> $@ ;\
	done

remove_build_list:
	rm -f json/builds.json

refresh_build_list: remove_build_list json/builds.json

untar_all: $(shell find . -name "*.tar.gz" | sed 's/$$/.done/')

%.tar.gz.done: %.tar.gz
	DIR=untarred/`dirname $@| cut -d/ -f 2-` && mkdir -p $$DIR && tar xf $< -C $$DIR
	touch $@

.PHONY: all download_artifacts untar_all remove_build_list refresh_build_list
.SECONDARY:
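For follow-up analysis outside of make, the build list can also be read from Python. Because the Makefile appends one JSON array per `offset` request to `json/builds.json`, the file is a sequence of JSON documents rather than a single one, so a plain `json.load` would reject it. The sketch below walks the documents with `json.JSONDecoder.raw_decode` and applies the same selection as the `jq` filter (`workflows.job_name` ending in `integration`, returning `build_num`). The function names are hypothetical, and treating builds without a `workflows` entry as non-matching is an assumption.

```python
import json
from typing import Iterator, List


def iter_json_documents(text: str) -> Iterator[object]:
    """Yield each JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        document, pos = decoder.raw_decode(text, pos)
        yield document


def integration_build_numbers(text: str) -> List[int]:
    """Return build numbers of jobs whose name ends with "integration"."""
    numbers = []
    for page in iter_json_documents(text):
        for build in page:
            # Assumption: builds without workflow data are simply skipped.
            job_name = (build.get("workflows") or {}).get("job_name") or ""
            if job_name.endswith("integration"):
                numbers.append(build["build_num"])
    return numbers
```

With the file produced by the Makefile, this could be used as `integration_build_numbers(open("json/builds.json").read())`.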
I'll monitor failures on the develop branch. If anyone sees flaky tests in other branches, let me know. Until proven otherwise, I'll assume the changes in https://github.com/raiden-network/raiden/pull/6313 and https://github.com/raiden-network/raiden/pull/6312 are sufficient to remove most of our flakiness.
Parity is still failing:
https://app.circleci.com/pipelines/github/raiden-network/raiden/9352/workflows/c8a45717-3b50-47fd-b0ae-cab555388a9b/jobs/130304/steps