
Create a CI/CD workflow for ATOM using GitHub Actions

Kazadhum opened this issue 1 year ago • 30 comments

I've been working on integrating ATOM into a CI/CD pipeline in which simulations run in the cloud, using AWS as the cloud provider.

To this end, I am using Rigel as the tool that "prepares" ATOM for cloud simulations. Specifically, Rigel creates two Docker containers:

  • a simulation application, which (for now) simply launches a Gazebo environment with a simple robot (two RGB cameras on a tripod);
  • a robot application, which calibrates a dataset (for now).

Note that the names simulation and robot are used only because that is the nomenclature AWS uses. Generally, the simulation app includes the things we don't want to test, while the robot app contains everything we wish to test - in this case, the calibration process.

Rigel is a plugin-based tool, and so far I'm making use of four plugins, the first three of which are:

  • dockerfile, which automatically creates a Dockerfile for building a Docker image that, in this case, contains ATOM and the robot we wish to calibrate;
  • build, which actually builds said image;
  • test, which tests the applications locally.

The fourth plugin is one I am currently developing; it performs introspection of a process (in this case, the calibration process) by reading a results .csv file. The way this fits into my current pipeline is the following: my robot application not only runs the calibration, but also runs the calibration evaluation, outputting said .csv file (this is only a feature in @JorgeFernandes-Git's ATOM branch, if I'm not mistaken). The test plugin, after running this evaluation, extracts the produced .csv file from the container, which is then used by my file_introspection plugin (see the sketch below).
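As a rough illustration of what the introspection step might do (a sketch only; the actual plugin is written as a Rigel plugin, and the column layout and error threshold below are assumptions):

  # Hypothetical check on the evaluation results file: fail if any row's
  # error column (assumed here to be column 2) exceeds a chosen threshold.
  THRESHOLD=1.0
  awk -F',' -v t="$THRESHOLD" 'NR > 1 && $2+0 > t { bad=1 } END { exit bad }' rgb_to_rgb_results.csv \
    && echo "calibration within tolerance" \
    || { echo "calibration error above threshold" >&2; exit 1; }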

You can see this working in this video:

local_testing_calibration_evaluation.webm

The next step in this project is to get these local tests and the introspection running in a CI/CD pipeline using GitHub Actions. I will be working on my ATOM fork, found here.

Kazadhum · Mar 24 '23

Your work looks awesome @Kazadhum. Congratulations

JorgeFernandes-Git · Mar 25 '23

Great work @Kazadhum. Keep it up.

miguelriemoliveira · Mar 25 '23

I'm running into an issue right at the beginning of trying to implement this. I've tried to figure it out, but finding answers is proving more complicated than I thought. Running a very simple workflow, I'm getting an error in the install_target_dependencies stage of the industrial_ci job. @rarrais, @MisterOwlPT, do you have any insights?

  $ ( source /opt/ros/noetic/setup.bash && rosdep install -q --from-paths /root/target_ws/src --ignore-src -y | grep -E '(executing command)|(Setting up)' ; )
  ERROR: the following packages/stacks could not have their rosdep keys resolved
  to system dependencies:
  atom_calibration: Cannot locate rosdep definition for [python3-graphviz-pip]
  '( source /opt/ros/noetic/setup.bash && rosdep install -q --from-paths /root/target_ws/src --ignore-src -y | grep -E '(executing command)|(Setting up)' ; )' returned with 1
'install_target_dependencies' returned with code '1' after 0 min 1 sec

Kazadhum · Mar 28 '23

It seems that the issue is related to dependencies -- maybe the atom_calibration package does not have its dependencies properly declared? Can you please paste the link to the log here so that we can have a look at the stage of the process where this happens?

Let me also tag @sergiodmlteixeira here so that he can have a look as well.

rarrais · Mar 29 '23

Thank you, @rarrais, here's the log. I did try changing that dependency in the atom_calibration package to python3-graphviz (from python3-graphviz-pip), which did work, but I wanted to see if I could do it without changing the ATOM dependencies.
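(For reference, a quick way I could have sanity-checked this locally is to ask rosdep to resolve each key on a Noetic/Ubuntu setup; this is just a sketch of the check, not part of the pipeline.)

  source /opt/ros/noetic/setup.bash
  rosdep update
  rosdep resolve python3-graphviz-pip   # fails: no such key in the rosdep database
  rosdep resolve python3-graphviz       # should resolve to the system package python3-graphviz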

For the record, here's the log from when I changed the dependency in `atom_calibration`.

In the meantime, I've been working on getting Rigel to run without using industrial_ci to build ATOM first (to save Actions usage time).

Kazadhum · Mar 29 '23

OK, so I've had a bit of trouble running Rigel in GitHub Actions. First, I had issues with the Poetry environment, because I couldn't use the poetry shell command to "get into" the virtual environment. I fixed this by simply activating it using:

. /home/runner/.cache/pypoetry/virtualenvs/rigel-u4E7_ENg-py3.10/bin/activate
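(A possibly simpler alternative, which I haven't verified in this workflow yet, would be to let Poetry run the command inside its own environment instead of activating the virtualenv by its full path:)

  # run rigel through the project's Poetry environment (assumes the step
  # runs from the directory containing Rigel's pyproject.toml)
  poetry run rigel run sequence deploy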

The problem I'm currently running into is the following: I can enter the virtual env and the rigel run command is recognized, but I'm getting 'unexpected extra arguments' for the dockerfile job. Here's the log. The same thing happens when running rigel run sequence deploy.

Do you have any ideas, @rarrais, @MisterOwlPT, @sergiodmlteixeira? Thanks in advance!

Kazadhum · Mar 30 '23

Hi @rarrais, @MisterOwlPT and @miguelriemoliveira! I'm tagging you to let you know about my progress!

I've managed to circumvent this issue by simply installing both Rigel's develop branch and my File Introspection Plugin via pip in GitHub Actions, and I can now run rigel run sequence deploy (i.e. the Dockerfile generation and Docker image building) without issues. Here's the corresponding log and workflow file.
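(For reference, the install step boils down to something like the sketch below; the repository URLs are placeholders here, the real ones are in the workflow file linked above.)

  # install Rigel's develop branch and the introspection plugin straight from git
  pip install git+https://github.com/<rigel-org>/rigel.git@develop
  pip install git+https://github.com/<my-user>/<file-introspection-plugin>.git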

Now, I'm going to try running the calibration, evaluation and introspection as well, so I can then work on getting several calibrations and introspections running in parallel.

Kazadhum · Apr 03 '23

I've run into an issue again. When running the testing and introspection plugins in GitHub Actions, the calibration evaluation results file rgb_to_rgb_results.csv is not found inside the container. Because of this, it isn't saved as an artifact and the introspection does not run. Here's the log. This did not happen locally. I suspect this might be a problem with environment variables, but I'm not sure about that.

Kazadhum · Apr 04 '23

Hi @Kazadhum, looking at the logs and at the workflow file, it seems to me that the issue might be that the plugin cannot access the file. In particular, the file does not appear to be in this directory: /home/runner/.rigel/archives/test/latest/calibration_evaluation/rgb_to_rgb_results.csv.

Do we have a way to confirm where the calibration procedure is saving the file?

rarrais · Apr 05 '23

Hi @rarrais! By looking at line 95 of that same log, it looks like the file is not in that directory because it doesn't exist in the container at all. I'll run some debugging commits to see if I can ascertain the reason.

Kazadhum · Apr 05 '23

Looks like my suspicion was correct and the file doesn't exist in the container at all; I've confirmed this by checking the directory structure, like this.
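(For anyone trying to reproduce the check, a quick way to look for the file inside the running container is something along these lines; the container name is the one Rigel gives it in this setup.)

  # search the whole container filesystem for the results file
  docker exec calibration_evaluation bash -c "find / -name 'rgb_to_rgb_results.csv' 2>/dev/null"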

So it seems that the issue occurs inside the container generated by Rigel: Rigel cannot find the file inside the container to output as an artifact. The weird thing is that this does not happen when running locally. So I wanted to ask, @rarrais, @MisterOwlPT, @sergiodmlteixeira, because I can't find a clear answer online or in the Actions documentation: what exactly is running my workflow? Is it a container or a VM?

Thanks in advance :)

Kazadhum · Apr 05 '23

Hello @rarrais! Since we spoke yesterday, I've been working locally with act, and using it I can attach a terminal to the Rigel container inside the GitHub Actions container! I've found that the calibration_evaluation container runs into the following error:


ImportError: this platform is not supported: ('failed to acquire X connection: Can\'t connect to display ":0": b\'No protocol specified\\n\'', DisplayConnectionError(':0', b'No protocol specified\n'))

Try one of the following resolutions:

 * Please make sure that you have an X server running, and that the DISPLAY environment variable is set correctly
INFO - 2023-04-13 14:35:42,886 - core - signal_shutdown [atexit]

I'm trying to enable access to an X server with xhost as part of the workflow to see if that fixes it.

Kazadhum · Apr 13 '23

Hi @Kazadhum, good news on the progress. Do you know why (or whether) calibration_evaluation actually needs an X server to run? As it is part of ATOM, maybe @miguelriemoliveira can help.

rarrais · Apr 13 '23

@rarrais It might have been a mistake on my end. It probably doesn't need it and I just put it in the Rigelfile in case it was needed. I'll run some local tests to check.

Kazadhum · Apr 13 '23

Hey @rarrais and @miguelriemoliveira. It seems I was wrong and it is in fact ATOM that needs the X server to run, so it wasn't my mistake.

Running the calibration locally without mounting the X11 volume and setting the DISPLAY env variable results in the same error. So the question now is: how can I enable an X server inside the GitHub Actions container? Just running xhost + doesn't seem to work. I tried this using act and then ran it on GitHub so you can see the log.
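(One thing I'm considering trying, sketched below: instead of xhost, start a virtual framebuffer X server with Xvfb in the job and point DISPLAY at it. The package name and display number are the usual Ubuntu defaults; I haven't confirmed yet that this works inside the Actions container.)

  sudo apt-get update && sudo apt-get install -y xvfb
  Xvfb :99 -screen 0 1280x1024x24 &   # virtual display, no physical screen needed
  export DISPLAY=:99
  # ...then launch the calibration/evaluation as before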

Kazadhum · Apr 13 '23

Hi @Kazadhum,

The calibration evaluation runs some OpenCV imshow calls, so it needs an X server.

I think there is a mode, however, where no windows are launched.

How are you launching the script?

miguelriemoliveira · Apr 14 '23

Hi @miguelriemoliveira! I'm using:

rosrun atom_evaluation rgb_to_rgb_evaluation -train_json $ATOM_DATASETS/t2rgb/atom_calibration.json -test_json $ATOM_DATASETS/t2rgb/dataset.json --sensor_source right_camera --sensor_target left_camera --show_images False -sfr $HOME/

Note that I have show_images set to false, but the same error happens. Thank you! :)

Kazadhum · Apr 15 '23

Hi @Kazadhum,

I think the show_images flag is a store_true flag, meaning you add it to set it to true and omit it to leave it false.

Can you post the output of

rosrun atom_evaluation rgb_to_rgb_evaluation -h

miguelriemoliveira · Apr 15 '23

Hello @miguelriemoliveira! Here's the output you asked for:

usage: rgb_to_rgb_evaluation [-h] -train_json TRAIN_JSON_FILE -test_json TEST_JSON_FILE -ss
                             SENSOR_SOURCE -st SENSOR_TARGET [-si] [-sfr SAVE_FILE_RESULTS]

optional arguments:
  -h, --help            show this help message and exit
  -train_json TRAIN_JSON_FILE, --train_json_file TRAIN_JSON_FILE
                        Json file containing train input dataset.
  -test_json TEST_JSON_FILE, --test_json_file TEST_JSON_FILE
                        Json file containing test input dataset.
  -ss SENSOR_SOURCE, --sensor_source SENSOR_SOURCE
                        Source transformation sensor.
  -st SENSOR_TARGET, --sensor_target SENSOR_TARGET
                        Target transformation sensor.
  -si, --show_images    If true the script shows images.
  -sfr SAVE_FILE_RESULTS, --save_file_results SAVE_FILE_RESULTS
                        Output folder to where the results will be stored.

Kazadhum · Apr 15 '23

Thanks,

So you see the [-si]? It does not have a value in capitals after it.

That means if you want images you use

rosrun atom_evaluation rgb_to_rgb_evaluation  ... --show_images 

and if you do not, you run

rosrun atom_evaluation rgb_to_rgb_evaluation  ... 

miguelriemoliveira · Apr 15 '23

Hi @miguelriemoliveira! Thank you for the reply. The weird thing is, even without the [-si] option, it still returns the same error message and I don't really know why. @MisterOwlPT, have you had any similar experiences with GitHub Actions?

EDIT: I've found this Stack Overflow entry: https://stackoverflow.com/questions/63125480/running-a-gui-application-on-a-ci-service-without-x11. I'm trying to see if I can use this action (https://github.com/coactions/setup-xvfb) to get through this part.

Kazadhum · Apr 17 '23

Let me do some experiments and get back to you ...

miguelriemoliveira · Apr 17 '23

Hi @Kazadhum,

I looked into the script and, without the -si flag, it should not require a running X server. Perhaps it is a problem with the Docker/Rigel stuff...

miguelriemoliveira · Apr 17 '23

Hello! Thank you for running those tests; as you said, the problem is most likely on either Rigel's or Docker's side. I've since found someone who faced a similar problem with GitLab CI and solved it using Xvfb (https://forum.gitlab.com/t/run-things-that-need-a-glx-x-server-on-gitlab-ci/47440), so I'll try to reproduce their solution.

I do wonder, though, @rarrais and @MisterOwlPT: when working with AWS and running the simulations online, this wouldn't be a problem, right?

Kazadhum · Apr 18 '23

Hello @Kazadhum,

I've been doing some tests and I think I found a solution to your problem. I took your CI/CD workflow and executed every step manually inside an empty AWS EC2 instance. I cloned your fork of Atom and installed all dependencies as per the main.yml file inside the .github/workflows folder (i.e., rigel, your plugin, system dependencies, ...). I copied the file rigelfiles/Rigelfile_1 to the root of the repository and renamed it to Rigelfile.

Executing the command rigel run sequence test and then docker logs -f calibration_evaluation I was able to replicate the error.

Solution:

  • I locally altered your image (dvieira2001/atom:latest) and installed xvfb-run (apt install xvfb) inside it. It allows you to run graphical applications without a "real" display (it creates one in memory);
  • I altered the Docker execution command in the Rigelfile to ["/bin/bash", "-c", "xvfb-run rosrun atom_calibration calibrate ... && xvfb-run rosrun atom_evaluation rgb_to_rgb_evaluation ..."]. Note that xvfb-run was added before every sub-command. This ensures everything can communicate with the virtual X server;
  • I used X server :99 (export DISPLAY=:99), which is the default display number xvfb-run uses.

This way everything worked out perfectly.

I saw that you tried to use xvfb-run without success before. Can you try one more time with these steps? Maybe you missed something. Don't forget to update the image first! Consider adding the dependency to the Rigelfile and deploying the image to the registry.

NOTE: since everything "graphical" is handled inside the container, I found it unnecessary to map /tmp/.X11-unix -> /tmp/.X11-unix.

Let me know if this message was useful and if your problem was solved 😃

PS: I found a typo in the field command of the simulation_and_robot component in the Compose plugin. Looking at the Dockerfile I don't think it is that important but still... you are using bin/bash instead of /bin/bash. This is causing the container to fail.

MisterOwlPT · Apr 26 '23

Hi @MisterOwlPT! Thank you so much for the testing you've done! I've been running some tests with the corrections you made and, even though it's still not working properly, it is now failing for a different reason!

So, before, nothing ran in the calibration_evaluation container. Now the calibration procedure does run; the evaluation process, however, does not. I suspect it might be because of the command syntax, so I'll try some alternatives. But it seems the problem with the X server is indeed solved! I had tried to use xvfb before, but I hadn't installed it inside the container, so I assume that was the core issue.

Kazadhum · Apr 27 '23

Hello @miguelriemoliveira, @rarrais and @MisterOwlPT! I can confirm it works! Here's the log of the successful CI workflow run.

The problem:

Turns out that, for some reason, it wasn't running the last command. So, when I had: command: ["/bin/bash", "-c", "xvfb-run rosrun atom_calibration calibrate -json $ATOM_DATASETS/t2rgb/dataset.json -v && xvfb-run rosrun atom_evaluation rgb_to_rgb_evaluation -train_json $ATOM_DATASETS/t2rgb/atom_calibration.json -test_json $ATOM_DATASETS/t2rgb/dataset.json --sensor_source right_camera --sensor_target left_camera -sfr $HOME/"]

in the Rigelfile, only xvfb-run rosrun atom_calibration calibrate -json $ATOM_DATASETS/t2rgb/dataset.json was being run.

The solution:

I replaced the command above with the following:

command: ["/bin/bash", "-c", "xvfb-run rosrun atom_calibration calibrate -json $ATOM_DATASETS/t2rgb/dataset.json -v && xvfb-run rosrun atom_evaluation rgb_to_rgb_evaluation -train_json $ATOM_DATASETS/t2rgb/atom_calibration.json -test_json $ATOM_DATASETS/t2rgb/dataset.json --sensor_source right_camera --sensor_target left_camera -sfr $HOME/ && cd $HOME/ && ls"]

This way, only the ls command is not run. Even though I have no clue as to why it works like this, it apparently does :smile:
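(In case it helps later: an alternative I might try is wrapping the whole chain in a single xvfb-run call, so every sub-command shares one virtual display and a failing step stops the chain with a clear exit status. This is only a sketch, not what is currently in the Rigelfile.)

  # one xvfb-run for the whole chain; bash -e aborts on the first failing command
  xvfb-run -a bash -ec '
    rosrun atom_calibration calibrate -json "$ATOM_DATASETS/t2rgb/dataset.json" -v
    rosrun atom_evaluation rgb_to_rgb_evaluation \
      -train_json "$ATOM_DATASETS/t2rgb/atom_calibration.json" \
      -test_json "$ATOM_DATASETS/t2rgb/dataset.json" \
      --sensor_source right_camera --sensor_target left_camera \
      -sfr "$HOME/"
  '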

Kazadhum · Apr 27 '23

Hi @miguelriemoliveira and @rarrais!

Happy to report that code coverage using Codacy is set up! I wrote a couple of unit tests for the naming.py module in the atom_core package. Then, I used the coverage Python package to produce a coverage report.

By doing this inside a CI/CD workflow and using the codacy-coverage-reporter action on GitHub (a Codacy API token stored as a secret is needed for this), the code coverage report is shown in Codacy and updated every time a push event occurs.

Here's the workflow file I used and the log for the respective Actions job.
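(Roughly, the coverage part of the workflow boils down to the commands below; paths are illustrative and the upload itself is handled by the codacy-coverage-reporter action mentioned above.)

  pip install coverage pytest
  coverage run -m pytest                # run the unit tests (test_naming.py) under coverage
  coverage xml -o coverage.xml          # XML report consumed by the Codacy reporter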

As you can see, Codacy has identified a number of issues in its static code analysis that go against standard coding practices: [image]

An example of the type of issue it encountered: [image]

When it comes to code testing coverage, only two files are counted as being covered: the naming.py module and the test_naming.py file containing the unit tests.

[image]

It doesn't really make much sense to write a lot more tests for ATOM itself, but this is a positive contribution to the CI/CD pipeline, which can be reused for other ROS applications, especially if they are developed with testing in mind (perhaps using Test-Driven Development methods).

Kazadhum · May 05 '23

Hi @Kazadhum,

Nice progress, and good results for the thesis. I would say that testing is sufficient; however, I'm concerned about the image you posted - surely not 100% of the ATOM source code is covered by the tests you wrote. Could you please try to update that figure to a realistic percentage? It might be necessary to adjust the set of files taken into account when computing the coverage percentage, so that files are included in the analysis even if they are not exercised by unit tests.

Another suggestion, which you might want to come back to in the future if time allows, is to go over the style/security indications provided by Codacy and actually propose changes in a pull request that would fix those existing issues. I believe the comparison between before and after the intervention would be a good outcome of your work.

rarrais · May 05 '23

Hey @rarrais!

These results will complement my thesis nicely, I agree.

About the code coverage: I've tried accounting for all Python scripts using Coverage.py, but I have been unsuccessful. Weirdly, the report produced seems to only account for the lines of code in scripts already exercised by tests (and the test scripts themselves).
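(The next thing I plan to try is the --source option, which, as far as I understand, makes Coverage.py report files under the given packages even when no test imports them; the package names below are just the obvious candidates.)

  coverage run --source=atom_core,atom_calibration -m pytest
  coverage report -m    # files never imported by the tests should show up at 0%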

I also think that the "before and after" comparison of fixing these issues would be a good addition, and I'll definitely come back to it soon!

Kazadhum · May 05 '23