Current Rakudo (possibly MoarVM as well) development process hinders releasing
Here I will describe a couple of situations that have happened during the last months to show the particular flaw of the current development process that this issue brings up. I have zero intention to bash our volunteer developers, so take this as criticism of the development process / culture as a whole. I am happy people are willing to spend their time and effort on making the Raku implementation better, but I am sure it can be done in a safer, less painful and thus more enjoyable way for everyone.
- Feb 13, this commit to MoarVM bumped the version of the dyncall library shipped with MoarVM. The new version contained a critical bug which led to build failures when linking against musl, a glibc alternative notably used by the Alpine distro that is popular for CI containers. It made it into a release, which led to reaching out to developers, waiting for a patch and re-taking the release, because there were no checks for something we expect to support that could have detected it.
- Mar 15, this PR to MoarVM was merged. Its JIT changes contained a critical bug observable on Windows. Later, a revision with this bug was brought into Rakudo and the master branch CI checks started to fail. They kept failing until a fix was provided on Apr 26, which makes 41 days straight that the master branch was broken. This (along with the dispatch situation, which was solved by a revert) blocked the release completely, because the bump itself was not checked for green lights, and the failures were then ignored until it was very hard to say where the issue could be.
- Apr 3, this commit introduced a regression in relocatable builds, which went unnoticed and was released; on May 8 a solution was provided, resulting in a release re-take, because we had no checks for something we expect to support that could have detected it.
- Jun 5, this PR resolved a long-standing Windows issue. All checks failed, possibly because our setup does not support changes which need simultaneous PRs in MoarVM/nqp/rakudo. After the merge, Windows builds of master started to fail, and again they kept failing for 14 days until someone provided a two-line patch, after being asked for it for a couple of days before the release.
- Jun 6, this commit used declarations not compatible with older gcc. After a bump, this commit started offending our Circle CI check (since around here) and nobody was bothered by the check failing. When I tried to look into it, the website redirected me to a sign-in page. Seeing this unusual behavior (normally you don't need to do anything to view public CI logs; maybe it wants one to sign in and enable them?), and assuming we had migrated to Azure and the check would be dropped soon anyway, we went into the release. Now master has been failing for 15 days straight and counting. A single person's (the release manager's) misunderstanding, together with the common habit of ignoring failures on master, has led us to yet another point release.
As it was stated on IRC, there are Expectations from our releases. As was said another time, Raku is not at the "is it vaporware?" stage anymore: it has come a long way from a project where a bunch of folks were committing code to something used in production, and people just "expect" us to support different platforms (even when there are no checks for them).
If we want to ensure our releases meet expectations, the current development process, which is, as shown above, prone to creating problematic situations, must be addressed.
Possible solutions
There are not so many solutions I can suggest, but I have one.
In the described problems, there are two sources of evil: 1) there is no check for some case we "suppose" we support; 2) a check shows red and is ignored.
To address these, we need to fix both 1 and 2 by changing the current development culture, including:
- Development migrates to a PR-first mode instead of committing to master. The master branch is protected from a PR merge if the CI checks are not green. Want to change something -> PR -> checks green ?? Review and merge !! Re-take. (See the sketch after this list.)
- A check failure on the master branch is considered an extreme situation and we don't move forward until it is resolved.
- Do not rely on "We assume we support some rarer-than-usual platforms and try not to break them, but there are no real checks around" anymore. Establish a complete list of platforms and tools we Officially Meet Expectations for, and add a clear CI check for every missing point of this list.
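To make this concrete, here is roughly what the protection part could look like via the GitHub REST API. This is a sketch only: the required check name ("Azure Pipelines") is a placeholder that would have to match whatever our CI actually reports, and the exact settings are of course up for discussion.

    # hypothetical: protect rakudo's master so PRs can only be merged on green CI
    curl -X PUT \
      -H "Authorization: token $GITHUB_TOKEN" \
      -H "Accept: application/vnd.github+json" \
      https://api.github.com/repos/rakudo/rakudo/branches/master/protection \
      -d '{
        "required_status_checks": { "strict": true, "contexts": ["Azure Pipelines"] },
        "enforce_admins": true,
        "required_pull_request_reviews": { "required_approving_review_count": 1 },
        "restrictions": null
      }'

The same can be done by hand in the repository settings; the point is only that "checks green before merge" is enforced by the platform rather than by convention.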
I know this may bring some disapproval, saying that such restrictions are no fun for developers anymore. But I am sure it is certainly not fun for developers to debug issues introduced 40 days ago, it is not fun for people to have trouble packaging new releases and doing other wiring, and it is not fun for us all to spend more time on fixing the consequences rather than spending less time on keeping our master branch healthy and preventing the consequences in the first place.
On 22 Jun 2020, at 10:25, Altai-man [email protected] wrote:
Possible solutions
There are not so many solutions I can suggest, but I have one.
Change current development culture, including:
• Development migrates into PR-first mode instead of committing to master. Master branch is protected from a PR merge if the CI checks are not green. Want to change something -> PR -> checks green ?? Review and merge !! Re-take.
• A check failure on master branch is considered as an extreme situation and we don't move forward until it is resolved.
• Do not rely on "We assume we support some rarer than usual platforms and try not to break them, but there are no real checks around" anymore. Establish a complete list of platforms and tools we Officially Meet Expectations for and add a clear CI check for every missing point of this list.
I know it may bring in some disapproval, saying that such restrictions are not fun for developers anymore, but I am sure that it is certainly not fun for developers to debug issues introduced 40 days ago and it is not fun for people to have troubles with packaging new releases and doing other wiring and it is not fun for us all to spend more time on fixing the consequences rather than spending less time on keeping our master branch healthy preventing the consequences in the first place.
I could live with that, provided that the CI is actually reliable. So far, I have seen waay more false positives from CI than I have seen false negatives. It's the false positives (when CI says there's something wrong, and it's the CI that is wrong) that are impeding development.
See https://github.com/rakudo/rakudo/issues/3700#issuecomment-629869968.
Once finished, it should help with these issues: Mar 15, Jun 5, and maybe Jun 6, because the CI status should become more helpful.
Do not rely on "We assume we support some rarer than usual platforms and try not to break them, but there are no real checks around" anymore. Establish a complete list of platforms and tools we Officially Meet Expectations for and add a clear CI check for every missing point of this list.
I agree with this, but technically it also means running Blin on all of these platforms. Did you know we support mipsel? We should definitely strive towards more platforms being tested, but it's probably not possible to achieve perfection here.
We should definitely strive towards more platforms being tested, but it's probably not possible to achieve perfection here.
Establish a complete list of platforms and tools we Officially Meet Expectations for and add a clear CI check for every missing point of this list.
It's relatively easy with RakuDist. It has a pluggable design in mind, where spinning up a new docker image and adding it to the system is not a big deal. So far it runs tests for:
- alpine
- debian
- centos
- ubuntu
We could add more images to the list (including old CentOS and any exotic distro that docker supports; Sparrow is super flexible as well to deal with such a variety).
The question is what kind of test we want, to ensure that another commit to MoarVM/rakudo does not break stuff. If one tells me what kind of test we need here, I could start implementing this on the RakuDist side.
The current test workflow is (sketched below):
- download Rakudo from whateverable
- run zef install for a certain module
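Very schematically, that looks something like this (the build URL and the module name are placeholders; the real RakuDist code does more, this is just the shape of it):

    set -e
    # fetch a prebuilt rakudo for the commit under test (placeholder URL)
    curl -fsSL "$WHATEVERABLE_BUILD_URL" -o rakudo.tar.gz
    mkdir -p rakudo && tar -xzf rakudo.tar.gz -C rakudo --strip-components=1
    export PATH="$PWD/rakudo/bin:$PATH"
    raku --version    # sanity check that the runtime starts at all (or perl6, depending on the build)
    # bootstrap zef from its repo and install the module; zef runs the module's tests during install
    git clone --depth 1 https://github.com/ugexe/zef.git
    raku -I zef zef/bin/zef install Some::Module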
But yes, we can support more testing scenarios, including building Rakudo/MoarVM from source, whatever ...
Further thoughts. I am thinking about pottage.raku.org - a service for Rakudo/MoarVM smoke testing for various Linuxes, so that for every commit we:
- build moarvm
- build rakudo
- run lightweight tests to ensure OS compatibility
Every test should run for every OS/distro in the list and should not take more than, say, 5-10 minutes, so we could catch architecture/platform-dependent bugs as soon as possible and report any problems as quickly as possible.
I have all the components in place (UI/job runner: Sparky; backend and CM tool: docker + Sparrow), similar to RakuDist, so there is a good base to start with ... A rough sketch of one such per-distro job is below.
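Something along these lines (the image name is a parameter, e.g. alpine:latest, centos:7, debian:buster; the image is assumed to already have git, perl, a C compiler and make, and the flags/paths are from memory):

    docker run --rm "$IMAGE" sh -c '
      set -e
      git clone --depth 1 https://github.com/rakudo/rakudo.git
      cd rakudo
      perl Configure.pl --gen-moar --backends=moar
      make && make install
      ./install/bin/raku --version    # lightweight check instead of a full spectest
    '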
Further thoughts. I am thinking about pottage.raku.org - a service for Rakudo/MoarVM smoke testing for various Linuxes, so that for every commit we:
* build moarvm
* build rakudo
* run lightweight tests to ensure OS compatibility
Isn't that exactly what we do with the current CI setups?
@AlexDaniel I don't know, maybe. The idea is to do it for as many OSes/distros/environments as possible. Is that the case now?
AFAIK the overhaul of the build pipelines using Azure CI by @patrickbkr does cover the build of moarvm and rakudo and some testing.
I would like to extend it to automate the Star release in the future too.
We don't need more CI tests. We don't need more target platforms.
What we need is reliable CI tests. They get ignored because most of the time spent looking at those results is wasted on yet another false positive. We've had Travis reporting to the IRC channel and we'd jump whenever it reported a failure. But most of them were because of Travis not being able to clone a repository or other benign issues. So someone wrote an IRC bot to tell us whether a failure was a false positive or not, but that vanished, too.
What we also need is more cooperation on existing CI infrastructure. We've had Travis for Linux and OS X and AppVeyor for Windows. Someone was dissatisfied because of some issues and we got CircleCI. So we've had Travis, AppVeyor and CircleCI, all with their issues. I've added https://build.opensuse.org/project/show/home:niner9:rakudo-git because I wanted coverage for issues that would appear only in packaged versions and also coverage of important modules. This regularly reports failures like t/nqp/111-spawnprocasync.t (Wstat: 6 Tests: 4 Failed: 0) and got ignored completely, and instead we got this Azure Pipelines thing.
Really, please stop adding additional systems.
Instead, just make reports worth getting looked at, then make it so we don't have to check 5 different websites with different user interfaces to get at the results, and then start looking at the reports, point at broken commits and create reduced test cases.
@lizmat
So far, I have seen waay more false positives from CI than I have seen false negatives. It's the false positives (when CI says there's something wrong, and it's the CI that is wrong)
Then we have to eliminate false positives instead of ignoring the checks. I mean, false positives did happen, but somehow other communities do use CI, likely preventing problems such as blocked releases and master being broken for weeks, as we have had. There are flappers in roast, but nobody pushes code without running it, saying "there are flappers, so I won't" (or so I hope). :)
@AlexDaniel
I agree with this, but technically it also means running Blin on all of these platforms
Still better than suddenly breaking someone's code in the wild because we don't check. Working on a language as we do is full of Technical Difficulties anyway; this is one of them, I guess.
Did you know we support mipsel?
I had no idea, and this is precisely the problem. Tomorrow someone's code on mipsel will break with the next release and we will go "Hmm, well, it is not stated anywhere, but I guess we kind of support that, let's patch and release again". I am not talking about perfection and I am sure the suggested scheme won't eliminate point releases. However, I don't see how catching issues earlier can be seen as something wrong.
Even a wiki page stating explicitly "We support this, that, this and that" will help tremendously.
@rba
AFAIK the overhaul of the build pipelines using Azure CI by @patrickbkr does cover the build of moarvm and rakudo and some testing
Yes, this is an awesome piece of work, because we got testing for the JVM and eventually relocatability too. It's just that older gcc was not on the plate, which is hopefully fixable. Then we can have one system to rule them all, and given rakudo is reliable enough to not torture us with races, we would have a great tool in our toolbox.
@niner
Really, please stop adding additional systems.
We won't (I hope). Moreover, the migration to Azure eliminated the usage of Travis and AppVeyor quite successfully, making things easier.
Instead, just make reports worth getting looked at then make it so we don't have to check 5 different websites with different user interfaces to get at the results and then start looking at the reports, point at broken commits and create reduced test cases.
Yes. The intention here is to 1) make CI worth being respected; 2) make people see its merits.
Saying "Current CI is bad, so one wouldn't want to use it" is odd compared to "Current CI is bad, so we should improve it and use it".
I agree with this, but technically it also means running Blin on all of these platforms
Do we run Blin for ALL modules in the ecosystem? So it takes hours for one run, does it?
My idea is to have lightweight smoke tests run (with an average run time of no more than 5-10 minutes) for all supported platforms on every commit. Is that the case now for any of the mentioned CIs (Azure DevOps/Circle/Travis)?
TL;DR: 1) I don't want to cover everything in the world, adding platforms and tools. I want us to clarify what we currently support and what we don't, and make our current tools check whether the release is worthy using this checklist. 2) I don't want to spend developers' precious time more than required, so our CI should be healthy and it should be the means of avoiding breakage of master (which is not so uncommon right now, as shown in the examples).
Flappers are acceptable red flags for me. Stupid things like connectivity issues breaking builds are not. :-) And I've looked at way too many of those.
@Altai-man I understand all that, and I agree with all you've said. But I still need some clarification here (from you or from others). For example, you said "a critical bug which led to build failures when linking against musl, a glibc alternative notably used by the Alpine distro that is popular for CI containers".
So, do we have a test that checks source code compatibility with Alpine? And so on and so forth (you can think of some other examples, say some CentOS distros that we claim to support).
I've found the set of OSes supported in the Azure Pipelines CI for the Rakudo build:
https://github.com/rakudo/rakudo/blob/master/azure-pipelines.yml#L41-L97
I don't see Alpine/CentOS/Debian here.
The same for moarvm - https://github.com/MoarVM/MoarVM/blob/master/azure-pipelines.yml
cc @patrickbkr
I am not picking holes 😄, it probably works fine for the purpose of testing the moar backend / rakudo in general. But it probably does not cover some of the OS-dependent issues mentioned here ...
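For the Feb 13 musl/dyncall case specifically, even a tiny MoarVM-only build check in an Alpine container would have caught it. Something like the following sketch (package names are from memory, so treat it as an illustration rather than a ready job definition):

    docker run --rm alpine:latest sh -c '
      set -e
      apk add --no-cache build-base linux-headers perl git
      git clone --depth 1 https://github.com/MoarVM/MoarVM.git
      cd MoarVM
      perl Configure.pl && make -j2   # fails at link time if the musl issue reappears
    '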
To start the discussion of what platforms we want to have automated tests for, I have put together a list. This is open for discussion.
- x86_64 Windows 10 (actually Microsoft Windows Server 2019 Datacenter, link)
- [x] Java 11
- [x] Visual Studio 2019
- x86_64 Ubuntu 18.04 (link)
- [x] GCC 9
- [x] Clang 9
- x86_64 MacOS 10.15 (link)
- [x] GCC 9
- [x] Java 11
- x86_64 Docker CentOS 6
- [ ] GCC 4
- x86_64 Docker Alpine latest
- [ ] GCC latest (currently 9)
- [ ] Some big endian system
- The platform we actually want to support here is IBM System z. No chance we get our hands on one of those.
- SPARCs are really cheap on ebay (e.g. Sparc T3 CPU 16 Core 1,65Ghz 32GB RAM - 235 €)
- I'd hope we are able to get Debian on this. People are familiar with Debian in contrast to say Solaris. Debian has an unofficial SPARC64 port.
- Getting the AzureCI runner working on one of those systems might be challenging.
- [ ] Some ARM system
- I propose a RasPi 4 with 4GB RAM, Debian armhf 32bits - such builds will run on Raspbian
- We could also add another RasPi with arm64.
- RasPi is an officially supported platform for the AzureCI runner.
Open questions with the above list:
- Did I miss an environment we want to support?
- Do we really want to start setting up our own hardware for testing? Who is going to host / pay the rack space and electricity? @rba ?
- Who would pay for the hardware?
- Which test would we want to run? Currently such platformy tests (apart from OS) are only present in the MoarVM CI setup. Thus only NQP tests are run. Will this suffice to make sure our stack works on these OSes? Or do we want a full rakudo test?
MacOS 11 with ARM processor as soon as there is one available?
Also I do agree with niner and lizmat that our biggest problem with CI currently is reliability. We need to have a stable CI that people are willing to not ignore.
In that regard I'd like to focus on Azure and get rid of the others. Currently Azure isn't fully reliable (see this comment). I hope we'll manage to iron these failures out. - Soonish.
- [ ] Some big endian system
- The platform we actually want to support here is IBM System z. No chance we get our hands on one of those.
That is simply not true. As I've pointed out repeatedly, the Open Build Service supports a long list of platforms out of the box on its 12000-machine-strong build farm. This includes openSUSE Factory zSystems. It was literally 3 mouse clicks to activate the build on System z, and if you're interested in the results, they are right here:
https://build.opensuse.org/package/live_build_log/home:niner9:rakudo-git/moarvm/openSUSE_Factory_zSystems/s390x
Why this gets ignored is completely beyond me.
To make it absolutely crystal clear, this is the full list of currently available build targets of the Open Build Service:
- openSUSE: Tumbleweed, Leap 15.2, Leap 15.1, Leap 15.1 ARM, Leap 15.1 PowerPC, Factory ARM, Factory PowerPC, Factory zSystems
- openSUSE Backports: for SLE 15 SP1, SLE 15, SLE 12 SP5, SLE 12 SP4, SLE 12 SP3, SLE 12 SP2, SLE 12 SP1, SLE 12 SP0
- SUSE: SLE-15-SP1, SLE-15, SLE-12-SP5, SLE-12-SP4, SLE-12-SP3, SLE-12-SP2, SLE-12-SP1, SLE-12, SLE-11 SP4, SLE-10
- Arch: Extra, Community
- Raspbian: 10, 9.0
- Debian: Unstable, Testing, 10, 9.0, 8.0, 7.0
- Fedora: Rawhide (unstable), 32, 31, 30, 29
- ScientificLinux: 7, 6
- RedHat: RHEL-7, RHEL-6, RHEL-5
- CentOS: CentOS-8-Stream, CentOS-8, CentOS-7, CentOS-6
- Ubuntu: 20.04, 19.10, 19.04, 18.04, 16.04, 14.04
- Univention UCS: 4.4, 4.3, 4.2, 4.1, 4.0, 3.2
- Mageia: Cauldron (unstable), 7, 6
- IBM PowerKVM 3.1
- AppImage
- KIWI image build (to be used for appliance and product builds with kiwi)
@niner From my understanding OBS is a build service and not a CI service. Did I misunderstand? Is it viable to try to use OBS as a CI?
On Thursday, 25 June 2020 13:47:24 CEST, Patrick Böker wrote:
@niner From my understanding OBS is a build service and not a CI service. Did I misunderstand?
Yes
Is it viable to try to use OBS as a CI?
Yes. I've explicitly cleared this with the OBS folks at FOSDEM and have been using it as CI service since January.
Judging by the above list, the OBS could be used as a CI and build platform for about everything except MacOS and Windows.
@niner It seems OBS doesn't really market itself as a CI. The user documentation has next to nothing on the topic of using it as such. I suspect one has to bend the system into being a CI a bit. Am I right? Things I didn't find any information about:
- Building PRs
- Reporting results back to GitHub
- Viewing test results ordered by commit in OBS itself
There is a 2013 talk by Ralf Dannert mentioning a Jenkins integration, but information on that is just as sparse.
Edit: I am interested in looking into this more. I'd really appreciate some more information on the topic though.
On Thursday, 25 June 2020 14:55:31 CEST, Patrick Böker wrote:
@niner It seems OBS doesn't really market itself as a CI. The user documentation has next to nothing on the topic of using it as such. I suspect one has to bend the system into being a CI a bit. Am I right?
I'm using a cron job and a modified version of my packaging scripts to push every commit to MoarVM, nqp and rakudo to the OBS for testing. The OBS will then rebuild those 3 and the 21 modules (most importantly Inline::Perl5 and Cro) I'm most interested in.
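The rough shape of that, heavily simplified (project and package names as in my project above; the actual packaging scripts also update spec files, push nqp and rakudo in order, and so on):

    # refresh the OBS package with the latest MoarVM commit; OBS then rebuilds it on all targets
    osc checkout home:niner9:rakudo-git moarvm
    cd home:niner9:rakudo-git/moarvm
    git clone --depth 1 https://github.com/MoarVM/MoarVM.git MoarVM
    tar czf moarvm.tar.gz --exclude-vcs MoarVM && rm -rf MoarVM
    osc addremove
    osc commit -m "Update to latest MoarVM commit"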
There's also something where you can point it at a git repo, but since I already had working scripts, I didn't dig into this:
https://openbuildservice.org/help/manuals/obs-user-guide/cha.obs.best-practices.scm_integration.html
https://openbuildservice.org/help/manuals/obs-user-guide/cha.obs.source_service.html
Advantages I see there are the flexibility of being able to build pretty much whatever I want, including PRs, branches and even patched versions; that the OBS takes care of dependencies, i.e. building stuff in order; and that the build itself doesn't have to do any git operations and doesn't access the network at all. That means that builds simply cannot fail due to some git host not answering.
- Building PRs
- Reporting results back to GitHub
The OBS has both a pretty decent command line client and a REST API:
nine@ns1:~/home:niner9:rakudo-git/moarvm> osc results --vertical
openSUSE_Tumbleweed i586 succeeded
openSUSE_Tumbleweed x86_64 succeeded
openSUSE_Leap_15.2 x86_64 succeeded
openSUSE_Leap_15.1 x86_64 succeeded
openSUSE_Factory_zSystems s390x failed
openSUSE_Factory_PowerPC ppc64 succeeded
openSUSE_Factory_PowerPC ppc64le succeeded
openSUSE_Factory_ARM armv7l succeeded
openSUSE_Factory_ARM aarch64 succeeded
nine@ns1:~/home:niner9:rakudo-git/moarvm> osc results --xml
<resultlist state="b7680636458e1e15dfa277cb5c133ee5">
<result project="home:niner9:rakudo-git" repository="openSUSE_Tumbleweed"
arch="i586" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git" repository="openSUSE_Tumbleweed"
arch="x86_64" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git" repository="openSUSE_Leap_15.2"
arch="x86_64" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git" repository="openSUSE_Leap_15.1"
arch="x86_64" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git"
repository="openSUSE_Factory_zSystems" arch="s390x" code="published"
state="published">
<status package="moarvm" code="failed"/>
</result>
<result project="home:niner9:rakudo-git"
repository="openSUSE_Factory_PowerPC" arch="ppc64" code="published"
state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git"
repository="openSUSE_Factory_PowerPC" arch="ppc64le" code="published"
state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git" repository="openSUSE_Factory_ARM"
arch="armv7l" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
<result project="home:niner9:rakudo-git" repository="openSUSE_Factory_ARM"
arch="aarch64" code="published" state="published">
<status package="moarvm" code="succeeded"/>
</result>
</resultlist>
nine@sphinx:~> lwp-request https://api.opensuse.org/build/home:niner9:rakudo-git/openSUSE_Tumbleweed/x86_64/rakudo/_status
Enter username for Use your SUSE developer account at api.opensuse.org:443:
niner9
Password:
<status package="rakudo" code="succeeded">
<details></details>
</status>
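As for reporting results back to GitHub: nothing like that is wired up yet as far as I know, but the two APIs compose easily. A sketch (repository, context name and credential handling are placeholders):

    # read the OBS build result for one target and mirror it into a GitHub commit status
    STATUS=$(curl -s -u "$OBS_USER:$OBS_PASS" \
      "https://api.opensuse.org/build/home:niner9:rakudo-git/openSUSE_Tumbleweed/x86_64/rakudo/_status" \
      | grep -o 'code="[a-z]*"' | head -1 | cut -d'"' -f2)
    if [ "$STATUS" = succeeded ]; then GH_STATE=success; else GH_STATE=failure; fi
    curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/rakudo/rakudo/statuses/$COMMIT_SHA" \
      -d "{\"state\":\"$GH_STATE\",\"context\":\"OBS openSUSE_Tumbleweed x86_64\"}"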
Please, everyone, take a serious look at OBS. It makes a lot of our crutches obsolete. In fact, even Blin will probably be rendered useless once we have all modules packaged in OBS.
once we have all modules packaged in OBS.
It's not like I am against OBS or other tools, but packaging Raku modules into native packages is out of scope for the issue being discussed here.
@melezhik it's not out of scope, that's just how OBS works. If we create a new rakudo package for every commit, it can trigger a rebuild of all module packages (on all architectures). That's essentially what Blin does, except that OBS can do it for all supported architectures without requiring us to create our own infrastructure for it. It actually sounds a bit too good to be true, but according to @niner we are allowed to do something like this, so let's try it.
When I was looking into how to create rakudo packages, OBS was the first thing I looked at. Huge platform support, backed by a FOSS company, etc. However, I found it very complicated.
This does not mean I think OBS is a bad idea. In fact, I prefer it to Microsoft Azure. I am just stating the importance of documentation and transfer of knowledge, because I suspect that @niner++ is the only expert on the platform.
@AlexDaniel I see what you're saying, and with all respect for what @niner has been doing with OBS, just my thoughts:
If we create a new rakudo package for every commit, it can trigger a rebuild of all module packages (on all architectures).
You don't need this to test Moar/Rakudo; it only makes sense if one is going to support Raku modules for a certain platform.
OBS can do it for all supported architectures without requiring us to create our own infrastructure for it.
Let's be real. There is no tool that automatically generates all platform-specific packages from META specs. Even though there is, AFAIK, progress in that direction with rhel/centos presented by @niner, we should understand that it's way harder than we could expect now; it's even hard to do it for a certain platform, and there are too many bumps in the road we might not be aware of now. Again, do we still need it? If we are going to maintain native packages for different Linuxes, then it makes sense. However, I personally don't want to build a native CentOS package for Rakudo just to test it ... But there is a somewhat-in-the-middle approach I am currently working on, discussed here, which one might be interested in ...
There is no tool that automatically generates all platform-specific packages from META specs
I'll submit PRs to all modules that need native dependencies. No problem.