fix #4948 part of #3397
Fix for #4948 that is part of #3397
A platformcontroller command is added as a subcommand of kamel; it runs an operator like the operator subcommand does, but platformcontroller only handles IntegrationPlatform CRs. All the other CRDs are handled by the operator as before.
These installation methods have been amended accordingly:
- install command
- OLM bundle
- Helm charts
E2E tests have been run locally as well.
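For orientation, here is a minimal sketch of how such a split could be wired at the CLI level. The command names `operator` and `platformcontroller` come from the description above; everything else (`runOperator`, `runPlatformController`, the package layout) is purely illustrative and not the PR's actual code:

```go
package cmd

import (
	"context"

	"github.com/spf13/cobra"
)

// Hypothetical helpers standing in for the real start-up code: each one would
// build and start a controller-runtime manager for its own set of controllers.
func runOperator(ctx context.Context) error           { return nil }
func runPlatformController(ctx context.Context) error { return nil }

// addControllerCommands wires both run modes as kamel subcommands.
func addControllerCommands(root *cobra.Command) {
	root.AddCommand(&cobra.Command{
		Use:   "operator",
		Short: "Run the Camel K operator (all the other CRDs, as before)",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runOperator(cmd.Context())
		},
	})
	root.AddCommand(&cobra.Command{
		Use:   "platformcontroller",
		Short: "Run the controller that handles only IntegrationPlatform CRs",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runPlatformController(cmd.Context())
		},
	})
}
```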
Release Note
NONE
What about these tests:
❌ TestMetrics (3m0.27s) ❌ TestMetrics/reconciliation_duration_metric (10ms)
They seem flaky to me: on the first run they passed, they pass locally, but now they are failing in CI :/
They are also marked as "problematic", but CI runs without the skip-problematic config? :/
No, I don't think they are flaky. It fails at line:
Expect(platformReconciled).NotTo(BeNil())
I suspect that the operator performs and provides metrics which we may have lost by running the reconciliation loop outside the main process.
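For context, a simplified illustration of why the metric could disappear from the operator's endpoint (metric and package names here are illustrative, not Camel K's actual ones): controller-runtime keeps metrics in a per-process registry, so reconciliation-duration samples are only exported by the process that actually runs that reconciliation loop.

```go
package metricsexample

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Illustrative histogram, not Camel K's real metric name.
var reconciliationDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "example_reconciliation_duration_seconds",
		Help: "Time spent in a reconciliation loop",
	},
	[]string{"kind"},
)

func init() {
	// metrics.Registry is local to the process: only this process's /metrics
	// endpoint exposes what is observed here. If the IntegrationPlatform loop
	// runs in the platformcontroller Pod, its samples show up there and no
	// longer on the operator Pod that the e2e test scrapes.
	metrics.Registry.MustRegister(reconciliationDuration)
}
```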
:warning: Unit test coverage report - coverage decreased from 35.8% to 35.7% (-0.1%)
@valdar @rinaldodev mind rebasing?
LGTM
The main concern I have is about code duplication (maintainability) and performance loss. About the first, I think it would be a matter of abstracting the code from the operator and being able to call it from different places.
I have abstracted as much code as I could into the cmd/controller/manager.go interface and its implementations. Hope it is a suitable solution.
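As a rough illustration of that kind of abstraction (hypothetical names; the actual interface lives in cmd/controller/manager.go in this PR and certainly differs): each command supplies the controllers it owns while the shared code drives the manager lifecycle.

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// ControllerSet is what differs between the operator and the platform
// controller: which controllers get registered on the shared manager.
type ControllerSet interface {
	AddControllers(mgr ctrl.Manager) error
}

// Run is the lifecycle code both commands can share: build the
// controller-runtime manager, register the caller's controllers, start it.
func Run(ctx context.Context, set ControllerSet, opts ctrl.Options) error {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), opts)
	if err != nil {
		return err
	}
	if err := set.AddControllers(mgr); err != nil {
		return err
	}
	return mgr.Start(ctx)
}
```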
About performance, I guess it's a bit more complicated. IIUC, so far we have two separate Pods with two separate caches. I honestly don't think the platform controller has to watch any resource at all; it only owns the IntegrationPlatform resource. If it needs to watch all those resources for any reason, then we should understand whether we can share the cache somehow.
All resources that don't need to be watched have been removed from the cache.
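A minimal sketch of how that scoping works in controller-runtime (import path approximate, the reconciler is a stub, not the PR's code): the cache only starts informers for the types a controller asks for, so a process whose only controller is `For(&IntegrationPlatform{})` never watches Integrations, Kits, Builds, and so on.

```go
package platform

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"

	camelv1 "github.com/apache/camel-k/v2/pkg/apis/camel/v1" // approximate import path
)

// Stub reconciler; the real one reconciles IntegrationPlatform resources.
type platformReconciler struct{}

func (r *platformReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// setup registers the only controller this process runs. Because nothing else
// is registered, the manager's shared cache ends up watching IntegrationPlatform
// and nothing more (the camel.apache.org/v1 types must be in the manager's scheme).
func setup(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&camelv1.IntegrationPlatform{}).
		Complete(&platformReconciler{})
}
```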
Additionally, I wonder whether, instead of having a new deployment, we could simply have a sidecar container take care of controlling this part.
@rinaldodev is already working on follow-ups of this PR that require the platform controller to be a separate operator, in fact the one that will bootstrap the Camel K operators. So I think it is not worth exploring a sidecar option for the relatively short time needed to complete that chunk of work.
@valdar I don't have enough context to review this one, sorry. My immediate impression is that, from a community perspective, it would be good to have a design diagram and some user documentation to illustrate how this new model works, above all what it implies in terms of resources, monitoring, and in general any operational aspect required as we move to an approach with multiple Deployments. Also, some explanation of why this work is beneficial may help it be welcomed by users who may not agree with this new approach and with such a change landing in a minor release.
I've enabled the Quarkus Native tests. However, it's very likely they will fail due to an infrastructure problem that we have not yet tackled. In that case, please run the Quarkus Native tests manually in your local environment and report the result here.
:warning: Unit test coverage report - coverage decreased from 37.2% to 37.1% (-0.1%)
@christophd @squakez I guess this is now in a good shape to be merged
I haven't received any feedback about my last 2 comments. We need to run the native checks manually, and it would be nice to have some documentation or explanation, which the final users will surely require when moving to a new deployment model. Thanks.
Also, several tests have failed and they don't seem to be the usual flaky ones. I've restarted them.
The native tests seem to be failing constantly on OSX; can't we skip testing native images on OSX? Given that not everyone has OSX, I wonder how one could actually run them locally.
They can run on any OS. The macOS issue is only on GitHub Actions. We are running them on our local machines while we don't have a fix for the GH infra.
Can we maybe disable the native OSX tests on CI till we have a fix?
@lburgazzoli the problem is not the failure of the checks per se. As we've done in other PRs, everything that may affect the Quarkus native checks has to be validated manually in order to make sure we don't introduce any regression. Please have a look at this PR for instance: https://github.com/apache/camel-k/pull/5234#issuecomment-1985493127
If the folks have any problem running the native tests, you or I can run them and publish the results here, although, as suggested to other contributors, it would be good for them to be able to do it on their own, as it should be a normal practice when doing development. Running them locally is as easy as running any other integration test [1]; it just takes a bit more time.
Besides, the normal tests have failed and I've restarted them, so we need them to finish and be green.
Besides, as mentioned in [2], my personal feeling is that this kind of change requires some documentation and a rationale, as users moving from 2.2 to 2.3 will be surprised by a multi-Deployment setup that may require them to adjust monitoring and other operational aspects. To me this is a breaking change, but if you want to proceed, at least it has to be clearly explained to the final users.
[1] https://camel.apache.org/camel-k/2.2.x/contributing/e2e.html#testing-e2e-structure [2] https://github.com/apache/camel-k/pull/5119#issuecomment-1976919321
I'm not complaining about having to run them locally; I'm just noticing that the OSX build is almost always failing, so I wonder what the value of such a test is. Now I can run the tests locally on my Linux box and report them as OK, but what I've tested is not what has failed, so people may spend time waiting and trying to figure out why things are not working on OSX when it is actually an environment issue.
I agree that we need some docs, but since this PR is part of a larger issue and the rationale was briefly explained in https://github.com/apache/camel-k/issues/3397#issuecomment-1735525398, I wonder if the doc could be something we do as a separate PR.
In theory we want to release 2.3 by the end of the month. If we end up merging this we'd include it in 2.3 which, IIUC, would not be complete either, as pieces are still missing (besides documentation). My suggestion would be to wait for the 2.3 release and merge right after, so you can focus on the rest of the development and be ready for 2.4. Or, if you want to push this into 2.3, complete it with the required documentation.
I already commented on this aspect some time ago in https://github.com/apache/camel-k/pull/5119#issuecomment-1983848486, apologies if it was not clear enough. I'm taking on the task of notifying any PR that may affect the Quarkus native checks. Unfortunately, until they are fixed, the best way to handle this is to run them manually in contributors' local environments.
By the way, just to make this clear, I'm doing the same for my own PRs and I'll have to follow this process at the time of cutting the release.
I think this is more for 3.x than 2.x; I wonder if we should branch 2.x now and move main toward 3.x.
No problem for me. Mind that there may be some work to perform on the actions as well (nightly upgrades, releases, etc.), and likely this has to be communicated to and accepted by the community beforehand.
@valdar please rebase against main again. The Quarkus native checks have been fixed, so you won't need to run them manually any longer. Thanks.
:warning: Unit test coverage report - coverage decreased from 37.3% to 37.2% (-0.1%)
:warning: Unit test coverage report - coverage decreased from 37.8% to 37.7% (-0.1%)
:warning: Unit test coverage report - coverage decreased from 38% to 37.9% (-0.1%)