katib icon indicating copy to clipboard operation
katib copied to clipboard

fix: Resolve errors in e2e tests for cypress in Katib UI

Open vector-flow opened this issue 1 year ago • 26 comments

What this PR does / why we need it: This PR fixed the errors in e2e tests for cypress.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #2381

Checklist:

  • [ ] Docs included if any changes are user facing

vector-flow avatar Jul 15 '24 07:07 vector-flow

This PR relies on a commit in my forked repository: https://github.com/kubeflow/kubeflow/compare/master...tariq-hasan:kubeflow:fix-katib-ui-tests.

The commit needs to be merged into the kubeflow repository and the COMMIT file in this PR needs to be updated with the latest commit tag after the merge.

Also the addition of the process package in components/crud-web-apps/common/frontend/kubeflow-common-lib/package.json and the update to the associated package-lock.json may have repercussions for crud-web-apps. This needs to be assessed further.

vector-flow avatar Jul 15 '24 08:07 vector-flow

The e2e tests involving Cypress have passed: https://github.com/kubeflow/katib/actions/runs/9935880934?pr=2384.

There is a failed test for Katib UI: https://github.com/kubeflow/katib/actions/runs/9935880964/job/27442941694?pr=2384.

This is coming from the npm ci commands in https://github.com/kubeflow/katib/blob/master/cmd/ui/v1beta1/Dockerfile.

I am investigating the issue and will have it fixed.

vector-flow avatar Jul 15 '24 08:07 vector-flow

Thank you so much for investigating this and for this fix @tariq-hasan! Please can you submit PR into kubeflow/kubeflow to add this library so we can take the appropriate commit for Katib repo?

andreyvelich avatar Jul 15 '24 13:07 andreyvelich

Thank you so much for investigating this and for this fix @tariq-hasan! Please can you submit PR into kubeflow/kubeflow to add this library so we can take the appropriate commit for Katib repo?

Sounds good. I'll try to fix the failed Katib UI test so that I can identify if any other code changes are required for kubeflow/kubeflow. I am then creating a PR to merge the code changes in kubeflow/kubeflow and will then update this PR for review.

Failed Katib UI test: https://github.com/kubeflow/katib/actions/runs/9935880964/job/27442941694?pr=2384

vector-flow avatar Jul 15 '24 23:07 vector-flow

/rerun-workflow "Publish Katib Core Images"

andreyvelich avatar Jul 16 '24 10:07 andreyvelich

/rerun-all

andreyvelich avatar Jul 16 '24 11:07 andreyvelich

@tariq-hasan @tenzen-y It looks like /rerun-all is working only right now. We might need to wait until this one will be merged to rerun specific workflow: https://github.com/estroz/rerun-actions/pull/3

andreyvelich avatar Jul 16 '24 11:07 andreyvelich

Thank you for the clarification.

I did some further investigation on the Publish Katib Core Images workflow.

I found that the workflow fails within the final stage of the Dockerfile due to the results coming from npm ci.

If I replace npm ci with npm install --no-package-lock the workflow succeeds.

That said, npm ci uses package-lock.json instead of package.json and is therefore a more stringent validation of the configuration in the frontend folder. In addition, npm ci is less resource-intensive compared to npm install. As such, I suppose that npm install --no-package-lock is not a feasible option.

As npm install --no-package-lock succeeds and npm ci fails the issue could be due to package-lock.json not working in the node:14 base image as it was generated in a different environment on my local.

I will try to do some more investigation to determine why the build is failing and find an appropriate fix.

vector-flow avatar Jul 16 '24 14:07 vector-flow

Thank you for this deep investigation @tariq-hasan!

@kimwnasptd @elenzio9 @orfeas-k @kubeflow/wg-data-leads @ederign @alexcreasy cc some UI folks on problems with Katib and Kubeflow UIs. Please can you help us to resolve problems with npm and node ?

andreyvelich avatar Jul 16 '24 18:07 andreyvelich

@tariq-hasan @andreyvelich I think the answer is already highlighted in that previous commnet

package-lock.json not working in the node:14 base image as it was generated in a different environment on my local.

If the build succeeds with npm install --no-package-lock, we 'd need to update package-lock.json file (and commit it) from the same node and npm version with the one being used in the CI.

orfeas-k avatar Jul 17 '24 06:07 orfeas-k

What changes do we need to make for Kubeflow UI library ?

andreyvelich avatar Jul 17 '24 13:07 andreyvelich

@andreyvelich I believe what @orfeas-k meant is that @tariq-hasan generated package-lock.json file in the PR needs to match the node (and npm) version of the node where the 'Publish Katib Core Images' is being executed.

@tariq-hasan, could you please double-check that?

Also, I'm new to the project, but is there a reason why we are upgrading to node 14 instead of node 16 like kubeflow/kubeflow is using?

ederign avatar Jul 18 '24 12:07 ederign

Also, I'm new to the project, but is there a reason why we are upgrading to node 14 instead of node 16 like kubeflow/kubeflow is using?

I think, the common Kubeflow frontend component that we use doesn't use Node 16 today: https://github.com/kubeflow/kubeflow/blob/master/components/crud-web-apps/common/frontend/kubeflow-common-lib/package.json#L62

@tariq-hasan Did you check if cypress tests can be succeeded on Node 16 ? If yes, we should just update it.

andreyvelich avatar Jul 18 '24 14:07 andreyvelich

I am upgrading kubeflow-common-library in crud-webapps as well as katib-ui to node 16.

That said, the following components are still using node 12:

I will also try to reproduce the build from a Dockerfile on my local to ensure that the package-lock.json is correctly generated.

vector-flow avatar Jul 19 '24 10:07 vector-flow

/rerun-all

vector-flow avatar Jul 29 '24 23:07 vector-flow

/rerun-all

vector-flow avatar Jul 30 '24 01:07 vector-flow

Using package-lock.json from a build of the Dockerfile did not fix the errors.

However, upgrades of both crud-webapps/common/frontend/kubeflow-common-library and katib-ui to node 16 fixed the errors.

Please let me know if there are any feedback.

I will then create a PR to move kubeflow-common-library to node 16.

vector-flow avatar Jul 30 '24 01:07 vector-flow

Yes, @tariq-hasan please create PR into kubeflow/kubeflow to update Node version for common library.

andreyvelich avatar Jul 30 '24 14:07 andreyvelich

Got it. As part of the PR, I will upgrade the following components from version 12 to version 16 to ensure compatibility with kubeflow-common-library and for the CI tests in kubeflow/kubeflow to pass successfully.

vector-flow avatar Jul 31 '24 00:07 vector-flow

Got it. As part of the PR, I will upgrade the following components from version 12 to version 16 to ensure compatibility with kubeflow-common-library and for the CI tests in kubeflow/kubeflow to pass successfully.

Thank you for the great investigation! Once you open the PR in the kubeflow/kubeflow repository, please put the link on this PR.

tenzen-y avatar Jul 31 '24 06:07 tenzen-y

@tariq-hasan Any success with submitting PR into kubeflow/kubeflow so we can fix the e2e tests in Katib UI ?

andreyvelich avatar Aug 07 '24 13:08 andreyvelich

I apologize for the delay. I've updated the configuration for tensorboards. I am working through jupyter, volumes and centraldashboard-angular, and will create a PR in kubeflow/kubeflow shortly.

vector-flow avatar Aug 14 '24 17:08 vector-flow

Thank you @tariq-hasan, appreciate your time for it!

andreyvelich avatar Aug 15 '24 17:08 andreyvelich

/rerun-all

vector-flow avatar Aug 24 '24 15:08 vector-flow

I have created a PR for kubeflow/kubeflow: https://github.com/kubeflow/kubeflow/pull/7637.

This PR upgrades the node version from 12 to 16 for kubeflow-common-lib and its dependent components: jupyter, tensorboards, volumes and centraldashboard-angular.

vector-flow avatar Aug 24 '24 18:08 vector-flow

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 22 '24 20:11 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar Nov 23 '24 02:11 andreyvelich

@andreyvelich @tenzen-y I was wondering if you have any thoughts on why the E2E Test with enas-cnn-cifar10 could be failing.

I have re-run it multiple times but to no success.

I have not seen this test fail before so this is quite odd.

vector-flow avatar Nov 29 '24 03:11 vector-flow

/rerun-all

vector-flow avatar Dec 03 '24 03:12 vector-flow

/lgtm /approve

andreyvelich avatar Dec 03 '24 14:12 andreyvelich