OVA cannot survive reboot

Open hessio opened this issue 2 years ago • 3 comments

Problem

I was running OVA PI-7, and when I rebooted my machine for a Windows update, the OVA would not work after the restart.

[root@cortx-ova-rgw ~]# hctl status
The connection to the server 192.168.1.12:6443 was refused - did you specify the right host or port?
error: pod, type/name or --filename must be specified

The OVA worked fine before the reboot. I was even able to suspend and resume the OVA without problems, but once it has been powered off it does not come back online.
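
The error above says the connection to 192.168.1.12:6443 was refused. As a rough triage aid, the following Python sketch checks whether a TCP endpoint accepts connections, which helps separate "nothing is listening on the port yet" from "the VM is not reachable at all". The function name `api_port_state` and its return labels are hypothetical and are not part of CORTX or `hctl`; only the address and port come from the output above.

```python
import socket


def api_port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP endpoint as 'open', 'refused', or 'unreachable'.

    Hypothetical helper for illustration only; not part of CORTX or hctl.
    """
    try:
        # create_connection performs a full TCP handshake with a timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on this port.
        return "refused"
    except OSError:
        # Timeouts, missing routes, and name resolution failures land here.
        return "unreachable"
```

Calling `api_port_state("192.168.1.12", 6443)` inside the VM after a reboot would be expected to return "refused" while the error above persists, assuming the address is unchanged.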

Expected behavior

OVA should be able to survive reboot

How to reproduce

Start OVA in VM client and reboot the VM.

Deployment information

VMWare Workstation Pro 16 OVA link

Additional information

Some examples of users facing this issue in the most recent Hackathon:

  • This user in Slack had their VM stop working so they restarted it. Then the OVA stopped working completely and we had to assign them a new instance: https://cortxcommunity.slack.com/archives/C019YDQ8AKT/p1655561393936389
  • This user also faced a similar issue where their VM became completely unusable and we had to assign them a fresh instance: https://cortxcommunity.slack.com/archives/C01EJRHJK39/p1656357933697869
  • This user also had the issue where their VM became unusable, so they rebooted. I suspect it became unusable because the VM was set to turn off after a period of no usage. Once the user rebooted the VM, it was no longer usable and we had to assign them a fresh one: https://cortxcommunity.slack.com/archives/C01F5NS2820/p1656341676537719

hessio avatar Aug 03 '22 13:08 hessio

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-33840. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

cortx-admin avatar Aug 03 '22 13:08 cortx-admin

@hessio, as per the OVA instructions, did you wait 5 to 10 minutes for the cluster services to start after the reboot?

mukul-seagate11 avatar Aug 05 '22 07:08 mukul-seagate11

Yes, sometimes when you wait it will come back online, but most of the time it will completely crash and become unusable.

hessio avatar Aug 10 '22 16:08 hessio

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @mukul-seagate11 for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

stale[bot] avatar Aug 16 '22 02:08 stale[bot]

> Yes, sometimes when you wait it will come back online, but most of the time it will completely crash and become unusable.

We haven't observed this issue during QA validation. Also, once the VM is rebooted, the K8s services take time to come up, which happens after about 10 minutes.

mukul-seagate11 avatar Aug 16 '22 04:08 mukul-seagate11

Yes, but sometimes if you reboot and wait for some time, maybe even 30 minutes, it still will not work.
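
To make the "wait and see" step measurable, here is a minimal Python sketch that polls a readiness check until it passes or a deadline expires. It is a sketch under stated assumptions, not a CORTX tool: the names `hctl_status_ok` and `wait_until` are hypothetical, the 30-minute default simply mirrors the wait mentioned above, and it assumes `hctl status` exits with a non-zero code while the cluster is down.

```python
import subprocess
import time
from typing import Callable, Optional


def hctl_status_ok() -> bool:
    """Return True if `hctl status` exits with code 0.

    Assumption: a non-zero exit code means the cluster is not ready.
    """
    try:
        result = subprocess.run(
            ["hctl", "status"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        # hctl is missing from PATH or the command hung.
        return False
    return result.returncode == 0


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 1800.0,   # 30 minutes, as mentioned in the thread
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll check() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Running `wait_until(hctl_status_ok)` inside the VM after a reboot would report how long the services took to come up, or `None` if they never did, which gives both sides a concrete number to compare.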

hessio avatar Aug 16 '22 08:08 hessio

Can you try and replicate the setup to make sure?

hessio avatar Aug 24 '22 16:08 hessio

@mukul-seagate11 it sounds like the QA validation isn't catching some of these common user experiences. That's expected, of course. But then when these issues do turn up in the community, that's an opportunity to find out what the QA testing missed. Surely the process should be DA replicates (done) -> raises issue -> someone on the engineering team replicates and attempts to ID the issue -> then if they can identify it, we move forward, and if they can't, we go back and attempt to identify the differences between what you did and what we did that led to different failure/success states, yeah?

novium258 avatar Aug 24 '22 16:08 novium258

CC @rshenoy0831, can you check the test validation report to see whether this test case was covered? Otherwise, we will take it up to be fixed in the next release.

mukul-seagate11 avatar Aug 25 '22 04:08 mukul-seagate11

@mukul-seagate11 yes, we have tested this scenario.

rshenoy0831 avatar Aug 25 '22 17:08 rshenoy0831

OVA issues reported by the community (https://github.com/Seagate/cortx/issues?q=is%3Aopen+is%3Aissue+label%3Aova) will be raised to QA, and QA will try to verify them in the next sprint.

hessio avatar Aug 31 '22 14:08 hessio

Haven't observed this issue as per https://github.com/Seagate/cortx/pull/1644

mukul-seagate11 avatar Sep 08 '22 11:09 mukul-seagate11

It's already validated as per the test entry criteria in https://seagate-systems.atlassian.net/wiki/spaces/PRIVATECOR/pages/1170014209/PI-8+Test+Entry+Criteria+Dev+OVA+Community+Build

mukul-seagate11 avatar Sep 16 '22 07:09 mukul-seagate11