cortx
OVA cannot survive reboot
Problem
I was running the PI-7 OVA, and when I rebooted my machine for a Windows update, the OVA would not work after the restart.
[root@cortx-ova-rgw ~]# hctl status
The connection to the server 192.168.1.12:6443 was refused - did you specify the right host or port?
error: pod, type/name or --filename must be specified
The OVA worked fine before the reboot. I was even able to suspend and resume it without problems, but once it has been powered off it does not come back online.
Expected behavior
OVA should be able to survive reboot
How to reproduce
Start OVA in VM client and reboot the VM.
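To put a number on "it does not come back", one option is to poll the Kubernetes API port from the error above after the reboot and record how long it takes to accept connections, if it ever does. The following is a minimal sketch, not part of the OVA instructions: the function name `wait_for_port` and the default timeout and polling interval are my own assumptions, and an open TCP port only shows that the API server is listening, not that the CORTX cluster services are healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=1800.0, interval_s=10.0, connect_timeout_s=3.0):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns the number of seconds waited on success, or None on timeout.
    The default values are illustrative assumptions, not documented CORTX figures.
    """
    start = time.monotonic()
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # while nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return time.monotonic() - start
        except OSError:
            pass
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


# Example (address and port taken from the hctl error message in this issue):
#   waited = wait_for_port("192.168.1.12", 6443)
#   print("never came up" if waited is None else f"API port open after {waited:.0f}s")
```

Running this right after the VM boots and noting the result on several reboots would show whether the failure is a slow start or a start that never happens.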
Deployment information
VMWare Workstation Pro 16 OVA link
Additional information
Some examples of users facing this issue in the most recent Hackathon:
- This user in Slack had their VM stop working so they restarted it. Then the OVA stopped working completely and we had to assign them a new instance: https://cortxcommunity.slack.com/archives/C019YDQ8AKT/p1655561393936389
- This user also faced a similar issue where their VM became completely unusable and we had to assign them a fresh instance: https://cortxcommunity.slack.com/archives/C01EJRHJK39/p1656357933697869
- This user also had the issue where their VM became unusable, so they rebooted. I suspect it became unusable because the VM was set to turn off after a period of no usage. But once the user rebooted the VM, it was no longer usable and we had to assign them a fresh one: https://cortxcommunity.slack.com/archives/C01F5NS2820/p1656341676537719
For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/CORTX-33840. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.
@hessio, as per OVA instructions, did you wait for 5 to 10 mins till the cluster services are started after reboot?
Yes sometimes when you wait it will come back online but most of the time it will completely crash and become unusable
This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @mukul-seagate11 for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.
> Yes sometimes when you wait it will come back online but most of the time it will completely crash and become unusable
We haven't observed or reported this issue during QA validation. Also, once the OVA is rebooted, the K8s services take time to come up, roughly 10 minutes.
Yes, but sometimes even if you reboot and wait for a long time, maybe even 30 minutes, it still will not work.
Can you try and replicate the setup to make sure?
@mukul-seagate11 it sounds like the QA validation isn't catching some of these common user experiences. That's expected, of course. But when these issues do turn up in the community, that's an opportunity to find out what the QA testing missed. Surely the process should be: DA replicates (done) -> raises issue -> someone on the engineering team replicates and attempts to ID the issue -> then, if they can identify it, we move forward, and if they can't, we go back and attempt to identify the differences between what you did and what we did that led to different failure/success states, yeah?
CC @rshenoy0831, can you check the test validation report to see whether this test case was covered? Otherwise we will take it up to be fixed in the next release.
@mukul-seagate11 yes, we have tested this scenario.
OVA issues reported by the community (https://github.com/Seagate/cortx/issues?q=is%3Aopen+is%3Aissue+label%3Aova) will be raised to QA, and QA will try to verify them in the next sprint.
We haven't observed this issue, as per https://github.com/Seagate/cortx/pull/1644
It's already validated as per the test entry criteria in https://seagate-systems.atlassian.net/wiki/spaces/PRIVATECOR/pages/1170014209/PI-8+Test+Entry+Criteria+Dev+OVA+Community+Build