vdsm: get gluster volume info from any gluster peer

Open josgutie opened this issue 1 year ago • 7 comments

The function _get_gluster_volinfo queries the GlusterFS volume info from the storage server. This is translated into the gluster client call by adding the --remote-host parameter, which limits the query to that one server, so we are turning the storage server into a single point of failure: if it is not available, it can lead to a cluster outage. The proposed change lets the gluster CLI use any available gluster peer.
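
For context, the call shape is roughly as follows (a simplified sketch of the idea, not the actual vdsm code; the function name and argument handling are illustrative):

import subprocess


def get_volume_info(volume, volfile_server=None):
    """Sketch of querying gluster volume info; not the real vdsm helper.

    With volfile_server set, --remote-host pins the query to that single
    server, so the call fails when that server is down. Without it, the
    gluster CLI can be answered by any available peer, which is what this
    change relies on.
    """
    cmd = ["gluster", "--mode=script", "volume", "info", volume, "--xml"]
    if volfile_server is not None:
        cmd.append("--remote-host=%s" % volfile_server)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout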

josgutie avatar Mar 20 '24 12:03 josgutie

> Why did you send a new PR when we have #408?
>
> You need to answer my questions in #408. Please close this PR so we can continue the review on the existing PR.

Because that PR was badly formed, and I think this PR is cleaner.

Here are your questions from the other PR:

> I think this was added years ago because not specifying the server could be a problem. Not sure that the gluster cluster can cope with a case when one of the servers is down.

The Gluster cluster can cope with one of the servers being down. The issue is triggered because adding --remote-host to the gluster CLI limits the query to that host, preventing queries to the backup servers. Maybe it's a mechanism to avoid registering outdated info.

> Please check git history to understand when and why self.volfileserver was added.

In the git history the function was always called with self._volfileserver, as you can see here: https://github.com/oVirt/vdsm/commit/907560b766835e585a162aa8b36d786e17f62384.

> Also how did you test this change? Please describe what was tested - how you simulated the issue and tested that this change solves the issue, and how you tested that this change does not cause regressions in other flows.

Regarding the tests, I will describe them in a new comment if you don't mind.

josgutie avatar Mar 20 '24 13:03 josgutie

@josgutie ok, let's continue here, but please do not close PRs and submit new ones like this. If there was an issue with the previous PR, it could easily have been fixed by pushing a new version.

nirs avatar Mar 20 '24 14:03 nirs

  • Environment:
    • RHHI: 1.8
    • ovirt-engine: 4.5.3.10
    • RHVH: 4.5.3.10
      • vdsm: 4.50.3.9-1
      • vdsm-gluster: 4.50.3.9-1
      • glusterfs: 6.0-63
  • SPM: rhhi02.storage01.lab
  • Gluster Storage domains configuration:

$ /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "SELECT connection, mount_options FROM storage_server_connections;"
          connection           |                          mount_options
-------------------------------+------------------------------------------------------------------
 rhhi01.storage01.lab:/data    | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
 rhhi03.storage01.lab:/engine  | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab
 rhhi03.storage01.lab:/vmstore | backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ grep -E "storage|mnt_options" /etc/ovirt-hosted-engine/hosted-engine.conf
storage=rhhi03.storage01.lab:/engine
mnt_options=backup-volfile-servers=rhhi01.storage01.lab:rhhi02.storage01.lab

$ df -h | grep glusterSD
rhhi03.storage01.lab:/engine   100G   14G   87G  14% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_engine
rhhi03.storage01.lab:/vmstore  150G  2.6G  148G   2% /rhev/data-center/mnt/glusterSD/rhhi03.storage01.lab:_vmstore
rhhi01.storage01.lab:/data      50G  908M   50G   2% /rhev/data-center/mnt/glusterSD/rhhi01.storage01.lab:_data

josgutie avatar Mar 21 '24 11:03 josgutie

  • Issue:

    • Put the data storage domain into maintenance mode.
    • Shut down the rhhi01 host.
    • The data storage domain can't be activated.
  • Here are the tests completed and their result after applying the patch.

    • Scenario of the issue:

      • Result: SD data can be activated.
    • Environment power off:

      • Scenario:
        • Host rhhi01 is powered off, the rest are active.
        • Shutdown managed by the ansible role rhv.shutdown_env.
      • Result:
        • Shutdown without any issues.
    • Environment power up

      • Only two hosts started, rhhi02 and rhhi03.
      • Result:
        • All Storage Domains are active
        • Cluster and Data Center active.

@nirs suggested testing a new deployment; it will take some time, but I think I can check that scenario.

josgutie avatar Mar 21 '24 13:03 josgutie

> @nirs suggested testing a new deployment; it will take some time, but I think I can check that scenario.

The test looks good, and sure, we want to test the common operations like creating a new storage domain, putting a storage domain into maintenance, and activating it.

But I think your change is not complete - it assumes that backup-volfile-servers= is configured with all the nodes. In this case it seems that gluster does use any of the nodes and does not need the volfileserver argument (otherwise it would not work). But if I remember correctly, backup-volfile-servers= is optional on the engine side, so your change must handle the case when it is not configured there.

So the logic seems to be:

if backup_volfile_servers:
    omit the volfileserver argument
else:
    use the volfileserver argument

Also, not using backup-volfile-servers seems very fragile, so we can consider warning about it when connecting to the storage.
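
A minimal sketch of that logic (illustrative names only, not the actual vdsm code):

def build_volinfo_command(volume, volfile_server, backup_volfile_servers):
    """Sketch of the logic above; names are illustrative, not vdsm code."""
    cmd = ["gluster", "--mode=script", "volume", "info", volume, "--xml"]
    if not backup_volfile_servers:
        # No backup servers configured on the engine side: keep pinning
        # the query to the volfile server, as today.
        cmd.append("--remote-host=%s" % volfile_server)
    return cmd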

nirs avatar Mar 21 '24 13:03 nirs

@nirs the gluster command doesn't have any parameter to point to the backup servers; you can review the client documentation (Gluster Command Line Interface). In fact, it's hard to find the definition of the --remote-host gluster client parameter.

Regarding the suggested change, the problem is that if the volfile server is down, the call will fail, because --remote-host limits the call to the volfile server and skips the backup servers.

We could deal with connection issues in that part of the code, but from my point of view that is the gluster CLI's responsibility, not vdsm's.
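
For completeness, if we did handle it on the vdsm side, it could look roughly like this (a hypothetical sketch, not the PR's actual code; names are illustrative):

import subprocess


def get_volinfo_with_fallback(
        volume, volfile_server, backup_volfile_servers):
    """Hypothetical sketch: query the volfile server first, then fall back
    to each backup server if the gluster CLI call fails."""
    servers = [volfile_server] + list(backup_volfile_servers or [])
    last_error = None
    for server in servers:
        cmd = ["gluster", "--mode=script", "volume", "info", volume,
               "--xml", "--remote-host=%s" % server]
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, check=True)
            return result.stdout
        except (subprocess.CalledProcessError, OSError) as e:
            last_error = e
    raise last_error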

I will create and destroy a gluster storage domain.

josgutie avatar Mar 21 '24 15:03 josgutie

New tests completed:

  • GlusterFS based Storage Domain creation:
    • Create a new Storage Domain from a gluster volume.
    • New domain parameters:
      • Name: data2
      • Storage Type: GlusterFS
      • Host: rhhi01.libvirt.lab
      • Path: rhhi01.libvirt.lab:/data2
      • Mount Options: backup-volfile-servers=rhhi02.storage01.lab:rhhi03.storage01.lab
    • Result: Storage Domain data2 created.
  • GlusterFS based Storage Domain maintenance:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain activation:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain detach:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain attach:
    • Domain name: data2
    • Result: Correct
  • GlusterFS based Storage Domain remove:
    • Domain name: data2
    • Result: Correct, gluster volume is empty.

The only thing left to test is to perform a full RHHI 1.8 deployment.

josgutie avatar Mar 25 '24 12:03 josgutie

Hi, a new RHHI 1.8 deployment test has been completed. Scenario:

  • RHVH: rhvh-4.5.3.9
  • SHE: 4.5.0.7-0.9
  • Hypervisors: 3
  • Storage domains: engine, vmstore and data
  • Networking:
    • ovirtmgmt
    • gluster (MTU=9000)
    • livemigration

Result: Deployment completed without any issue.

In this new deployment I've created a new brick, a GlusterFS volume, and a GlusterFS storage domain.

If you need further details of the tests performed, do not hesitate to ask.

josgutie avatar Apr 02 '24 10:04 josgutie

/ost

aesteve-rh avatar Apr 02 '24 13:04 aesteve-rh

@josgutie Thanks, your test battery looks correct. @nirs wdyt? do you still have concerns regarding the update?

@josgutie Could you check the module tests that are failing in the pipelines? Probably expectations changed after the update.

aesteve-rh avatar Apr 02 '24 13:04 aesteve-rh

/ost

aesteve-rh avatar Apr 03 '24 12:04 aesteve-rh

@josgutie the linter failed, mind the overly long lines. The limit is 79 characters.

I was also wondering if you could add a small test that throws an exception and triggers the retry with the backup servers, so that at least we add some coverage for the new code.
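
Something along these lines could be a starting point (a hypothetical pytest sketch; get_volinfo_with_fallback here is a stand-in for whatever retry helper the PR actually adds, not the real vdsm API):

import pytest


def get_volinfo_with_fallback(
        volume_info, volume, volfile_server, backup_servers):
    # Stand-in for the new code under test; volume_info is injected so
    # the test does not need to run the real gluster CLI.
    last_error = None
    for server in [volfile_server] + list(backup_servers or []):
        try:
            return volume_info(volume, remote_server=server)
        except RuntimeError as e:
            last_error = e
    raise last_error


def test_retries_with_backup_servers():
    calls = []

    def fake_volume_info(volume, remote_server=None):
        calls.append(remote_server)
        if remote_server == "rhhi01":
            raise RuntimeError("connection refused")  # first server down
        return {"volume": volume, "answered_by": remote_server}

    info = get_volinfo_with_fallback(
        fake_volume_info, "data", "rhhi01", ["rhhi02", "rhhi03"])

    assert calls == ["rhhi01", "rhhi02"]
    assert info["answered_by"] == "rhhi02"


def test_raises_when_all_servers_fail():
    def always_fail(volume, remote_server=None):
        raise RuntimeError("connection refused")

    with pytest.raises(RuntimeError):
        get_volinfo_with_fallback(always_fail, "data", "rhhi01", ["rhhi02"])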

aesteve-rh avatar Apr 10 '24 12:04 aesteve-rh

/ost

aesteve-rh avatar Apr 18 '24 12:04 aesteve-rh

/ost

aesteve-rh avatar Apr 19 '24 07:04 aesteve-rh

/ost

aesteve-rh avatar Apr 19 '24 12:04 aesteve-rh

/ost

aesteve-rh avatar Apr 22 '24 12:04 aesteve-rh

/ost el8stream basic-suite-master

tinez avatar Apr 22 '24 13:04 tinez

/ost basic-suite-master el8stream

tinez avatar Apr 22 '24 13:04 tinez

el8stream OST is not working, el9stream should pass

michalskrivanek avatar Apr 22 '24 15:04 michalskrivanek

/ost basic-suite-master el9stream

michalskrivanek avatar Apr 22 '24 15:04 michalskrivanek

/ost

aesteve-rh avatar Apr 23 '24 10:04 aesteve-rh

@nirs do you have any comment on this PR? Otherwise, I will integrate.

aesteve-rh avatar Apr 23 '24 12:04 aesteve-rh

Hi Nir, the PR is updated. My doubt is about the test coverage, but it should work in both cases, whether backup servers are defined or not.

josgutie avatar Apr 24 '24 16:04 josgutie

/ost

aesteve-rh avatar Apr 25 '24 08:04 aesteve-rh

/ost

aesteve-rh avatar Apr 25 '24 08:04 aesteve-rh

/ost

aesteve-rh avatar Apr 25 '24 11:04 aesteve-rh