ceph-container icon indicating copy to clipboard operation
ceph-container copied to clipboard

ship nfs-ganesha V6 in main and squid

Open ktdreyer opened this issue 1 year ago • 16 comments

nfs-ganesha V6.0 had issues in the squid container, referenced at https://github.com/ceph/ceph-container/pull/2244

Would you please ship the latest V6 version that has the fixes we need?

ktdreyer avatar Oct 02 '24 14:10 ktdreyer

CC @kalebskeithley

ktdreyer avatar Oct 02 '24 14:10 ktdreyer

So what exactly is the failure? Log messages?

ffilz avatar Oct 02 '24 20:10 ffilz

I see the log message in https://github.com/rook/rook/issues/14771

dmick avatar Oct 02 '24 21:10 dmick

The suggested fix, as I understand it, was to add a configuration setting to ignore failure of the system call that could be used by users of nfs-ganesha. Again, as I understand it, this means adding something to the nfs-ganesha deploy for ceph.

I proposed another possible fix for the issue, which was to add parameters to the invocation of the nfs-ganesha container to allow the system call to penetrate the seccomp jail.

Both of these seem to require further changes to Ceph nfs-ganesha deployment before we can integrate v6.x. This leads me to further believe that there's been no testing of v6.x on containerized Ceph (by far the majority deployment strategy for Ceph upstream and the only one for downstream).

dmick avatar Oct 02 '24 21:10 dmick

It has been years since Ganesha has got meaningful testing from various adopters prior to a release tag, so it is not at all surprising that V6.0 has problems.

This issue has been identified. One fix went into downstream that was accepted for upstream, but before it was merged upstream, another adopter also hit the problem and proposed a different fix.

What has ultimately gone upstream into V6.1 is a fix agreed on by participants in a Ganesha community call last week. This fix does require a config change when operating in a non-privileged container. That config change allows Ganesha to try the prctl, and if it fails with EPERM, to issue a warning message but proceed to start.

The prctl call was added in Ganesha V6-dev.19.

The prctl call is necessary to prevent a kernel deadlock if there is a local NFS mount using Ganesha as the server. The deadlock can occur if pages need to be flushed via NFS to allow Ganesha to allocate pages to proceed. Since Ganesha is blocked waiting for free pages, it can not serve the request, via the NFS client, to flush the pages necessary to be able to allocate pages.

One thing that isn't clear is if Ganesha running in a container while some other container on the host has an NFS mount, or even a non-containerized NFS mount could also trigger the problem. That may well be the case.

If the prctl is required for containerized Ganesha, then either prctl needs to be allowed for that container, or maybe there's some way to do the prctl for the container as a whole.

In any case, setting the parameter to true will allow Ganesha to proceed without prctl success, and it will be in the same boat as it was pre-V6-dev.19.

ffilz avatar Oct 02 '24 22:10 ffilz

Yep. So the new version will require a change in ceph deployment code before it can even run, and it's untested-in-Ceph code beyond that deployment change. (To be clear I'm talking about at least cephadm/ceph orchestrator; I don't even know if we currently support deploying on non-cephadm deploys.) This PR cannot be accepted until those changes, and perhaps at least some kind of testing, are complete.

dmick avatar Oct 02 '24 22:10 dmick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 18 '24 20:10 github-actions[bot]

Hi @guits

  • Would you please tell us if you've confirmed that nfs-ganesha V6.1 has the fixes we need in order to ship it in main and squid?
  • We have https://github.com/ceph/ceph/blob/squid/container/Containerfile now, so we'll need to make this change in that repo and this one. Is that right @dmick ?

ktdreyer avatar Oct 25 '24 13:10 ktdreyer

@ktdreyer As far as I know it's still the case that we have to change something to add allow_set_io_flusher_fail=true to the nfs-ganesha config /etc/ganesha/ganesha.conf. I'm not exactly sure what agent is going to have to change to set this, probably cephadm, which already knows about ganesha.conf, but without it, the service will fail to start in a container because it can't do a prctl(PR_SET_IO_FLUSHER).

dmick avatar Oct 25 '24 20:10 dmick

...and yes, of course, for CI builds we'll also need to change ceph.git's container/Containerfile

dmick avatar Oct 25 '24 20:10 dmick

Yes, nothing is changed there.

On Fri, Oct 25, 2024 at 1:15 PM Dan Mick @.***> wrote:

@ktdreyer https://github.com/ktdreyer As far as I know it's still the case that we have to change something to add allow_set_io_flusher_fail=true to the nfs-ganesha config /etc/ganesha/ganesha.conf. I'm not exactly sure what agent is going to have to change to set this, probably cephadm, which already knows about ganesha.conf, but without it, the service will fail to start in a container because it can't do a prctl(PR_SET_IO_FLUSHER).

— Reply to this email directly, view it on GitHub https://github.com/ceph/ceph-container/issues/2246#issuecomment-2438714901, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADBQU536TLUQXRSTOXITNDZ5KRGNAVCNFSM6AAAAABPHZE4SGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMZYG4YTIOJQGE . You are receiving this because you commented.Message ID: @.***>

ffilz avatar Oct 26 '24 00:10 ffilz

@ffilz , would you please open a ticket at https://tracker.ceph.com/ so that the cephadm developers understand what changes to make in Ceph's installer?

(Here's an example ticket you can use for inspiration: https://tracker.ceph.com/issues/65144 , it was under "Project: Orchestrator")

ktdreyer avatar Oct 28 '24 13:10 ktdreyer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 12 '24 20:11 github-actions[bot]

@kalebskeithley and @adk3798 met with me today about this. I learned that https://github.com/ceph/ceph/pull/60077 has been in main for a while. As a result, it should be safe to put nfs-ganesha V6.1 into the Containerfile for main.

For the squid container, when we merge https://github.com/ceph/ceph/pull/60360, it will be safe to ship nfs-ganesha V6.1 in the squid container also.

ktdreyer avatar Nov 19 '24 18:11 ktdreyer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 04 '24 20:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] avatar Dec 11 '24 20:12 github-actions[bot]