[BUG] - conda environments fail to build

Open iameskild opened this issue 2 years ago • 4 comments

OS system and architecture in which you are running QHub

Ubuntu on GCP

Expected behavior

Creating a conda environment in the filesystem namespace (from the qhub-config.yaml) or my personal namespace should build my environment (provided that it is a valid env).

Actual behavior

When submitting a conda environment (in the filesystem namespace or in my personal namespace), the build fails with the following error message:

(example of build failing in the filesystem ns)

Looking for: ['python==3.9.13', 'ipykernel==6.15.1', 'ipywidgets==7.7.1', 'qhub-dask==0.4.3', 'param==1.12.2', 'python-graphviz==0.20.1', 'matplotlib==3.3.2', 'panel==0.13.1', 'voila==0.3.6', 'streamlit==1.10.0', 'dash==2.6.1', 'cdsdashboards-singleuser==0.6.2']


Preparing transaction: ...working... failed

CondaError: Unable to create prefix directory '/home/conda/filesystem/7f7f767440c1987bc8eeacb1741b638c71c44f30ffb25d9e0503b6f2f4d9fe11-20220819-012441-874213-109-cds'.
Check that you have sufficient permissions.

How to Reproduce the problem?

Build any valid conda env from the conda-store endpoint or by adding it to the qhub-config.yaml, and it will fail to build.

Command output

No response

Versions and dependencies used.

qhub version: v0.4.4rc3
conda-store version: v0.4.9 or v0.4.11

Compute environment

No response

Integrations

No response

Anything else?

No response

iameskild avatar Aug 19 '22 01:08 iameskild

@iameskild this has to do with a change that I made in the container default uid/gid. I'll provide a fix tomorrow morning

costrouc avatar Aug 19 '22 03:08 costrouc

@costrouc @viniciusdc moving our slack conversation here for posterity.

CO: Issue is that conda-store in roughly 0.4.5+ now runs as user 1000 and not 0. So it no longer has permissions in that folder. Not sure what the best route is. conda-store long term should not be running as root. I might chmod + chown that directory for conda-store

VC: I would say that long term each namespace/environment should use a permission uuid based on the Keycloak permission system (though that might be a lot harder). For now, some kind of auto-migration system from conda-store itself to move any environments and update their permissions would work, right?

VC: > chmod + chown that directory for conda-store
Could we have a conda-store group, is that feasible? Then we don't need to worry about user permissions

I think it makes sense to restrict the conda-store's permissions.

As for how to go about ensuring this isn't a breaking change, could we add an initContainer as follows to the conda-store worker deployment:

      initContainers:
      - command:
        - /bin/chown
        - -R
        - "1000:1000"
        - /home/conda
        image: busybox:latest
        name: chmod-er
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /home/conda
          mountPropagation: None
          name: storage

I've tested this today on quansight-beta.qhub.dev and it does appear to correctly change the permissions for the existing files/folders under /home/conda:

drwxr-xr-x 13 1000 1000  4096 Aug 22 23:03 [email protected]

However I run into another permissions issue whenever I try to create a new env. The "default" gid still appears to be root:

drwxrwxr-x 13 1000 root 4096 Aug 22 23:04 6199e7747550f21efc268c887c71da3fc46117fe8f3b82876b2cfdfb14db7020-20220822-230259-522581-122-eae_test_5

Then when conda-store tries to change ownership, the following issue arises:

Logs from the conda-store-worker:

chown: changing ownership of '/home/conda/[email protected]/6199e7747550f21efc268c887c71da3fc46117fe8f3b82876b2cfdfb14db7020-20220822-230259-522581-122-eae_test_5': Operation not permitted
2022-08-22 23:04:15,296: WARNING/ForkPoolWorker-2] [CondaStoreWorker] ERROR | Command '['chown', '-R', '1000:1000', '/home/conda/[email protected]/6199e7747550f21efc268c887c71da3fc46117fe8f3b82876b2cfdfb14db7020-20220822-230259-522581-122-eae_test_5']' returned non-zero exit status 1.

I was able to get around this by adding fsGroup: 1000 to the pod's securityContext:

securityContext:
  fsGroup: 1000
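
For reference, the two changes above (the chown initContainer and the pod-level fsGroup) could be combined roughly as follows. This is only a sketch of the relevant part of the conda-store worker pod template, not the actual QHub manifest: the container name `fix-conda-permissions` is hypothetical, all other fields are omitted, and whether `fsGroup` takes effect depends on the volume type backing `/home/conda`.

```yaml
# Sketch only: the pod template fields discussed in this thread.
# "fix-conda-permissions" is a hypothetical name; "storage" is the
# volume name used in the initContainer snippet above.
spec:
  securityContext:
    # New files on the volume get group 1000 instead of root
    # (only on volume types that support fsGroup).
    fsGroup: 1000
  initContainers:
    - name: fix-conda-permissions
      image: busybox:latest
      # Hand existing environments over to the non-root conda-store user.
      command: ["/bin/chown", "-R", "1000:1000", "/home/conda"]
      securityContext:
        privileged: true
      volumeMounts:
        - name: storage
          mountPath: /home/conda
```

The initContainer fixes ownership of what already exists, while `fsGroup` is what avoided the "Operation not permitted" error for newly created environments in my test.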

iameskild avatar Aug 23 '22 04:08 iameskild

The above solution works when updating existing deployments but fails when new users sign in and for fresh deployments. Although the deployment scripts complete successfully, the trouble is that new conda envs can't be created due to permissions issues. This is due to how the initContainers (added by the KubeSpawner) set the permissions for the mounted volumes (specifically the conda-store-mount), see here.

Changing this permission to anything other than root will then break existing deployments. A solution might be to add another initContainer which correctly sets the permissions for all files/folders in /home/conda before the other initContainers are run. The last hurdle for this solution is making sure that this new initContainer is the first one that is executed.
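
A rough sketch of that ordering idea, assuming the user pod mounts the conda-store volume at /home/conda. Kubernetes runs init containers one at a time, in the order they are listed under `spec.initContainers`, so the new container would need to come first in that list. The container names and the volume name below are hypothetical, and the second entry is only a placeholder for the initContainers that KubeSpawner already adds (their real definitions are not reproduced here).

```yaml
# Sketch only: ordering of init containers in a user pod spec.
spec:
  initContainers:
    # Hypothetical new container, listed first so that it runs first.
    - name: fix-conda-permissions
      image: busybox:latest
      command: ["/bin/chown", "-R", "1000:1000", "/home/conda"]
      securityContext:
        privileged: true
      volumeMounts:
        - name: conda-store        # hypothetical volume name
          mountPath: /home/conda
    # Placeholder for the init containers KubeSpawner already adds;
    # they would run after the one above.
    - name: existing-init-container
      image: busybox:latest
      command: ["sh", "-c", "true"]
```

One thing to keep in mind with this approach is that a recursive chown over all of /home/conda on every pod start may get slow as the number of environments grows.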

iameskild avatar Aug 25 '22 04:08 iameskild