DCV authentication issue with AD users at the creation of Linux DCV sessions in a 1Click-HPC cluster

Open vbosquier opened this issue 3 years ago • 0 comments

Ciao Nicola!

As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) slightly modified 1Click-HPC and used it to run the POC's HPC environment. I ran into a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to users stored in the AD attached to the cluster.

I have finally found a fix for the issue but I think it's important to discuss it with you to understand what's the underlying behaviour here...

The symptoms are as follows:

  • launching a DCV session as the system user "centos" using a standard Linux Desktop Service in EF works fine
  • launching a DCV session as a user created in the AD, using the exact same standard Linux Desktop Service in EF, fails because of an authentication issue.

The error message found in slurm-$JobID.out is the following:

[2022/06/09 14:40:15]  INFO  Starting DCV session...
[2022/06/09 14:40:15]  INFO  DCV version supports --gl-display parameter
[2022/06/09 14:40:15]  INFO  Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR  Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL  Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL  Exiting with code 1

After a lot of tests (described below), I found a solution: adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:

id "${USER}"

This acts as a kind of "first connection" for the user trying to start a session on the target system, so that the user is known at the system level before the session is created.
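A slightly more defensive form of this workaround could look like the sketch below. It is only an illustration: the helper name `warm_user_lookup` is hypothetical and not part of EnginFrame. It runs the same `id` lookup, discards the output, and does not abort the job script if the lookup fails.

```shell
# Hypothetical helper (not part of EnginFrame): force a user lookup
# for the job owner before any DCV command runs.
warm_user_lookup() {
    # Use the given name, else $USER, else the effective user name.
    _wul_user="${1:-${USER:-$(id -un)}}"
    if id "${_wul_user}" >/dev/null 2>&1; then
        return 0
    fi
    echo "warning: could not resolve user '${_wul_user}'" >&2
    return 1
}

# Call it without letting a failed lookup stop the job script.
warm_user_lookup "${USER:-}" || true
```

Compared with the bare `id "${USER}"` line, this keeps the job output clean and leaves a warning on stderr when the lookup itself fails.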

With the help of Benjamin Depardon, I reproduced the issue by running the following command on the Head Node of the cluster:

srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session

We then tried all of the following options:

  • restarting gdm only after dcvserver was restarted at the end of the installation process => NOT working

  • restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process => NOT working

  • changing /etc/pam.d/dcv with the following contents

#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth    include password-auth
#account include password-auth
auth    include password-auth
account     required                                     pam_access.so
account     required                                     pam_unix.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required                                     pam_permit.so

=> NOT working

  • running on the remote system the commands: $> getent passwd | grep username or $> getent passwd -s sss | grep username or $> sssctl cache-upgrade => NOT working

  • adding the following command at the very beginning of Slurm's prolog.sh script: $> id "${SLURM_JOB_USER}" => NOT working

  • running the following command on the DCV node before the session was created: $> id username or $> sssctl user-checks username => SUCCESSFUL

  • connecting to the DCV node with SSH as the user username (or as any other user and then switching with the command: $> su - username) before the session was created

=> SUCCESSFUL

Our conclusion is that the user must already be known by the system (and stored in some kind of cache) on the DCV node for the authentication process to allow the execution of the tasks required by the Slurm job.
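To check this conclusion on a freshly started DCV node, a small diagnostic along the following lines could be run before and after the `id` call. It is only a sketch under assumptions: `check_user_resolution` is a hypothetical name, and it assumes `id` and `getent` are available on the node. It tests resolution by name and then by numeric UID, rather than relying on a full `getent passwd` listing, which may not include directory users.

```shell
# Hypothetical diagnostic: report whether a user resolves by name and by UID.
check_user_resolution() {
    _cur_user="$1"
    # Step 1: lookup by name (the same lookup the workaround triggers).
    if ! _cur_uid="$(id -u "${_cur_user}" 2>/dev/null)"; then
        echo "name-lookup-failed"
        return 1
    fi
    # Step 2: reverse lookup by numeric UID.
    if ! getent passwd "${_cur_uid}" >/dev/null 2>&1; then
        echo "uid-lookup-failed"
        return 2
    fi
    echo "ok uid=${_cur_uid}"
}
```

For example, `check_user_resolution username` (with `username` being the AD user from the tests above) prints one status line and returns a non-zero code when either lookup fails.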

Our questions are:

  • is it a known issue?
  • can you explain further how the internal authentication methods of DCV work, and why in our case DCV denied the AD user the authorization to create a session?
  • is there a "better" way to solve it than hacking the EF code the way we did, so that any user in AD can launch a DCV session?

Please don't hesitate to ask for any additional information and to let us know what you think.

Best regards, Vincent.

vbosquier avatar Jun 09 '22 16:06 vbosquier