
Permission denied on TACC when copying data over

Open jdkent opened this issue 4 years ago • 3 comments

reproman is looking awesome; it's so cool to be able to submit a job from my local machine to an HPC.

I ran into a couple snags when running the following code (using the reproman master branch) on TACC (lonestar5):

import os
from textwrap import dedent

import datalad.api as dapi
import reproman.api as rapi


##########################
# Create lonestar resource
##########################
USERNAME = 'jdk3232'
KEY_FILENAME = os.path.join(os.path.expanduser("~"), ".ssh", "id_rsa")
ls_params = {
    'user': USERNAME,
    'key_filename': KEY_FILENAME,
    'host': 'ls5.tacc.utexas.edu'
}

if not any(['lonestar5' in resource[0] for resource in rapi.ls().values()]):
    rapi.create(name="lonestar5", resource_type="ssh", backend_parameters=ls_params)

################
# Create dataset
################
if os.path.isdir('./example'):
    dataset = dapi.Dataset('./example')
else:
    dataset = dapi.create("./example")
    sub_dataset = dapi.create("./output", dataset=dataset)
    dataset.add_readme()
    # create script
    script = "mkdir -p output && pwd > output/pwd.txt"
    with open("./example/script.sh", "w+") as sc:
        sc.write(script)
    os.chmod("./example/script.sh", 0o777)
    dataset.save()
    
##############
# Run reproman
##############

jps = {
    "num_nodes": 1,
    "launcher": 'true',
    "queue": "normal",
    "num_processes": 1,
    "walltime": 1,

}

os.chdir('./example')
rapi.run(
    command=['./script.sh'],
    resref="lonestar5",
    submitter="slurm",
    orchestrator="datalad-local-run",
    job_parameters=jps,
    inputs=["script.sh"],
    outputs=["output/pwd.txt"],
    follow=True,
    )

# remove example directory
# datalad remove -d example --nocheck -r ./example

Snags

  • [ ] I cannot run this code twice because I get a permissions error when copying the data over.
    2021-04-02 09:01:20,069 [INFO   ] No root directory supplied for lonestar5; using '/home1/07723/jdk3232/.reproman/run-root' 
    Traceback (most recent call last):
      File "example.py", line 52, in <module>
        rapi.run(
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/reproman/interface/run.py", line 423, in __call__
        orc.prepare_remote()
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/reproman/support/jobs/orchestrators.py", line 608, in prepare_remote
        session.put(i, op.join(self.working_directory,
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/reproman/resource/ssh.py", line 224, in put
        self.transfer_recursive(
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/reproman/resource/session.py", line 532, in transfer_recursive
        cp_fct(src_path, dest_path)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/fabric/connection.py", line 788, in put
        return Transfer(self).put(*args, **kwargs)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/fabric/transfer.py", line 311, in put
        self.sftp.put(localpath=local, remotepath=remote)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 759, in put
        return self.putfo(fl, remotepath, file_size, callback, confirm)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 714, in putfo
        with self.file(remotepath, "wb") as fr:
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 372, in open
        t, msg = self._request(CMD_OPEN, filename, imode, attrblock)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 813, in _request
        return self._read_response(num)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 865, in _read_response
        self._convert_status(msg)
      File "/home/jdkent/projects/testRepromanTACC/.repo/lib/python3.8/site-packages/paramiko/sftp_client.py", line 896, in _convert_status
        raise IOError(errno.EACCES, text)
    PermissionError: [Errno 13] Permission denied
  • [ ] I need to make the output directory manually in my script since it is not copied over.
  • [ ] (small thing) the output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294)

jdkent, Apr 02 '21 14:04

Thanks for the feedback.

I cannot run this code twice because I get a permissions error when copying the data over.

Presumably you see the same error if you use fabric directly to copy into the run-root directory shown in the output above.

python -c 'from fabric import Connection; Connection("slurm").put("foo", "/path/to/run-root/")'

Do you see the same if you use sftp or scp to copy the file into the run root?

If it's a general permissions issue, I'm not sure there's much to do aside from telling reproman run to use a different location for root_directory.

I need to make the output directory manually in my script since it is not copied over.

Hmm, the current state of leaving it to scripts to ensure that output directories exist seems okay to me, though I think it'd probably be fine for the prepare_remote method of orchestrator classes to create them.
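As a rough sketch of what creating them could involve (an illustration only, not ReproMan's actual prepare_remote; the helper name remote_output_dirs is hypothetical), the parent directories of the declared outputs can be derived from the outputs list and then created on the remote before the job starts:

```python
import posixpath


def remote_output_dirs(outputs, working_directory):
    """Return the remote directories that must exist before outputs are written.

    `outputs` holds paths relative to the job's working directory, as passed
    to `rapi.run(outputs=...)`. Outputs at the top level need no directory.
    """
    dirs = set()
    for out in outputs:
        parent = posixpath.dirname(out)
        if parent:
            dirs.add(posixpath.join(working_directory, parent))
    return sorted(dirs)


# Hypothetical working directory, for illustration only.
for d in remote_output_dirs(["output/pwd.txt"], "/path/to/run-root/job"):
    print("would run on the remote: mkdir -p", d)
```

With something like this in place, the script from the report would no longer need its own mkdir -p output.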

the output suggests stderr and stdout should have the suffix of the job array (e.g., 0, 1, 2, 3), but I get another number instead (e.g., stderr.4294967294)

Thanks for noticing that. That looks to be an interaction with the recently added launcher support. I don't know that we can get per-subjob output files in that case, but an accurate file name should at least be reported.
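One observation about the reported suffix, offered as a hint rather than a diagnosis: 4294967294 is 2**32 - 2, which is what -2 looks like when stored as an unsigned 32-bit integer. That pattern suggests a "no value" sentinel is being substituted for the array task ID rather than a real index, though this thread does not confirm where the value comes from. The arithmetic can be checked as follows (the helper name is hypothetical):

```python
REPORTED_SUFFIX = 4294967294  # taken from the file name stderr.4294967294


def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


assert REPORTED_SUFFIX == 2**32 - 2 == 0xFFFFFFFE
print(as_signed32(REPORTED_SUFFIX))
```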

kyleam, Apr 05 '21 19:04

The permission error does indeed persist with fabric and scp. It does look possible to change the mode of the file when copying so that it can be overwritten later.

This may be a large todo, but I'm curious if existing remote files and local files could be hashed, and only copied over if they changed (when using singularity containers, it would be nice to only have to copy them over once).

It looks like it's possible to change the permissions on the remote file: https://github.com/fabric/fabric/blob/35d7662ee020e8de236577a17571f1428c102479/fabric/transfer.py#L318

Hashing a file can be done in chunks (https://stackoverflow.com/questions/22058048/hashing-a-file-in-python) so that it does not use too much memory. Alternatively, it looks like you could run a shell command such as sha1sum on the remote machine and compare the result with the hash of the local file (and if the remote machine does not have that command, just assume the files differ and copy them over).

smaller ask: chmod the remote file (if it exists) so it can be overwritten.

larger ask: hash local and remote file (if it exists) and overwrite if local is different.
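A minimal sketch of the larger ask, using only the standard library. The function names are hypothetical, and how the remote sha1sum output is obtained (for example through the SSH session) is deliberately left out; the sketch only assumes the usual "DIGEST  FILENAME" output format and treats anything else as "unknown, so copy":

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a local file in fixed-size chunks to keep memory use bounded."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha1sum(output):
    """Extract the hex digest from `sha1sum FILE` output, or None if absent."""
    fields = (output or "").split()
    if not fields:
        return None
    candidate = fields[0].lower()
    if len(candidate) == 40 and all(c in "0123456789abcdef" for c in candidate):
        return candidate
    return None


def should_copy(local_path, remote_sha1sum_output):
    """Copy unless the remote digest is known and matches the local one."""
    remote_digest = parse_sha1sum(remote_sha1sum_output)
    if remote_digest is None:
        # Missing file, missing command, or unexpected output: assume different.
        return True
    return remote_digest != sha1_of_file(local_path)
```

For a large Singularity image this would mean one transfer on the first run and only a hash comparison on later runs, at the cost of reading the file once on each side.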

jdkent, Apr 06 '21 18:04

Both of your suggestions sound like good ideas to me.

smaller ask: chmod the remote file (if it exists) so it can be overwritten.

I think it'd be fine for the plain and datalad-local-run orchestrators to ensure that files have write permissions right after being copied, though I'd prefer not to touch files that are already on the remote. That'd solve the problem going forward, but existing locations would of course need to be adjusted manually.
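A sketch of that post-copy step, under stated assumptions: the helper name is hypothetical, and sftp stands for any object offering stat(path) and chmod(path, mode) the way paramiko's SFTPClient does. This is not code from the orchestrators. It only adds the owner write bit and leaves the other permission bits as they were:

```python
import stat


def ensure_user_writable(sftp, remote_path):
    """Add the owner write bit to a freshly copied remote file.

    `sftp` is assumed to provide `stat(path)` (returning an object with
    `st_mode`) and `chmod(path, mode)`, as paramiko's SFTPClient does.
    Returns the resulting permission bits.
    """
    mode = stat.S_IMODE(sftp.stat(remote_path).st_mode)
    new_mode = mode | stat.S_IWUSR
    if new_mode != mode:
        sftp.chmod(remote_path, new_mode)
    return new_mode
```

An alternative that may be worth considering is passing preserve_mode=False to fabric's Connection.put, so that a read-only local mode is not carried over to the remote copy in the first place; whether that suits the orchestrators is a separate question.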

larger ask: hash local and remote file (if it exists) and overwrite if local is different.

For the plain and datalad-local-run orchestrators, this sounds good too. The local and remote sizes could also be compared first, so that hashing is skipped whenever the sizes already differ.
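A sketch of the size-before-hash ordering, with hypothetical names. The hash functions are passed in as zero-argument callables so that neither the local nor the remote file is hashed unless the sizes match; how the remote size and hash are obtained is left open:

```python
import os


def needs_transfer(local_path, remote_size, local_hash, remote_hash):
    """Decide whether a local file has to be copied to the remote.

    remote_size: size in bytes of the remote file, or None if it is missing.
    local_hash, remote_hash: zero-argument callables returning digests;
    they are only called when the sizes are equal.
    """
    if remote_size is None:
        return True  # nothing on the remote yet
    if os.path.getsize(local_path) != remote_size:
        return True  # different sizes imply different content
    return local_hash() != remote_hash()
```

Equal sizes do not prove equal content, so the hash comparison is still needed in that case; the size check only removes the cost when the answer is already obvious.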

For the other orchestrators, the target location is a Git repository, and git-annex/DataLad handles these details. I'm guessing you're using datalad-local-run because you don't have git-annex available on the remote, but if that's not the case, I'd recommend you use datalad-pair or datalad-pair-run.

kyleam, Apr 06 '21 20:04