dpdispatcher
[BUG] rsync receive data from remote platform failed
Bug summary
I use dpgen to submit a job that runs the fp step on the Sugon platform. The fp configuration is:
"fp": [
  {
    "command": "OMP_NUM_THREADS=1 mpirun -np 4 $abacus | tee out.log",
    "machine": {
      "batch_type": "Slurm",
      "context_type": "SSHContext",
      "local_root": "./",
      "remote_root": "/public/home/abacus/tmp",
      "remote_profile": {
        "key_filename": "sugon",
        "hostname": "cancon.hpccube.com",
        "username": "abacus",
        "port": 65023
      }
    },
    "resources": {
      "batch_type": "Slurm",
      "number_node": 1,
      "cpu_per_node": 32,
      "group_size": 1,
      "queue_name": "kshdnormal",
      "custom_flags": [
        "#SBATCH --gres=dcu:4"
      ],
      "source_list": [
        "/public/home/abacus/run_dcu.sh"
      ]
    }
  }
]
The fp job is submitted to Sugon and ABACUS runs successfully, but the following error is raised when dpgen retrieves the results:
2024-01-23 13:53:23,653 - ERROR : Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
self.download_jobs()
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
self.machine.context.download(self)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
self._get_files(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
self.ssh_session.get(from_f, to_f)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
return rsync(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 136, in rsync
raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
2024-01-23 13:53:23,655 - INFO : Retrying in 1 minute...
It seems that rsync tries to chown the downloaded file on the local side, and that operation fails.
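As an illustration of how this failure can be told apart from other rsync errors, here is a minimal sketch. The helper name `is_local_chown_failure` is hypothetical and not part of dpdispatcher. It relies only on what the log above shows: rsync exit code 23 (partial transfer) together with a `chown ... Operation not permitted` message on stderr. The path in the sample is a shortened placeholder.

```python
def is_local_chown_failure(returncode: int, stderr: bytes) -> bool:
    """Heuristic: True if rsync failed only because chown was refused.

    Hypothetical helper for illustration; not a dpdispatcher API.
    rsync exit code 23 means "partial transfer due to error".
    """
    return (
        returncode == 23
        and b"chown" in stderr
        and b"Operation not permitted" in stderr
    )


# Sample stderr modelled on the log above (placeholder path).
sample_stderr = (
    b'rsync: chown "/personal/test/.result.tar.gz.sKchjf" failed: '
    b"Operation not permitted (1)\n"
    b"rsync error: some files/attrs were not transferred "
    b"(see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n"
)

print(is_local_chown_failure(23, sample_stderr))
```

A check like this only classifies the error; it does not fix the transfer.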
DP-GEN Version
0.11.1.dev51+gbea559b
Platform, Python Version, Remote Platform, etc
Platform: bohrium
Python: 3.8.8
Remote Platform: Sugon
Input Files, Running Commands, Error Log, etc.
dpgen.zip
An additional Sugon secret key file named "sugon" is required.
command: dpgen init_bulk init.json machine.json
Steps to Reproduce
- download the Sugon secret key file and name it "sugon"
- modify the fp section in machine.json
- submit the job: dpgen init_bulk init.json machine.json
Further Information, Files, and Links
No response
This is not related to the remote machine; it seems you do not have permission to chown on the local machine.
Could you try adding the --no-perms flag to rsync?
I tried adding this flag, but it did not work:
^CTraceback (most recent call last):
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
self.download_jobs()
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
self.machine.context.download(self)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
self._get_files(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
self.ssh_session.get(from_f, to_f)
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
return rsync(
File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 137, in rsync
raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '--no-perms', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/013b6a211b33560666b55f011a60f9771da63b60/013b6a211b33560666b55f011a60f9771da63b60.tar.gz', '/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/013b6a211b33560666b55f011a60f9771da63b60.tar.gz']: b'rsync: chown "/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/.013b6a211b33560666b55f011a60f9771da63b60.tar.gz.JIoelN" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
This issue may be related to the permissions of the Bohrium directory "/personal". When I run this test under other paths, it works.
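One way to check whether a given local directory refuses chown is to probe it directly. The sketch below uses a hypothetical helper, `can_chown_in`, which creates a temporary file in the directory and tries `os.chown` on it. By default it uses the current uid/gid, which is a weaker test than what rsync does when it tries to apply the remote owner, so a `True` result does not guarantee that rsync will succeed. It assumes a POSIX system.

```python
import os
import tempfile
from typing import Optional


def can_chown_in(directory: str,
                 uid: Optional[int] = None,
                 gid: Optional[int] = None) -> bool:
    """Return True if a file created in `directory` can be chown'ed.

    Hypothetical probe for illustration. `uid`/`gid` default to the
    current user and group.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.chown(path, uid, gid)
        return True
    except PermissionError:
        # Same EPERM condition that rsync reports as "Operation not permitted".
        return False
    finally:
        os.remove(path)


# Compare a directory that works with the one that fails, e.g. "/personal".
print(can_chown_in(tempfile.gettempdir()))
```

Running the probe on both "/personal" and a path where the download works may help confirm whether the directory itself is the cause.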
Try --no-o. I guess --no-g may also be required. Below is the explanation.
-r, --recursive    recurse into directories
-l, --links        copy symlinks as symlinks
-p, --perms        preserve permissions
-t, --times        preserve modification times
-o, --owner        preserve owner (super-user only)
-g, --group        preserve group
-D                 same as --devices --specials
    --devices      preserve device files (super-user only)
    --specials     preserve special files
-a is equivalent to -rltpgoD, so it implies -o and -g, the two options that make rsync change ownership.
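A minimal sketch of the suggestion above: keep `-az` but append `--no-o` and `--no-g`, so rsync no longer tries to preserve owner and group. The function `build_rsync_cmd` and its parameters are hypothetical and only mirror the command layout visible in the error log; this is not dpdispatcher's actual implementation, and the paths in the usage example are placeholders.

```python
from typing import List, Optional


def build_rsync_cmd(src: str,
                    dst: str,
                    port: int = 22,
                    key_filename: Optional[str] = None,
                    preserve_owner: bool = False) -> List[str]:
    """Build an rsync argument list shaped like the one in the log.

    Hypothetical helper for illustration only.
    """
    ssh_cmd = [
        "ssh",
        "-o", "ConnectTimeout=10",
        "-o", "BatchMode=yes",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        "-q",
    ]
    if key_filename is not None:
        ssh_cmd += ["-i", key_filename]

    cmd = ["rsync", "-az"]
    if not preserve_owner:
        # -a implies -o and -g; switch both off so no chown is attempted.
        cmd += ["--no-o", "--no-g"]
    cmd += ["-e", " ".join(ssh_cmd), "-q", src, dst]
    return cmd


# Placeholder source and destination paths.
cmd = build_rsync_cmd(
    "abacus@cancon.hpccube.com:/public/home/abacus/tmp/job/job.tar.gz",
    "/personal/test/job.tar.gz",
    port=65023,
    key_filename="sugon",
)
print(cmd)
```

The list can be passed to `subprocess.run(cmd, capture_output=True)`. Whether dropping owner and group preservation is acceptable depends on whether downstream steps rely on the remote ownership; for a downloaded result archive that is usually not the case.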
I transferred the issue to dpdispatcher, as it is more relevant there.