aiida-core
Engine: Fix bug introduced when refactoring `upload_calculation`
In 6898ff4d8c263cf08707c61411a005f6a7f731dd the implementation of the processing of the `local_copy_list` in the `upload_calculation` method was changed. Originally, the files specified by the `local_copy_list` were first copied into the `SandboxFolder` before its contents were copied to the working directory using the transport. The commit changed the order in which the local and sandbox files were copied, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that was already created by the copying of the sandbox, an exception would be raised.
For example, if the sandbox contained the directory `folder` and the `local_copy_list` contained the item `(_, 'file', './folder/file')`, this would work just fine in the original implementation, as the file would be written to `folder` in the remote working directory. The current implementation, however, would write the file content to `folder/file` in a local temporary directory, and then iterate over the directories and copy them over. Essentially it would be doing `Transport.put('/tmpdir/folder', 'folder')`, but since `folder` would already exist in the remote working directory, the local folder would be nested and the final path on the remote would be `/workingdir/folder/folder/file`.
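This nesting can be reproduced outside AiiDA with plain Python; `cp_like_put` below is a hypothetical stand-in for the `cp -r`-style semantics of `Transport.put`, not the actual transport code:

```python
import pathlib
import shutil
import tempfile

def cp_like_put(source: pathlib.Path, target: pathlib.Path) -> None:
    """Mimic ``cp -r``: if ``target`` already exists as a directory, the
    source directory is copied *inside* it; otherwise it becomes ``target``."""
    if target.is_dir():
        target = target / source.name  # this is where the nesting happens
    shutil.copytree(source, target)

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    # The sandbox copy already created 'folder' in the remote working dir
    workdir = root / 'workingdir'
    (workdir / 'folder').mkdir(parents=True)
    # The local_copy_list staged 'folder/file' in a local temporary directory
    localtmp = root / 'localtmp'
    (localtmp / 'folder').mkdir(parents=True)
    (localtmp / 'folder' / 'file').write_text('content')
    # Equivalent of Transport.put('/tmpdir/folder', 'folder'):
    cp_like_put(localtmp / 'folder', workdir / 'folder')
    # The file ends up at workingdir/folder/folder/file, not workingdir/folder/file
    assert (workdir / 'folder' / 'folder' / 'file').is_file()
```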
The correct approach is to copy each item of the `local_copy_list` from the local temporary directory individually using the `Transport.put` interface, and not iterate over the top-level entries of the temporary directory at the end.
Note that this is a critical bug that is currently on `main`, as it breaks even something as simple as running a `PwCalculation` from `aiida-quantumespresso`, so it has to be fixed before release.
Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the `pseudo` folder would be copied inside an already existing one, causing QE to crash:
```
$ tree
.
├── CRASH
├── _aiidasubmit.sh
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.in
├── aiida.out
├── out
└── pseudo
    └── pseudo
        └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
```
Can confirm that the changes in this PR fix that issue.
> Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the `pseudo` folder would be copied inside an already existing one, causing QE to crash:
So he is running off the `main` branch, correct? Because that bug should not be present in an actual release. Can we merge this PR then? Or do you have doubts about the changes?
> So he is running off the `main` branch, correct?
Yes, it seems so, and reverting to v2.5.1 fixed the issue.
> Can we merge this PR then? Or do you have doubts about the changes?
I still want to have a proper look at the code, so I can also make sure I understand the changes in https://github.com/aiidateam/aiida-core/commit/6898ff4d8c263cf08707c61411a005f6a7f731dd. Should have time for this tomorrow or Friday.
Looking into this some more, won't the following line cause similar woes?
https://github.com/sphuber/aiida-core/blob/20fc81065731a05eb5ee1da95976bda504d7f67d/src/aiida/engine/daemon/execmanager.py#L293
I was playing around with the following code:
```python
import os
import pathlib
import shutil
from tempfile import TemporaryDirectory

from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

folder_data = orm.FolderData(tree=pseudo_path)

shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:
        dirpath = pathlib.Path(tmpdir)
        data_node = folder_data
        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)
        data_node.base.repository.copy_tree(filepath_target, 'pseudo')
        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')
```
The code above will give the following directory tree:
```
.
├── data
│   └── pseudo
│       ├── Ba.upf
│       └── Si.upf
├── sandbox.ipynb
└── workdir
    └── pseudo
        ├── Ba.upf
        ├── Si.upf
        └── pseudo
            ├── Ba.upf
            └── Si.upf
```
But switching the order of the `copy_local` and `transport.copy` will not result in the nested `pseudo` folder.
> But switching the order of the `copy_local` and `transport.copy` will not result in the nested `pseudo` folder.
Sure, but that is because you are calling the following:

```python
transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')
```
And that is saying: copy the source `pseudo` inside the target `pseudo` directory. It is not saying: update the contents of the target `pseudo` dir with the contents of the source `pseudo`. What you are trying to do is:

```python
transport.copy(os.path.join(pseudo_path, 'pseudo/*'), 'pseudo')
```

So you are globbing the contents of `pseudo` and copying that into the target `pseudo`. If you change this line, then your script works as expected. This is also exactly the change I added in this PR for the local copy.
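A plain-Python sketch of that difference (the `cp_like` helper below is hypothetical, mimicking `cp`-style semantics; it is not the transport implementation):

```python
import glob
import pathlib
import shutil
import tempfile

def cp_like(source_pattern: str, target: str) -> None:
    """Sketch of cp -r semantics: if target is an existing directory, each
    glob match is copied *inside* it; otherwise the source becomes target."""
    dst_dir = pathlib.Path(target)
    for match in glob.glob(source_pattern):
        src = pathlib.Path(match)
        dst = dst_dir / src.name if dst_dir.is_dir() else dst_dir
        if src.is_dir():
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy(src, dst)

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'data' / 'pseudo').mkdir(parents=True)
    (root / 'data' / 'pseudo' / 'Si.upf').write_text('Si pseudo')
    (root / 'pseudo').mkdir()  # target already created by an earlier copy

    # Without the glob, the source dir is nested inside the existing target:
    cp_like(str(root / 'data' / 'pseudo'), str(root / 'pseudo'))
    assert (root / 'pseudo' / 'pseudo' / 'Si.upf').is_file()

    # With the glob, the *contents* are merged into the existing target:
    cp_like(str(root / 'data' / 'pseudo' / '*'), str(root / 'pseudo'))
    assert (root / 'pseudo' / 'Si.upf').is_file()
```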
So I don't think there is a regression in the behavior of `transport.copy` as it is used for the `remote_copy_list`, is there?
> So I don't think there is a regression in the behavior of `transport.copy` as it is used for the `remote_copy_list`, is there?
I don't think there is a regression, I was just wondering if we should make a similar change for `remote_copy_list`, since now the copying behavior for the two is different when the folder is already present due to a previous copying operation.
I rewrote the example to rely on the functions in the `execmanager` to see if I can break things (as I like to do):
```python
from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
```
Critical of course are the `local_copy_list_item` and `remote_copy_list_item` definitions:

```python
local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')
```

Here, all is well, since I use the glob `*` to copy the contents of `/Users/mbercx/project/core/jupyter/data/pseudo`, as you noted. Removing the glob results in the nested `pseudo` dirs, but perhaps this is expected.
If we remove the glob, and invert the `_copy_local_files` and `_copy_remote_files` operations (as we currently allow):

```python
local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
```

the behavior is different, i.e. there is no nested `pseudo` folder. I guess now my question is if this difference in behavior is problematic.
Since `_copy_remote_files` and `_copy_local_files` are not methods for public use, maybe it is not an issue that the behaviors are not parallel with each other?
> Since `_copy_remote_files` and `_copy_local_files` are not methods for public use, maybe it is not an issue that the behaviors are not parallel with each other?
Sure, but they are actually used by the user, albeit indirectly. The `local_copy_list` and `remote_copy_list` instructions are passed directly to these functions. If they now require different instructions to achieve the same behavior, I can see how that would be confusing and inconsistent, for example if in `local_copy_list` you can/have to use `relative_source/*` while for `remote_copy_list` you don't use the glob. So I agree with @mbercx that we should look at making this consistent.
I need to find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on `main` only change the behavior of `local_copy_list`, which is why I had to change it. But `remote_copy_list` should not have changed, so I think the highlighted problem of nesting should be present in v2.5.1 as well.
TLDR: The solution of this PR is wrong. Not even so much for the discrepancy in behavior of local and remote copy lists, but really because the implementation of `_copy_local_files` does not directly respect the copying instructions.
This is a tricky one. The first question is what the behavior of `Transport.put` and `Transport.copy` (intuitively) should be. If you copy `dir_a` to `dir_a`, what should happen if (a) `dir_a` already exists in the target or (b) `dir_a` does not exist? If we follow the behavior of `cp`, then there should be a difference between the two cases. If the target `dir_a` already exists, the source `dir_a` will be copied inside the existing directory, i.e., it will be nested. This is the current behavior of `Transport.put` and `Transport.copy`, so it stands to be argued that that is the expected behavior.
The problem is really due to a detail of the original implementation of `upload_calculation`. There are three sources of input files, provided by the calcjob plugin, that are copied to the remote working directory:

- The sandbox folder
- The `local_copy_list`
- The `remote_copy_list`

The implementation did not literally copy the contents of each sequentially to the working directory. Rather, it would copy the instructions of the `local_copy_list` to the sandbox folder, and then it would copy the sandbox folder wholesale to the working dir. Critically, it did not use the `Transport` interface for the first operation, but simply used normal Python file operations, since the sandbox sits on the local file system. The `remote_copy_list` was actually done through the transport interface directly to the working directory.
In the new implementation this was changed: the three copying steps each copy directly to the remote working dir, and the `local_copy_list` contents are not merged into the sandbox folder first. But this now leads to a problem if the sandbox folder creates a folder on the remote that also exists in the `local_copy_list`, at which point the latter will be nested. This happens for `PwCalculation`, for example, because the plugin manually creates the `pseudo` folder in the sandbox, and then also copies the `UpfData` files to `./pseudo/filename` through the `local_copy_list`.
In principle, getting rid of the "hack" of merging the `local_copy_list` with the sandbox is a good thing in my opinion. However, the exact manner in which I did it is breaking original functionality, I believe. `_copy_local_files` has to copy the files from the instructions to a temporary file on the local disk, because the `Transport` interface does not accept file handles, and AiiDA's repository interface can only provide streaming handles. The problem is that I first copy all items of the `local_copy_list` to a temp dir on disk and then copy all the contents to the remote working dir; however, this is not identical to copying the `local_copy_list` instructions over one by one. So the solution, I think, is to copy each item of the `local_copy_list` as soon as it is written to a temporary file locally, instead of copying them all locally and then copying the temp dir in one go.
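A minimal sketch of that per-item approach (hypothetical helper names, not the actual `_copy_local_files` code): each instruction is staged to a local temporary path and immediately put through the transport, so the target path in the instruction is honored literally and no directory-level merge can nest it:

```python
import pathlib
import tempfile

def put_items(put, items, stream_source) -> None:
    """Hypothetical sketch: ``put(local, remote)`` stands in for Transport.put,
    ``stream_source(key, path)`` writes a repository object to a local path,
    and ``items`` are (source_key, target_relpath) instructions."""
    for source_key, target in items:
        with tempfile.TemporaryDirectory() as tmpdir:
            staging = pathlib.Path(tmpdir) / pathlib.PurePosixPath(target).name
            stream_source(source_key, staging)
            # Put this single item right away: the remote target path comes
            # straight from the instruction, so 'folder/file' lands exactly there.
            put(str(staging), target)
```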
~~Thanks for the critical review @mbercx. My original solution was horribly flawed and was trying to cover up a bad refactoring of `upload_calculation` in an earlier commit. I am now pretty confident that I have found the correct solution, which doesn't introduce differences between local and remote copy operations and does not have to touch the behavior of the transports whatsoever. I have tested the change with `aiida-quantumespresso` and it now works as expected.~~
Scratch that... it is still not quite that simple 😭
@mbercx could you give this another review please? The behavior of the `remote_copy_list` and `local_copy_list` in your example is now the same once again, and I believe the original behavior of the latest release is also restored.
@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review?
I think the discrepancy between `local_copy_list` and `remote_copy_list` in allowing the absence of parent folders is fine, although we could also allow this for `remote_copy_list` by creating the parent folders as we do for `local_copy_list`.
The fact that `local_copy_list` copies the contents of a directory instead of the directory itself seems somewhat strange. I suppose this was necessary to maintain backwards compatibility with the previous approach of merging the sandbox folder and `local_copy_list`?
> The fact that `local_copy_list` copies the contents of a directory instead of the directory itself seems somewhat strange
If and only if the target directory already exists right? Otherwise it just copies as is.
The problem here is indeed really the fact that the old implementation did not go directly through the `Transport` but had an intermediate merging step that used a different API, and so we now kind of have to maintain that behavior in order not to break existing plugins, as demonstrated by `PwCalculation` breaking.
> If and only if the target directory already exists right? Otherwise it just copies as is.
Not sure if that's true, see my example near the end of https://github.com/aiidateam/aiida-core/pull/6348#discussion_r1594585968
> Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the `pseudo` folder would be copied inside an already existing one, causing QE to crash:
Just a note that I also stumbled on this behavior now, and installing from the PR here fixes it. So I'll try to dedicate some time to reviewing the code and reading the discussion, maybe that could be helpful for merging.
I tested this PR for a while with aiida-quantumespresso, and so far I have not found any issues in the code behavior. I confirm that the main branch is completely broken. Can you add a test that covers this case before merging? Thank you
> Can you add a test that covers this case before merging?
~~This PR includes a regression test already~~ Seems I have removed it in one of the iterations. I will see to adding it again.
@sphuber as discussed, I've written a bunch of tests to check the edge cases I've been studying, first based on the latest release tag (v2.5.1):
https://github.com/mbercx/aiida-core/blob/test/upload-calc/tests/engine/daemon/test_execmanager.py
These should all pass for the previous implementation, see
https://github.com/mbercx/aiida-core/actions/runs/9285274053/job/25549435202
I've then added the same tests as a commit on top of this PR:
https://github.com/mbercx/aiida-core/blob/test/upload-calc-order-fix/tests/engine/daemon/test_execmanager.py
Here, 3 tests fail:
https://github.com/mbercx/aiida-core/actions/runs/9285404390/job/25549825351
1. The sandbox folder is created with the `pseudo` folder and the `local_copy_list` tries to copy the contents of a `SingleFileData` to the target folder, see here. This would fail with an `IsADirectoryError` in v2.5.1; now it creates an unexpected file hierarchy:

   ```
   >       assert serialize_file_hierarchy(filepath_workdir) == expected_hierarchy
   E       AssertionError: assert {'pseudo': {'... 'Ba pseudo'}} == {'pseudo': {'... 'Ba pseudo'}}
   E         Differing items:
   E         {'pseudo': {'pseudo': 'Ba pseudo'}} != {'pseudo': {'Ba.upf': 'Ba pseudo'}}
   E         Use -v to get more diff
   ```

   I.e. a nested `pseudo` folder.

2. The sandbox folder is created with the `pseudo` folder and the `local_copy_list` tries to copy the contents of the `pseudo` folder in a `FolderData` to the target `pseudo` folder, see here. Here also a nested hierarchy (`{'pseudo': {'pseudo': {'Ba.upf': 'Ba pseudo'}}}`) is created instead of the previous one (`{'pseudo': {'Ba.upf': 'Ba pseudo'}}`).

3. Same as (2), but this time with a nested folder.
So the main point now is that the current copying behaviour of the PR does not preserve that of v2.5.1. I'll see if I can fix that first.
Ok, I've extended the number of tests, added tests for the SSH transport, and managed to get it down to one failure (for SSH):
https://github.com/mbercx/aiida-core/actions/runs/9345893736/job/25719640192
See all changes here: https://github.com/mbercx/aiida-core/commit/fa0e9c5fc175e8d7cff34da6705fb80f1673d6b8
It seems the issue is that the local and SSH transports return different errors when trying to remotely copy a file to a directory that doesn't yet exist. The local transport raises a `FileNotFoundError` from the `shutil.copy` command:
https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/transports/plugins/local.py#L565
The SSH transport raises an `OSError` instead:
https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/transports/plugins/ssh.py#L1191-L1195
For some reason, the engine deals with these two errors differently:
https://github.com/aiidateam/aiida-core/blob/acec0c190cbb45ba267c6eb8ee7ceba18cf3302b/src/aiida/engine/daemon/execmanager.py#L292-L304
I'm now trying to figure out why these two are treated differently.
EDIT: the reason is explained here: https://github.com/aiidateam/aiida-core/commit/101a8d61ba1c9f50a0231cd249c5a7f7ff1d77a4
So, similar to the source file not existing, if the target path is a file in a directory that doesn't exist, the failure will not be transient, and hence we should have the SSH transport return a `FileNotFoundError` to be consistent with the local transport.
> I'm now trying to figure out why these two are treated differently.
The `OSError` is a broad catch-all exception. `FileNotFoundError` is a subclass of `OSError` and more specific. When copying remote files from the `remote_copy_list`, we don't want to raise when the source does not exist, but just warn. If the copy fails for another reason, the problem may be more severe, and so it makes sense to let the calculation error out.
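The distinction is just Python's exception hierarchy at work; a minimal illustration (the `classify` helper is hypothetical, mirroring the engine's two branches):

```python
# FileNotFoundError is a subclass of OSError, so `except OSError` also
# catches it; handling the specific class first lets the engine merely warn
# on a missing source while still failing hard on other OS-level errors.
assert issubclass(FileNotFoundError, OSError)

def classify(exc: OSError) -> str:
    """Hypothetical helper mirroring the engine's two error branches."""
    if isinstance(exc, FileNotFoundError):
        return 'warn: source does not exist'
    return 'raise: copy failed for another reason'

print(classify(FileNotFoundError()))  # warn: source does not exist
print(classify(PermissionError()))    # raise: copy failed for another reason
```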
What we should do is check for the `SshTransport` what kind of `OSError` it is throwing. You can figure this out through the `errno` attribute of the exception. You can use the constants in the `errno` module to figure out the meaning. For example:
```python
import errno
import os

filename = 'some_file.txt'  # example path

try:
    os.remove(filename)
except OSError as e:
    if e.errno == errno.ENOENT:
        print('File does not exist')
```
See this PEP as well for details.
If the reason the `SshTransport` is raising is that the file is missing, we should have the transport also raise the more specific `FileNotFoundError`. Then the test will pass.
Thanks @sphuber! Unfortunately, the `.errno` attribute of the exception that is raised seems to be `None`...
Instead, I've added a check to see if the parent destination directory exists in the `SshTransport.copy()` method, see:
https://github.com/mbercx/aiida-core/commit/011f5f7b4dc911528b974d0427d2d2fb076add0d#diff-1284fda2c2014c8c418475c99a82308d82ab96cb33e6c49d978c0031f3b1ef3d
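A hedged sketch of the idea (hypothetical wrapper, not the exact diff): check the destination's parent on the remote before attempting the copy, and raise `FileNotFoundError` so both transports fail in the same way:

```python
import posixpath

def copy_with_parent_check(isdir, copy, remotesource: str, remotedestination: str) -> None:
    """Hypothetical wrapper: ``isdir`` and ``copy`` stand in for the transport's
    own methods. Fail early with FileNotFoundError, matching the local transport,
    when the destination's parent directory does not exist on the remote."""
    parent = posixpath.dirname(remotedestination)
    if parent and not isdir(parent):
        raise FileNotFoundError(f'destination parent `{parent}` does not exist')
    copy(remotesource, remotedestination)
```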
It's more patchy for sure, but at least my tests now all pass locally. 😅 I'm praying they also pass on GitHub:
https://github.com/mbercx/aiida-core/actions/runs/9347879530/job/25725810490
Also double-checked my branch adding these tests on v2.5.1, and can confirm they all pass, save for the same error discussed above.
My suggestion for proceeding is to split up the PR in 3 commits:

- `SshTransport`: return `FileNotFoundError` if the destination parent does not exist
- Transports: overwrite existing folders if `overwrite == True`
- Engine: recover behavior for `upload_calculation` after refactor
Superseded by #6447