
`CalcJob`: add support to `retrieve_list` to use globbing that maintains source file hierarchy

Open sphuber opened this issue 1 year ago • 4 comments

The current syntax for the retrieve_list supports globbing in the tuple variant; however, it forces the user to specify a depth, which indicates the number of nesting levels to keep. Often, though, the user simply wants to maintain the file hierarchy of the remote and does not want to remove any level of nesting. This is not possible with the current syntax.

Imagine the following file hierarchy in the remote working directory:

├─ sub_a
│   ├─ vasprun.xml
│   └─ CHGCAR
├─ sub_b
│   ├─ some_file.xml
│   └─ CHGCAR
.
.

There should be a syntax that allows retrieving all XML files while maintaining the folder hierarchy. So we want to end up with:

├─ sub_a
│   └─ vasprun.xml
├─ sub_b
│   └─ some_file.xml
.
.

I propose we allow the depth to be set to None which would accomplish the above use-case, i.e.:

retrieve_list = [('*/*.xml', '.', None)]

Unfortunately, we would probably have to keep the second and third elements. At best we could reduce it to a two-element tuple

retrieve_list = [('*/*.xml', '.')]

and the None would be implied.
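To make the intended result concrete, here is a minimal sketch of the "mirror the hierarchy" behaviour using only the standard library on a local temporary directory. It is not the aiida-core implementation and does not use a transport; the names `retrieve_mirrored` and `demo` are hypothetical and exist only for this illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def retrieve_mirrored(remote_root: Path, pattern: str, local_root: Path) -> List[str]:
    """Copy files matching ``pattern`` and keep their path relative to ``remote_root``.

    Hypothetical helper: it mimics what a depth of ``None`` is proposed to do.
    """
    copied = []
    for source in sorted(remote_root.glob(pattern)):
        # Keep every folder level between the root and the file.
        relative = source.relative_to(remote_root)
        target = local_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(relative.as_posix())
    return copied


def demo() -> List[str]:
    """Build the example hierarchy from this issue and retrieve the XML files."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote"
        local = Path(tmp) / "local"
        for name in ("sub_a/vasprun.xml", "sub_a/CHGCAR", "sub_b/some_file.xml", "sub_b/CHGCAR"):
            path = remote / name
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("")
        return retrieve_mirrored(remote, "*/*.xml", local)
```

Calling `demo()` returns the relative paths `sub_a/vasprun.xml` and `sub_b/some_file.xml`, i.e. the XML files with their parent folders preserved and the `CHGCAR` files left behind.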

sphuber avatar Sep 21 '22 19:09 sphuber

@astamminger @ltalirz @giovannipizzi

sphuber avatar Sep 21 '22 19:09 sphuber

I am in favor.

Perhaps others can comment on whether None is self-explanatory here, or whether a more expressive label like MIRROR or FULL would make the list specification more intuitive to read.

ltalirz avatar Sep 21 '22 20:09 ltalirz

I personally interpret depth like a feature. Coming from that perspective, FULL complements this feature while None would mean hey, please disable this feature! So, basically both of them make sense to me.

However, if I had to decide, I would opt for None because, you know, KISS :)

astamminger avatar Sep 21 '22 21:09 astamminger

Sounds good to me then

ltalirz avatar Sep 22 '22 18:09 ltalirz

Can you please comment on what the behaviour would be if one sets depth to zero instead? I admit it's not clear to me (I didn't check the code, sorry), but in any case this is useful to clarify, as we need to describe it in the docs.

giovannipizzi avatar Oct 19 '22 09:10 giovannipizzi

If you have the hierarchy

├─ sub_a
│   ├─ vasprun1.xml
│   ├─ vasprun2.xml
│   └─ vasprun3.xml
.

and you specify

retrieve_list = [('sub_a/vasprun*.xml', '.', 0)]

it would retrieve all *.xml files but without any leading folders:

├─ vasprun1.xml
├─ vasprun2.xml
└─ vasprun3.xml

So 0 will strip all leading nested folders.

sphuber avatar Oct 19 '22 09:10 sphuber

Ah ok, so if I understand correctly, it's the number of levels "to be kept", so if I put 1, I get the same as None in the example above? (And the issue is that we don't know what the correct depth should be?)

Or equivalently, if I have sub_a/sub_b/file.xml I get only file.xml with 0, and sub_b/file.xml with 1?

If this is correct, then I see the need for None. I was thinking the depth was instead the number of folders to remove from the front, so 0 was equivalent to None, 1 to strip the parent folder, 2 to strip the two parent folders etc (but I think this is not what is happening?)

giovannipizzi avatar Oct 19 '22 21:10 giovannipizzi

Indeed, the depth specifies the number of levels to keep, as you say, and so the problem is that currently you need to know what the depth is in order to keep everything, whereas that is not always known a priori.
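The semantics discussed in this thread can be summarised in a small sketch. It assumes, as described in the comments above, that the depth counts the parent folder levels to keep (0 keeps only the file name) and that None keeps the full relative path. The function `local_target` is a hypothetical name for illustration and is not taken from the aiida-core code.

```python
from pathlib import PurePosixPath
from typing import Optional


def local_target(remote_relpath: str, depth: Optional[int]) -> str:
    """Return the relative path a retrieved file would get for a given depth.

    Assumption from this thread: ``depth`` is the number of parent folders
    to keep, and ``None`` means "keep all of them".
    """
    parts = PurePosixPath(remote_relpath).parts
    if depth is None:
        kept = parts
    elif depth < 0:
        raise ValueError("depth must be None or a non-negative integer")
    else:
        # The file name itself plus ``depth`` parent folders.
        kept = parts[-(depth + 1):]
    return str(PurePosixPath(*kept))
```

With `sub_a/sub_b/file.xml` as input, a depth of 0 gives `file.xml`, a depth of 1 gives `sub_b/file.xml`, and None gives `sub_a/sub_b/file.xml` without the caller having to know how deep the file is nested.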

sphuber avatar Oct 20 '22 05:10 sphuber