sos icon indicating copy to clipboard operation
sos copied to clipboard

Revisit the path-translation mechanism

Open BoPeng opened this issue 6 years ago • 24 comments

We have a path conversion system that assumes that the task can be run on multiple remote systems if the paths are properly converted. The problem here is that it is not immediately clear that what paths will be converted and what will be used intact

So instead of

task: workdir='/local/path'

and let SoS to convert it to

task: workdir='/remote/path'

it would be much simpler and clearer to do

task: workdir='{home}/path`

where home can take different values on local and remote hosts.

Here home should be defined as one of the paths defined in host configuration file and it would be a natural limitation that the task cannot be executed on hosts without home definition.

Proposed interface

  1. A function path(name='home', default='whatever') is defined to retrieve paths defined in ~/.sos/hosts.yml for local and remote hosts. The idea is that this function will return different values when the task is executed locally or on a certain remote host.

  2. Nothing needs to be changed for local tasks.

  3. Cross-system tasks should be written as

task: 
run: workdir=path(name='scratch')
sh: expand=True
    rsync {path(name='data')}/mydata {path(name='scratch')}/mydata

to avoid hard-coded path names. When the task is executed on any host, the appropriate scratch, data values will be extracted from host definition. No implicit path conversion will be needed*.

  1. For tasks that are written for some remote host, Users can hard code paths without worry about configuration or implicit path conversion.
task: 
run: workdir=`/path/to/scratch'
sh: expand=True
    rsync /path/to/mydata /path/to/scratch/mydata

5 (TBD). _input and _output was synchronized by default, and we can keep doing that. The difference is that the local and remote versions of the path will be determined by its use of HOST. The same local and remote paths will be assumed if HOST is not used.

  1. (TBD). remote() will continue to be used to say the target is remote only so no synchronization is needed.

BoPeng avatar Mar 11 '19 00:03 BoPeng

So basically I am suggesting that

  1. If localhost is defined, the paths will be made available as runtime variables to be used in scripts.
  2. When the task is submitted to remote host, the paths will be replaced with values for the remote hosts.

The input, output synchronization etc will remain so the only change is that instead of asking users to write paths in local path name or remote path name, we ask them to use variables that will take different values on different hosts.

BoPeng avatar Mar 11 '19 02:03 BoPeng

I am wondering if we should make the paths more explicit. That is to say, instead of making

{'home': '/Users/user'}

part of the dictionary and use

input: f'{home}/whatever'

directly, perhaps it is clearer to do

input: f'{host.home}/whatever`

or

input: f'{HOST["home"]}/whatever`

An advantage of this approach is also that if some path is not available, we could assign a default path to it (e.g. $HOME).

BoPeng avatar Mar 11 '19 13:03 BoPeng

I think we already have a default variable "CONFIG", don't we? Then users can put in CONFIG['home'] for this purpose?

gaow avatar Mar 11 '19 14:03 gaow

With #1219, CONFIG is now pickled to remote host so that CONFIG is the same across local and remote hosts. The idea of this ticket is that we can make HOST a host-dependent variable so that tasks can adapt to different hosts in a more explicit way.

Using home directly sounds like a bad idea.

The problem with CONFIG is the configuration is

CONFIG['hosts'][ XXXX ]['paths']

where XXXX should be CONFIG['localhost'] for localhost, and host for some remote host specified by -q, so we need some host-dependent shortcut to`CONFIG['hosts'][XXXX]['paths']. Maybe we can post-generate

CONFIG['host']['paths']['home']

or simply

CONFIG['host']['home']

even

CONFIG['home']

since users do not need to access other host information (for now).

BoPeng avatar Mar 11 '19 14:03 BoPeng

So I seem to incline to

HOST['home']

since users can write

HOST.get('scratch', '/some/other/way')

if they are not sure if scratch is defined on certain remote host.

BoPeng avatar Mar 11 '19 14:03 BoPeng

Well, what else does the variable HOST do other than for this feature? We will now have to introduce another reserved keyword HOST. If it does not do much more than CONFIG['host']['home'] (or CONFIG['host'].get(...)) it perhaps cleaner to keep all configurations in CONFIG with extensions to host.

gaow avatar Mar 11 '19 15:03 gaow

The only problem is that

CONFIG['host']['home']

is 10 character longer than

HOST['home']

BoPeng avatar Mar 11 '19 16:03 BoPeng

Then will the same information be available in CONFIG['host'][xx]['home']? Possibly not because it is static? What happens if users do:

[global]
home = CONFIG['host']['home']

or

[global]
home = HOST['home']

and use home after -- will it be static or dynamic with respect to host environment? If it is the latter, then I'd rather take this approach if I need brevity, than to introducing a new variable only as a short cut to an existing variable.

gaow avatar Mar 11 '19 16:03 gaow

Good suggestion but will not work because things defined in global will be frozen (execute only once). Right now, I

  • load config files
  • execute global section
  • save the result and pass it around (including remote host).

The new feature will add

  • After the global section is loaded, which includes CONFIG, extract CONFIG['hosts'][where-am-i]['paths'] to some variable.

In another word, your home will always be configuration for localhost.

BoPeng avatar Mar 11 '19 16:03 BoPeng

but will not work because things defined in global will be frozen (execute only once).

Yes this is what I worry will happen that might confuse users. But from what you describe, if we modify CONFIG directly it should not happen?

>>> config = dict(a=dict(host='me'))
>>> host = config['a']
>>> host
{'host': 'me'}
>>> config['a']['host'] = 'them'
>>> host
{'host': 'them'}

Of course if users set:

host  = config['a']['host']

directly, then there is possibly little we can do ...

gaow avatar Mar 11 '19 16:03 gaow

We can change CONFIG but there is no guarantee that user derived variables will be changed accordingly. I mean, if we define CONFIG['host'] as a dynamic dictionary for local and remote host,

HOST = CONFIG['host']

will likely work because HOST is a reference to CONFIG['host'], but

home = CONFIG['host']['home']

will unlikely work because home can be a copy of the string, and even the first might not work because I do not know if pickle keeps the reference.

BoPeng avatar Mar 11 '19 16:03 BoPeng

So to avoid misuse of this variable, perhaps we can implement it as a function, something like

host_path('home')

and

host_path('scratch', 'default')

We can even reuse the path type and do something like

path(name='home', host='localhost')

BoPeng avatar Mar 11 '19 18:03 BoPeng

Yes I think this is the cleaner solution, if there is no easy way to bypass the problem we've discussed above.

gaow avatar Mar 11 '19 18:03 gaow

OK, choice of name then,

named_path('home')
host_path('home')
path_on_host('home')
path(name='home')

All of them will return the named path for the local host or the host on which the task is executed. The path interface has the advantage that the returned value will be a path.

BoPeng avatar Mar 11 '19 18:03 BoPeng

We can even reuse the path type and do something like

I see the benefit to eliminate new functions / targets. But the syntax looks a bit confusing.

Will syntax such as

CONFIG['host'].get(...)

help with the situation? Basically we will treat CONFIG['host'] a class with its own methods and conventions? A new function / target makes it cleaner, but it seems a very narrowly scoped function / target compared to the existing list of SoS function / targets.

gaow avatar Mar 11 '19 18:03 gaow

We tried to add some functions to CONFIG to freeze it (e.g. not modifiable or access items as attributes) but ended up using a plain dict. Part of the reason was that dict is much easier to pickle around.

Thinking more about it, path() returns a path, and we are simply adding the ability to get pre-defined system paths from their names, so it looks like a natural extension to me.

BoPeng avatar Mar 11 '19 19:03 BoPeng

Oh okay ... then I would think of reusing and extending path. It's just the syntax was not very clear to me -- are we setting or getting paths? Even path().get('home') feels a bit better in helping me thinking about what's going on.

gaow avatar Mar 11 '19 19:03 gaow

We are getting (not setting) CONFIG['hosts'][where-am-i]['paths'][name], which is readonly and is defined by users in ~/.sos/hosts.yml. The goal is to tell users how to write host-independent (mostly) script, which can be used beyond remote task but could also be abused. Something like

  1. users define hosts in ~/.sos/hosts.yml and define paths for different hosts.
  2. user can use path(name='scratch') to get these host-dependent paths and use in their scripts. E.g.
[1]
run: workdir=path(name='scratch', default='/tmp')
  1. Then the logics can be naturally extended to tasks
task: workdir=path(name='scratch')

when the task is submitted to remote host, on which scratch is different from local host.

BoPeng avatar Mar 11 '19 19:03 BoPeng

path().get('home') is bad because path() is a valid path. path.get('home') is decent because we are calling a static function of a class to return an instance of path, but static function is alien to novice Python users and path(name='home') can be easier to remember.

BoPeng avatar Mar 11 '19 19:03 BoPeng

I agree. At this point I cannot think of better conventions and I still find it reasonable to extend path. Hopefully implementing what you've proposed with clear documentation will serve the purpose!

gaow avatar Mar 11 '19 19:03 gaow

https://vatlab.github.io/sos-docs/doc/user_guide/remote_tasks.html documents the general ideas (not working yet). Basically, the goal is to make writing remote tasks easy.

  1. Tasks can use remote paths directly without worrying about translation. Users are expected to use queue= directly with task. This will be the main method recommended to (novice) users.

  2. Portable tasks using path(name) will be considered an advanced feature, in which case -q can be used.

BoPeng avatar Mar 13 '19 13:03 BoPeng

Things are getting complicated because _input etc are implicitly converted so we have

input:

_input in local format

task: host='remote'

translated _input

This change of context could be confusing...

BoPeng avatar Mar 13 '19 23:03 BoPeng

I have reverted d6409c152ab67750c0a09e1771958b88300084ba which was supposed to be in the branch.

Here is the number one case when I use remote task:

input: for_each=dict(ID=IDs)

info_of_sample = from_a_local_db(ID)

task:
sh: expand=True
process /remote/path/with/{ID} with {info_of_sample}

That is to say, I use local information and hard code remote path to create a script that will be executed remotely. _input is empty and there is no file synchronization.

BoPeng avatar Mar 14 '19 00:03 BoPeng

I have been thinking more on this and I admit that I do not have any better idea. I think,

  1. We should allow users to use remote paths directly for remote tasks. We used to ask users to use local path for workdir and this should be changed. I mean, instead of
task: workdir='/path/that/will/be/translated'

the new interpretation

task: workdir='/remote/path'

will work more naturally for most cases when there is no need for file synchronization.

  1. For remote tasks with group_by etc, it is easier to do
input: for_each =dict(ID=IDs)

task:
/path/to/{ID}

than

input: remote('/path/to/{ID}')
  1. If file synchronization is needed, I cannot find a simpler syntax than our existing automatic path-conversion mechanism (that converts _input and _output).

  2. For tasks to be "portable", our new path(name='home') can be used.

BoPeng avatar Mar 22 '19 19:03 BoPeng