sos
sos copied to clipboard
Revisit the path-translation mechanism
We have a path conversion system that assumes that the task can be run on multiple remote systems if the paths are properly converted. The problem here is that it is not immediately clear that what paths will be converted and what will be used intact
So instead of
task: workdir='/local/path'
and let SoS to convert it to
task: workdir='/remote/path'
it would be much simpler and clearer to do
task: workdir='{home}/path`
where home can take different values on local and remote hosts.
Here home should be defined as one of the paths defined in host configuration file and it would be a natural limitation that the task cannot be executed on hosts without home definition.
Proposed interface
-
A function
path(name='home', default='whatever')is defined to retrieve paths defined in~/.sos/hosts.ymlfor local and remote hosts. The idea is that this function will return different values when the task is executed locally or on a certain remote host. -
Nothing needs to be changed for local tasks.
-
Cross-system tasks should be written as
task:
run: workdir=path(name='scratch')
sh: expand=True
rsync {path(name='data')}/mydata {path(name='scratch')}/mydata
to avoid hard-coded path names. When the task is executed on any host, the appropriate scratch, data values will be extracted from host definition. No implicit path conversion will be needed*.
- For tasks that are written for some remote host, Users can hard code paths without worry about configuration or implicit path conversion.
task:
run: workdir=`/path/to/scratch'
sh: expand=True
rsync /path/to/mydata /path/to/scratch/mydata
5 (TBD). _input and _output was synchronized by default, and we can keep doing that. The difference is that the local and remote versions of the path will be determined by its use of HOST. The same local and remote paths will be assumed if HOST is not used.
- (TBD).
remote()will continue to be used to say the target is remote only so no synchronization is needed.
So basically I am suggesting that
- If
localhostis defined, thepathswill be made available as runtime variables to be used in scripts. - When the task is submitted to remote host, the
pathswill be replaced with values for the remote hosts.
The input, output synchronization etc will remain so the only change is that instead of asking users to write paths in local path name or remote path name, we ask them to use variables that will take different values on different hosts.
I am wondering if we should make the paths more explicit. That is to say, instead of making
{'home': '/Users/user'}
part of the dictionary and use
input: f'{home}/whatever'
directly, perhaps it is clearer to do
input: f'{host.home}/whatever`
or
input: f'{HOST["home"]}/whatever`
An advantage of this approach is also that if some path is not available, we could assign a default path to it (e.g. $HOME).
I think we already have a default variable "CONFIG", don't we? Then users can put in CONFIG['home'] for this purpose?
With #1219, CONFIG is now pickled to remote host so that CONFIG is the same across local and remote hosts. The idea of this ticket is that we can make HOST a host-dependent variable so that tasks can adapt to different hosts in a more explicit way.
Using home directly sounds like a bad idea.
The problem with CONFIG is the configuration is
CONFIG['hosts'][ XXXX ]['paths']
where XXXX should be CONFIG['localhost'] for localhost, and host for some remote host specified by -q, so we need some host-dependent shortcut to`CONFIG['hosts'][XXXX]['paths']. Maybe we can post-generate
CONFIG['host']['paths']['home']
or simply
CONFIG['host']['home']
even
CONFIG['home']
since users do not need to access other host information (for now).
So I seem to incline to
HOST['home']
since users can write
HOST.get('scratch', '/some/other/way')
if they are not sure if scratch is defined on certain remote host.
Well, what else does the variable HOST do other than for this feature? We will now have to introduce another reserved keyword HOST. If it does not do much more than CONFIG['host']['home'] (or CONFIG['host'].get(...)) it perhaps cleaner to keep all configurations in CONFIG with extensions to host.
The only problem is that
CONFIG['host']['home']
is 10 character longer than
HOST['home']
Then will the same information be available in CONFIG['host'][xx]['home']? Possibly not because it is static? What happens if users do:
[global]
home = CONFIG['host']['home']
or
[global]
home = HOST['home']
and use home after -- will it be static or dynamic with respect to host environment? If it is the latter, then I'd rather take this approach if I need brevity, than to introducing a new variable only as a short cut to an existing variable.
Good suggestion but will not work because things defined in global will be frozen (execute only once). Right now, I
- load config files
- execute global section
- save the result and pass it around (including remote host).
The new feature will add
- After the global section is loaded, which includes
CONFIG, extractCONFIG['hosts'][where-am-i]['paths']to some variable.
In another word, your home will always be configuration for localhost.
but will not work because things defined in global will be frozen (execute only once).
Yes this is what I worry will happen that might confuse users. But from what you describe, if we modify CONFIG directly it should not happen?
>>> config = dict(a=dict(host='me'))
>>> host = config['a']
>>> host
{'host': 'me'}
>>> config['a']['host'] = 'them'
>>> host
{'host': 'them'}
Of course if users set:
host = config['a']['host']
directly, then there is possibly little we can do ...
We can change CONFIG but there is no guarantee that user derived variables will be changed accordingly. I mean, if we define CONFIG['host'] as a dynamic dictionary for local and remote host,
HOST = CONFIG['host']
will likely work because HOST is a reference to CONFIG['host'], but
home = CONFIG['host']['home']
will unlikely work because home can be a copy of the string, and even the first might not work because I do not know if pickle keeps the reference.
So to avoid misuse of this variable, perhaps we can implement it as a function, something like
host_path('home')
and
host_path('scratch', 'default')
We can even reuse the path type and do something like
path(name='home', host='localhost')
Yes I think this is the cleaner solution, if there is no easy way to bypass the problem we've discussed above.
OK, choice of name then,
named_path('home')
host_path('home')
path_on_host('home')
path(name='home')
All of them will return the named path for the local host or the host on which the task is executed. The path interface has the advantage that the returned value will be a path.
We can even reuse the path type and do something like
I see the benefit to eliminate new functions / targets. But the syntax looks a bit confusing.
Will syntax such as
CONFIG['host'].get(...)
help with the situation? Basically we will treat CONFIG['host'] a class with its own methods and conventions? A new function / target makes it cleaner, but it seems a very narrowly scoped function / target compared to the existing list of SoS function / targets.
We tried to add some functions to CONFIG to freeze it (e.g. not modifiable or access items as attributes) but ended up using a plain dict. Part of the reason was that dict is much easier to pickle around.
Thinking more about it, path() returns a path, and we are simply adding the ability to get pre-defined system paths from their names, so it looks like a natural extension to me.
Oh okay ... then I would think of reusing and extending path. It's just the syntax was not very clear to me -- are we setting or getting paths? Even path().get('home') feels a bit better in helping me thinking about what's going on.
We are getting (not setting) CONFIG['hosts'][where-am-i]['paths'][name], which is readonly and is defined by users in ~/.sos/hosts.yml. The goal is to tell users how to write host-independent (mostly) script, which can be used beyond remote task but could also be abused. Something like
- users define hosts in
~/.sos/hosts.ymland definepathsfor different hosts. - user can use
path(name='scratch')to get these host-dependent paths and use in their scripts. E.g.
[1]
run: workdir=path(name='scratch', default='/tmp')
- Then the logics can be naturally extended to
tasks
task: workdir=path(name='scratch')
when the task is submitted to remote host, on which scratch is different from local host.
path().get('home') is bad because path() is a valid path. path.get('home') is decent because we are calling a static function of a class to return an instance of path, but static function is alien to novice Python users and path(name='home') can be easier to remember.
I agree. At this point I cannot think of better conventions and I still find it reasonable to extend path. Hopefully implementing what you've proposed with clear documentation will serve the purpose!
https://vatlab.github.io/sos-docs/doc/user_guide/remote_tasks.html documents the general ideas (not working yet). Basically, the goal is to make writing remote tasks easy.
-
Tasks can use remote paths directly without worrying about translation. Users are expected to use
queue=directly withtask. This will be the main method recommended to (novice) users. -
Portable tasks using
path(name)will be considered an advanced feature, in which case-qcan be used.
Things are getting complicated because _input etc are implicitly converted so we have
input:
_input in local format
task: host='remote'
translated _input
This change of context could be confusing...
I have reverted d6409c152ab67750c0a09e1771958b88300084ba which was supposed to be in the branch.
Here is the number one case when I use remote task:
input: for_each=dict(ID=IDs)
info_of_sample = from_a_local_db(ID)
task:
sh: expand=True
process /remote/path/with/{ID} with {info_of_sample}
That is to say, I use local information and hard code remote path to create a script that will be executed remotely. _input is empty and there is no file synchronization.
I have been thinking more on this and I admit that I do not have any better idea. I think,
- We should allow users to use remote paths directly for remote tasks. We used to ask users to use local path for
workdirand this should be changed. I mean, instead of
task: workdir='/path/that/will/be/translated'
the new interpretation
task: workdir='/remote/path'
will work more naturally for most cases when there is no need for file synchronization.
- For remote tasks with
group_byetc, it is easier to do
input: for_each =dict(ID=IDs)
task:
/path/to/{ID}
than
input: remote('/path/to/{ID}')
-
If file synchronization is needed, I cannot find a simpler syntax than our existing automatic path-conversion mechanism (that converts
_inputand_output). -
For tasks to be "portable", our new
path(name='home')can be used.