ontology-development-kit
Facilitate or at least document how to share the OAK cache
The Ontology Access Kit (OAK, aka oaklib from Python’s point of view, aka runoak from the command line’s point of view) is one of the tools/libraries provided by the ODK.
In fact, the ODK is supposedly one of the easiest ways for “non-technical” users to get access to OAK, because installing Python programs is still too difficult for many people.
When OAK is used to access online resources (for example with -i sqlite:obo:uberon, which accesses a pre-built SQLite version of Uberon), it attempts to cache a copy of those resources in the local filesystem, to avoid re-downloading them on every call. The default location for the cache is ~/.data, or the value of the PYSTOW_HOME environment variable if such a variable is set.
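For example, with a plain OAK installation outside of the ODK, the cache can be redirected to a custom directory just by setting that variable (the /tmp/oak-cache path and the UBERON term below are purely illustrative):
PYSTOW_HOME=/tmp/oak-cache runoak -i sqlite:obo:uberon info UBERON:0002101
On a second invocation with the same setting, OAK should find the already-downloaded SQLite file under /tmp/oak-cache/oaklib and skip the download.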
(As an aside, defaulting to such a generic name under the user’s home directory is a terrible move, but that’s a deliberate decision that’s unlikely to ever change.)
Now when OAK is used from the ODK, the ~/.data directory is within the Docker container. So any file that OAK stores there will only exist for as long as the container itself exists. That means that when people run several OAK commands like this:
sh run.sh runoak -i sqlite:obo:uberon command1 ...
sh run.sh runoak -i sqlite:obo:uberon command2 ...
sh run.sh runoak -i sqlite:obo:uberon command3 ...
none of these commands will benefit from the cache. They will all download a fresh copy of Uberon.
One workaround is of course to run a shell within a container instead of running runoak directly, and then invoke runoak from that shell:
sh run.sh bash
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command1 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command2 ...
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command3 ...
But that is not really a satisfying solution, as it will still lead to Uberon being re-downloaded every time the user starts working with it, even if they already downloaded it the day before.
It is possible to configure the ODK to make the local cache visible from the container by “binding” the ~/.data directory from the local filesystem to the /home/odkuser/.data directory within the container, by adding the following in the src/ontology/run.sh.conf file:
ODK_BINDS=~/.data:/home/odkuser/.data
(This will only work once #1050 has been fixed.)
Another solution would be to set the PYSTOW_HOME variable to a directory within the repository (most likely somewhere under src/ontology/tmp), which is already bound to a mount point within the container. That would at least allow sharing the cache between ODK/OAK invocations that are run from within the same repository.
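For example, from a shell inside the container (as in the workaround above), something like this should work; the tmp/pystow subdirectory name is only illustrative, not an established convention:
odkuser@abe8c94b5e84:/work/src/ontology$ export PYSTOW_HOME=/work/src/ontology/tmp/pystow
odkuser@abe8c94b5e84:/work/src/ontology$ runoak -i sqlite:obo:uberon command1 ...
Since src/ontology/tmp is part of the repository checkout, the cached files would then survive across containers started from that repository.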
At the very least, the ODK should provide documentation on how to do that.
Should the ODK try to do that automatically? I am on the fence here. On one hand, it’d be nice for users if the OAK cache could work “out of the box” without any extra configuration. On the other hand, the ODK container is supposed to shield the local filesystem (except the actual repository) from any side-effects (everything that happens in the container stays in the container), so it may not be a good idea to silently break a hole through the container’s wall: what if an interrupted download corrupts the cache? Users could expect that to have no consequence, since the command was run inside a container. Except that no, the cache is actually outside the container, so you’ve just corrupted your actual cache, oops!
Thoughts?
From experimenting with supporting this in the ODKRunner, I think we can do the following:
If the ODK_SHARE_OAK_CACHE variable is set (in the environment or in the run.sh.conf file), it is expected to point to the OAK cache directory. Then, we simply bind that directory to the /home/odkuser/.data/oaklib directory within the container (or /root/.data/oaklib, if we are running as root), so that any OAK process started from within the container can access the cache.
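As a sketch of what this would look like for a user with a default Pystow setup (assuming the variable ends up being implemented as proposed and can be set in run.sh.conf), the whole configuration would be a single line:
ODK_SHARE_OAK_CACHE=~/.data/oaklib
and the runner would then add the equivalent of a -v ~/.data/oaklib:/home/odkuser/.data/oaklib bind mount to its docker invocation.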
For a little bit more ease of use, we could also support two special values for ODK_SHARE_OAK_CACHE:
(A) If ODK_SHARE_OAK_CACHE is set to user, then we automatically find the OAK cache directory, regardless of how Pystow is configured.
It wouldn’t be hard to do, but it would clutter the run.sh script quite a bit, because we’d need to basically replicate Pystow’s logic to determine the location of the cache directory.
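Roughly, that replication would look something like the following sketch (simplified; the appdirs branch in particular is only a guess at the typical Linux location, not Pystow’s exact logic):
# Approximate Pystow lookup, for illustration only
if [ -n "$OAKLIB_HOME" ]; then
    oak_cache="$OAKLIB_HOME"
elif [ -n "$PYSTOW_HOME" ]; then
    oak_cache="$PYSTOW_HOME/oaklib"
elif [ -n "$PYSTOW_USE_APPDIRS" ]; then
    # Platform-dependent; shown here as the usual Linux data directory
    oak_cache="${XDG_DATA_HOME:-$HOME/.local/share}/oaklib"
else
    oak_cache="$HOME/.${PYSTOW_NAME:-data}/oaklib"
fi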
I am on the fence about whether this is really useful or not. The only way the OAK cache directory could be elsewhere than in ~/.data/oaklib is if people have explicitly told Pystow to use another directory (by playing with the OAKLIB_HOME, PYSTOW_HOME, PYSTOW_NAME, or PYSTOW_USE_APPDIRS variables), and if they have done that then they know exactly where the cache directory is, and they can explicitly set ODK_SHARE_OAK_CACHE to the correct path.
(B) If ODK_SHARE_OAK_CACHE is set to repo, then the cache directory is assumed to be the src/ontology/tmp/oaklib directory of the current repo (the case of a user who would like to share the cache across multiple invocations of the ODK in the same repo, but not across all their repos).
This does not necessarily make things easier (it would be equivalent to ODK_SHARE_OAK_CACHE=$PWD/tmp/oaklib, which is not much harder than ODK_SHARE_OAK_CACHE=repo), but it would have the benefit of allowing for standardisation (the per-repo cache would always be located at the same place, instead of allowing people to use sometimes tmp/oaklib and sometimes something else like tmp/oaklib-cache, tmp/cache/oaklib, etc.).