data.world-py
data.world-py copied to clipboard
Lazy filesystem operations at FileConfig
Problem. The call dw.load_dataset('...', auto_update=True)
won't run on AWS Lambda with the error:
[Errno 30] Read-only file system: '/home/sbx_user1051'
The crash happens at datadotworld.config:FileConfig.__init__
, at the line
os.makedirs(path.dirname(self._config_file_path))
The reason of this is that the single directory you can write at in AWS Lambda is /tmp
. Note the environment variables (DW_AUTH_TOKEN
, DW_CACHE_DIR
, DW_TMP_DIR
) were correctly (as can be seen from the following) configured in Lambda panel, and the two latter ones were pointed to /tmp/dw_cache/
and /tmp/dw_tmp/
directories, respectively.
Thus, we expect DW to look into environment variables but it still reaches out to the readonly filesystem. The mechanism of this is located at datadotworld._get_instance()
function.
Manual solution is straightforward: we can use dw.DataDotWorld(config=dw.EnvConfig()).load_dataset(...)
- that works perfectly.
However, ChainConfig
seems to be built specifically for use cases like this, with the intention of automatically guessing what config to use without unwanted side effects from other configs.
Automatic Solution, step 1: FileConfig operations. This pull request moves over the FS-dependent code into a function of FileConfig
and wraps self._config_parser
into a lazy property. A test (however clumsy) is written. Proves to work with AWS Lambda, but not to the full extent. It crashes:
...
File "/var/task/datadotworld/__init__.py", line 101, in load_dataset
auto_update=auto_update)
File "/var/task/datadotworld/datadotworld.py", line 152, in load_dataset
dataset_key, cache_dir)
File "/var/task/datadotworld/client/api.py", line 494, in download_datapackage
shutil.move(unzipped_dir, dest_dir)
...
File "/var/lang/lib/python3.7/os.py", line 221, in makedirs
mkdir(name, mode)
Why?
Automatic Solution, step 2: Cache dir. Research shows that ChainConfig
is looping through all configs in its chain and fetching the first one which returns a non-None
value for the field it is in need right now. That's what happens for cache_dir
, which is hardcoded in DefaultConfig
to be path.expanduser('~/.dw/cache')
. Since InlineConfig
is the first element in the chain and directly derives this field from DefaultConfig
, - ChainConfig
never even asks EnvConfig
for this field. Using the default config chain mechanism, we can never override cache_dir
.
I am removing this default declaration from DefaultConfig
and moving it over to FileConfig
where it does not bother me, but I realize a) this might be a backwards incompatible change and b) there is no actual reason to put the default value specifically into FileConfig
. The more correct mechanism would be probably to poll all configs in the chain and, only if none of them returned a valid value, get a default value out of the sleeve.
Would be happy to hear feedback and thoughts.
@anatoly-scherbakov thanks for the pull request. I'm also surprised that setting DW_CACHE_DIR, DW_TMP_DIR
didn't work by default, so thanks for digging in further and figuring out why they don't work. Getting this library to work seamlessly on a lambda and respect the ENV based configs is definitely a good move forward.
I'll circle back with members of the team to determine the best course of action.
Also, sorry about the issues with the unit tests. I've put a fix in and if you merge master, those should be good now.