data.world-py icon indicating copy to clipboard operation
data.world-py copied to clipboard

Lazy filesystem operations at FileConfig

Open anatoly-scherbakov opened this issue 5 years ago • 2 comments

Problem. The call dw.load_dataset('...', auto_update=True) won't run on AWS Lambda with the error:

[Errno 30] Read-only file system: '/home/sbx_user1051'

The crash happens at datadotworld.config:FileConfig.__init__, at the line

os.makedirs(path.dirname(self._config_file_path))

The reason of this is that the single directory you can write at in AWS Lambda is /tmp. Note the environment variables (DW_AUTH_TOKEN, DW_CACHE_DIR, DW_TMP_DIR) were correctly (as can be seen from the following) configured in Lambda panel, and the two latter ones were pointed to /tmp/dw_cache/ and /tmp/dw_tmp/ directories, respectively.

Thus, we expect DW to look into environment variables but it still reaches out to the readonly filesystem. The mechanism of this is located at datadotworld._get_instance() function.

Manual solution is straightforward: we can use dw.DataDotWorld(config=dw.EnvConfig()).load_dataset(...) - that works perfectly.

However, ChainConfig seems to be built specifically for use cases like this, with the intention of automatically guessing what config to use without unwanted side effects from other configs.

Automatic Solution, step 1: FileConfig operations. This pull request moves over the FS-dependent code into a function of FileConfig and wraps self._config_parser into a lazy property. A test (however clumsy) is written. Proves to work with AWS Lambda, but not to the full extent. It crashes:

...
  File "/var/task/datadotworld/__init__.py", line 101, in load_dataset
    auto_update=auto_update)
  File "/var/task/datadotworld/datadotworld.py", line 152, in load_dataset
    dataset_key, cache_dir)
  File "/var/task/datadotworld/client/api.py", line 494, in download_datapackage
    shutil.move(unzipped_dir, dest_dir)
   ...
  File "/var/lang/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)

Why?

Automatic Solution, step 2: Cache dir. Research shows that ChainConfig is looping through all configs in its chain and fetching the first one which returns a non-None value for the field it is in need right now. That's what happens for cache_dir, which is hardcoded in DefaultConfig to be path.expanduser('~/.dw/cache'). Since InlineConfig is the first element in the chain and directly derives this field from DefaultConfig, - ChainConfig never even asks EnvConfig for this field. Using the default config chain mechanism, we can never override cache_dir.

I am removing this default declaration from DefaultConfig and moving it over to FileConfig where it does not bother me, but I realize a) this might be a backwards incompatible change and b) there is no actual reason to put the default value specifically into FileConfig. The more correct mechanism would be probably to poll all configs in the chain and, only if none of them returned a valid value, get a default value out of the sleeve.

Would be happy to hear feedback and thoughts.

anatoly-scherbakov avatar Jun 21 '19 13:06 anatoly-scherbakov

@anatoly-scherbakov thanks for the pull request. I'm also surprised that setting DW_CACHE_DIR, DW_TMP_DIR didn't work by default, so thanks for digging in further and figuring out why they don't work. Getting this library to work seamlessly on a lambda and respect the ENV based configs is definitely a good move forward.

I'll circle back with members of the team to determine the best course of action.

markmarkoh avatar Jun 21 '19 14:06 markmarkoh

Also, sorry about the issues with the unit tests. I've put a fix in and if you merge master, those should be good now.

laconc avatar Jul 12 '19 16:07 laconc