embrace pathlib fully in .core?
Currently I see 19 uses in code
$> git grep -e '\<op[j.]' -- datalad/core | grep -v test_ | nl
1 datalad/core/distributed/clone.py: config_content = op.read_file(cfg_path)
2 datalad/core/distributed/clone.py: config_content = op.read_file(cfg_path)
3 datalad/core/local/create.py: op.normpath(op.join(str(path), os.pardir)))
4 datalad/core/local/create.py: if op.isdir(tbds.path) and listdir(tbds.path) != [] and not force:
5 datalad/core/local/create.py: [op.join('.datalad', i[0]) for i in attrs_cfg])
6 datalad/core/local/create.py: op.join('.datalad', p), {}).get(k, None) == v:
7 datalad/core/local/create.py: attrfile=op.join('.datalad', '.gitattributes'))
8 datalad/core/local/diff.py: orig_path.endswith(op.sep) or resolved_path == ds.pathobj
9 datalad/core/local/run.py:from os.path import join as opj
10 datalad/core/local/run.py: rel_pwd = op.curdir
11 datalad/core/local/run.py: return [d for d in map(op.dirname, gpaths.expand(refresh=True))
12 datalad/core/local/run.py: if op.exists(path) or op.lexists(path):
13 datalad/core/local/run.py: pwd = normpath(opj(dataset.path, rel_pwd))
14 datalad/core/local/run.py: record_dir = ds.config.get('datalad.run.record-directory', default=op.join('.datalad', 'runinfo'))
15 datalad/core/local/run.py: record_path = op.join(ds_path, record_dir, record_id)
16 datalad/core/local/run.py: if not op.lexists(record_path):
17 datalad/core/local/run.py: msg_path = relpath(opj(str(repo.dot_git), "COMMIT_EDITMSG"))
18 datalad/core/local/status.py: not (orig_path.endswith(op.sep) or
19 datalad/core/local/status.py: super_root = get_dataset_root(op.dirname(root))
and about 159 in tests. I think it would be beneficial to get core fully pathlib driven.
No change re state within the past 18 months. Sigh.
18 left!
just a note for what kept causing issues for me (even bugs) when switching from os.path to pathlib -- absence of lexists analog in pathlib. So typically any lexists (45 hits ATM) needs to be replaced with smth like p.exists() or p.is_symlink(). IMHO quite suboptimal...
I am not sure what is suboptimal here. If you are hinting at making two calls rather than one, I'd say that the actual pathlib replacement for lexists() should use the same approach as os.path.lexists(), which would be
def pathlib_lexists(path):
try:
path.lstat()
except FileNotFoundError:
return False
return True
Related: https://github.com/datalad/datalad/issues/4056
Note that os.path.normpath has no clear equivalent in pathlib. Only function that resolves .. and . is Path.resolve, but it
- is context sensitive and outputs absolute paths
- also resolves symbolic links
We need to carefully consider how these differences should be handled.
The reason why normpath is not in pathlib is that it is not generally safe to use. Consider this:
Setup
❯ mkdir /tmp/norm
❯ ln -s /etc /tmp/norm/dummy
and now this:
>>> normpath('/tmp/norm/dummy/../usr')
'/tmp/norm/usr'
This path simply does not exist.
>>> Path('/tmp/norm/dummy/../usr').resolve()
PosixPath('/usr')
This is "correct".
The difference between normpath and resolve is massive. Not only in correctness, but also in performance. normpath is a purely lexical operation, resolve requires filesystem inspection to give the correct answer.
DataLad still uses normpath in some critical places. All associated functionality is subject to breakage, whenever such conditions are encountered.
FYI: Python 3.12 will bring a Pathlib alternative to os.walk
❱ git grep 'os.walk'
datalad/core/local/tests/test_save.py:887: for root, _, fnames in os.walk(op.join(path, save_path or >
datalad/local/check_dates.py:32: for root, dirs, _ in os.walk(path):
datalad/support/archive_utils_patool.py:180: for root, dirs, files in os.walk(dir_, followlinks=False):
datalad/support/archives.py:86: _, dirs, files = next(os.walk(dir_))
datalad/support/archives.py:91: subdir, subdirs_, files_ = next(os.walk(opj(dir_, dirs[0])))
datalad/support/archives.py:366: for root, dirs, files in os.walk(path): # TEMP
datalad/support/tests/test_gitrepo.py:740: for (dirpath, dirnames, filenames) in os.walk(path):
datalad/utils.py:426: for dirpath, dirnames, filenames in os.walk(topdir):
datalad/utils.py:494: for root, dirs, files in os.walk(path, followlinks=False):
@jwodder , as you have done quite a bit of timing on os.walk and its multithreading -- are you aware of any performance for that Pathlib alternative to os.walk?
@yarikoptic
- If you're referring to https://github.com/con/fscacher/pull/67 and related work, that made no use of
os.walk(). - I know nothing about the performance aspects of
Path.walk().