datalad embrace pathlib fully in .core?

Currently I see 19 uses in code

$> git grep -e '\<op[j.]' -- datalad/core | grep -v test_ | nl
     1	datalad/core/distributed/clone.py:                config_content = op.read_file(cfg_path)
     2	datalad/core/distributed/clone.py:                config_content = op.read_file(cfg_path)
     3	datalad/core/local/create.py:            op.normpath(op.join(str(path), os.pardir)))
     4	datalad/core/local/create.py:        if op.isdir(tbds.path) and listdir(tbds.path) != [] and not force:
     5	datalad/core/local/create.py:                [op.join('.datalad', i[0]) for i in attrs_cfg])
     6	datalad/core/local/create.py:                        op.join('.datalad', p), {}).get(k, None) == v:
     7	datalad/core/local/create.py:                    attrfile=op.join('.datalad', '.gitattributes'))
     8	datalad/core/local/diff.py:                orig_path.endswith(op.sep) or resolved_path == ds.pathobj
     9	datalad/core/local/run.py:from os.path import join as opj
    10	datalad/core/local/run.py:        rel_pwd = op.curdir
    11	datalad/core/local/run.py:        return [d for d in map(op.dirname, gpaths.expand(refresh=True))
    12	datalad/core/local/run.py:        if op.exists(path) or op.lexists(path):
    13	datalad/core/local/run.py:        pwd = normpath(opj(dataset.path, rel_pwd))
    14	datalad/core/local/run.py:        record_dir = ds.config.get('datalad.run.record-directory', default=op.join('.datalad', 'runinfo'))
    15	datalad/core/local/run.py:        record_path = op.join(ds_path, record_dir, record_id)
    16	datalad/core/local/run.py:        if not op.lexists(record_path):
    17	datalad/core/local/run.py:            msg_path = relpath(opj(str(repo.dot_git), "COMMIT_EDITMSG"))
    18	datalad/core/local/status.py:                            not (orig_path.endswith(op.sep) or
    19	datalad/core/local/status.py:                        super_root = get_dataset_root(op.dirname(root))

and about 159 in tests. I think it would be beneficial to get core fully pathlib driven.

May 30 '20 15:05 yarikoptic

No change re state within the past 18 months. Sigh.

Nov 02 '21 09:11 mih

18 left!

Jun 08 '22 06:06 mih

just a note for what kept causing issues for me (even bugs) when switching from os.path to pathlib -- absence of lexists analog in pathlib. So typically any lexists (45 hits ATM) needs to be replaced with smth like p.exists() or p.is_symlink(). IMHO quite suboptimal...

Jun 08 '22 19:06 yarikoptic

I am not sure what is suboptimal here. If you are hinting at making two calls rather than one, I'd say that the actual pathlib replacement for lexists() should use the same approach as os.path.lexists(), which would be

def pathlib_lexists(path):
    try:
        path.lstat()
    except FileNotFoundError:
        return False
    return True

Jun 09 '22 06:06 mih

Related: https://github.com/datalad/datalad/issues/4056

Sep 07 '22 07:09 mih

Note that os.path.normpath has no clear equivalent in pathlib. Only function that resolves .. and . is Path.resolve, but it

is context sensitive and outputs absolute paths
also resolves symbolic links

We need to carefully consider how these differences should be handled.

Sep 13 '22 08:09 kimsin98

The reason why normpath is not in pathlib is that it is not generally safe to use. Consider this:

Setup

❯ mkdir /tmp/norm
❯ ln -s /etc /tmp/norm/dummy

and now this:

>>> normpath('/tmp/norm/dummy/../usr')
'/tmp/norm/usr'

This path simply does not exist.

>>> Path('/tmp/norm/dummy/../usr').resolve()
PosixPath('/usr')

This is "correct".

The difference between normpath and resolve is massive. Not only in correctness, but also in performance. normpath is a purely lexical operation, resolve requires filesystem inspection to give the correct answer.

DataLad still uses normpath in some critical places. All associated functionality is subject to breakage, whenever such conditions are encountered.

Sep 13 '22 09:09 mih

FYI: Python 3.12 will bring a Pathlib alternative to os.walk

❱ git grep 'os.walk'
datalad/core/local/tests/test_save.py:887:             for root, _, fnames in os.walk(op.join(path, save_path or >
datalad/local/check_dates.py:32:        for root, dirs, _ in os.walk(path):
datalad/support/archive_utils_patool.py:180:        for root, dirs, files in os.walk(dir_, followlinks=False):
datalad/support/archives.py:86:        _, dirs, files = next(os.walk(dir_))
datalad/support/archives.py:91:            subdir, subdirs_, files_ = next(os.walk(opj(dir_, dirs[0])))
datalad/support/archives.py:366:        for root, dirs, files in os.walk(path):  # TEMP
datalad/support/tests/test_gitrepo.py:740:    for (dirpath, dirnames, filenames) in os.walk(path):
datalad/utils.py:426:    for dirpath, dirnames, filenames in os.walk(topdir):
datalad/utils.py:494:    for root, dirs, files in os.walk(path, followlinks=False):

Mar 20 '23 09:03 adswa

@jwodder , as you have done quite a bit of timing on os.walk and its multithreading -- are you aware of any performance for that Pathlib alternative to os.walk?

Mar 20 '23 14:03 yarikoptic

@yarikoptic

If you're referring to https://github.com/con/fscacher/pull/67 and related work, that made no use of os.walk().
I know nothing about the performance aspects of Path.walk().

Mar 20 '23 14:03 jwodder