datalad icon indicating copy to clipboard operation
datalad copied to clipboard

embrace pathlib fully in .core?

Open yarikoptic opened this issue 5 years ago • 7 comments

Currently I see 19 uses in code

$> git grep -e '\<op[j.]' -- datalad/core | grep -v test_ | nl
     1	datalad/core/distributed/clone.py:                config_content = op.read_file(cfg_path)
     2	datalad/core/distributed/clone.py:                config_content = op.read_file(cfg_path)
     3	datalad/core/local/create.py:            op.normpath(op.join(str(path), os.pardir)))
     4	datalad/core/local/create.py:        if op.isdir(tbds.path) and listdir(tbds.path) != [] and not force:
     5	datalad/core/local/create.py:                [op.join('.datalad', i[0]) for i in attrs_cfg])
     6	datalad/core/local/create.py:                        op.join('.datalad', p), {}).get(k, None) == v:
     7	datalad/core/local/create.py:                    attrfile=op.join('.datalad', '.gitattributes'))
     8	datalad/core/local/diff.py:                orig_path.endswith(op.sep) or resolved_path == ds.pathobj
     9	datalad/core/local/run.py:from os.path import join as opj
    10	datalad/core/local/run.py:        rel_pwd = op.curdir
    11	datalad/core/local/run.py:        return [d for d in map(op.dirname, gpaths.expand(refresh=True))
    12	datalad/core/local/run.py:        if op.exists(path) or op.lexists(path):
    13	datalad/core/local/run.py:        pwd = normpath(opj(dataset.path, rel_pwd))
    14	datalad/core/local/run.py:        record_dir = ds.config.get('datalad.run.record-directory', default=op.join('.datalad', 'runinfo'))
    15	datalad/core/local/run.py:        record_path = op.join(ds_path, record_dir, record_id)
    16	datalad/core/local/run.py:        if not op.lexists(record_path):
    17	datalad/core/local/run.py:            msg_path = relpath(opj(str(repo.dot_git), "COMMIT_EDITMSG"))
    18	datalad/core/local/status.py:                            not (orig_path.endswith(op.sep) or
    19	datalad/core/local/status.py:                        super_root = get_dataset_root(op.dirname(root))

and about 159 in tests. I think it would be beneficial to get core fully pathlib driven.

yarikoptic avatar May 30 '20 15:05 yarikoptic

No change re state within the past 18 months. Sigh.

mih avatar Nov 02 '21 09:11 mih

18 left!

mih avatar Jun 08 '22 06:06 mih

just a note for what kept causing issues for me (even bugs) when switching from os.path to pathlib -- absence of lexists analog in pathlib. So typically any lexists (45 hits ATM) needs to be replaced with smth like p.exists() or p.is_symlink(). IMHO quite suboptimal...

yarikoptic avatar Jun 08 '22 19:06 yarikoptic

I am not sure what is suboptimal here. If you are hinting at making two calls rather than one, I'd say that the actual pathlib replacement for lexists() should use the same approach as os.path.lexists(), which would be

def pathlib_lexists(path):
    try:
        path.lstat()
    except FileNotFoundError:
        return False
    return True

mih avatar Jun 09 '22 06:06 mih

Related: https://github.com/datalad/datalad/issues/4056

mih avatar Sep 07 '22 07:09 mih

Note that os.path.normpath has no clear equivalent in pathlib. Only function that resolves .. and . is Path.resolve, but it

  1. is context sensitive and outputs absolute paths
  2. also resolves symbolic links

We need to carefully consider how these differences should be handled.

kimsin98 avatar Sep 13 '22 08:09 kimsin98

The reason why normpath is not in pathlib is that it is not generally safe to use. Consider this:

Setup

❯ mkdir /tmp/norm
❯ ln -s /etc /tmp/norm/dummy

and now this:

>>> normpath('/tmp/norm/dummy/../usr')
'/tmp/norm/usr'

This path simply does not exist.

>>> Path('/tmp/norm/dummy/../usr').resolve()
PosixPath('/usr')

This is "correct".

The difference between normpath and resolve is massive. Not only in correctness, but also in performance. normpath is a purely lexical operation, resolve requires filesystem inspection to give the correct answer.

DataLad still uses normpath in some critical places. All associated functionality is subject to breakage, whenever such conditions are encountered.

mih avatar Sep 13 '22 09:09 mih

FYI: Python 3.12 will bring a Pathlib alternative to os.walk

❱ git grep 'os.walk'
datalad/core/local/tests/test_save.py:887:             for root, _, fnames in os.walk(op.join(path, save_path or >
datalad/local/check_dates.py:32:        for root, dirs, _ in os.walk(path):
datalad/support/archive_utils_patool.py:180:        for root, dirs, files in os.walk(dir_, followlinks=False):
datalad/support/archives.py:86:        _, dirs, files = next(os.walk(dir_))
datalad/support/archives.py:91:            subdir, subdirs_, files_ = next(os.walk(opj(dir_, dirs[0])))
datalad/support/archives.py:366:        for root, dirs, files in os.walk(path):  # TEMP
datalad/support/tests/test_gitrepo.py:740:    for (dirpath, dirnames, filenames) in os.walk(path):
datalad/utils.py:426:    for dirpath, dirnames, filenames in os.walk(topdir):
datalad/utils.py:494:    for root, dirs, files in os.walk(path, followlinks=False):

adswa avatar Mar 20 '23 09:03 adswa

@jwodder , as you have done quite a bit of timing on os.walk and its multithreading -- are you aware of any performance for that Pathlib alternative to os.walk?

yarikoptic avatar Mar 20 '23 14:03 yarikoptic

@yarikoptic

  • If you're referring to https://github.com/con/fscacher/pull/67 and related work, that made no use of os.walk().
  • I know nothing about the performance aspects of Path.walk().

jwodder avatar Mar 20 '23 14:03 jwodder