dvc icon indicating copy to clipboard operation
dvc copied to clipboard

list: optimize -R/--recursive

Open efiop opened this issue 3 years ago • 3 comments

http://bench.dvc.org/ Screenshot 2022-08-16 at 15 32 39

efiop avatar Aug 16 '22 12:08 efiop

A significant part of the recent performance regression comes from 496599518a2f79a79b63888a1d9eaa30d8712021, specifically this line: https://github.com/iterative/dvc/blob/3c3cbee7fddd7a087cda5b1e0c821786bdd57be9/dvc/repo/ls.py#L88

We could swipe that issue under the rug by swapping the arguments of the or, but looking at the rest of the code in _ls(), that fs.isdvc() seems redundant.

But the main issue seems to be that there are a lot of redundant path operations, splitting and recombining them multiple times in _DvcFileSystem.info() and similar methods.

rlamy avatar Aug 17 '22 18:08 rlamy

📉

Screenshot 2022-08-24 at 13 35 03

efiop avatar Aug 24 '22 10:08 efiop

After #8241 is merged, there will still be a few easy targets for optimisation:

  • _DvcFileSystem.ls() shouldn't call ._info() but rather directly gather the relevant info from the underlying fs/dvc_fs, using .ls(.., detail=True).
  • _info() ends up calling _update() on every file that is passed to it, it should probably only do that for directories. (Note that the impact of this is compounded by the previous point).
  • In dvc.repo.ls._ls(), the following lines take 30% of the run-time just to convert filenames into paths: https://github.com/iterative/dvc/blob/8432de5980d42d3929b7df7741eb8ed2134a0c74/dvc/repo/ls.py#L73-L76

rlamy avatar Sep 08 '22 14:09 rlamy

Turns out I accidentally re-discovered and fixed all three points from @rlamy 's research ^

Fixed by https://github.com/iterative/dvc/pull/8779 and https://github.com/iterative/dvc/pull/8780 . Closing.

Kudos @rlamy 🙏

efiop avatar Jan 09 '23 20:01 efiop

For the record, this is how it looks now Screenshot 2023-01-09 at 22 02 03

efiop avatar Jan 09 '23 20:01 efiop