param icon indicating copy to clipboard operation
param copied to clipboard

Speeds up the Path and derived Parameters

Open MarcSkovMadsen opened this issue 2 years ago • 7 comments

Closes #889

  • Avoiding to use Parameterized resolve_path function as there is big, big overhead compared to using simple function!
  • Caching the actual resolve to not do this over and over and over again in Panel apps.

If I run pytest tests/testpathparam.py, I see.

main: 57 passed in 0.21s this: 57 passed in 0.19s

So this is not where you see the benefit. That is because you are only testing new instances and new paths. But that is not how Path is being used in practice. For example in Panel the pn.config is rather static and holds paths to css etc. which is being requested every time a page is loaded.

Performance tests

My setup

Python 3.10 on Linux.

Profiling script

You can use the script below.

import pytest
from pyinstrument import Profiler
import param
from pathlib import Path

N = 100000

TESTS_ROOT = Path.cwd()

@pytest.fixture(autouse=True)
def auto_profile(request):
    PROFILE_ROOT = (TESTS_ROOT / ".profiles")
    # Turn profiling on
    profiler = Profiler()
    profiler.start()

    yield  # Run test

    profiler.stop()
    PROFILE_ROOT.mkdir(exist_ok=True)
    results_file = PROFILE_ROOT / f"{request.node.name}.html"
    profiler.write_html(results_file)

class Custom(param.Parameterized):
    path = param.Path("script.py")


def test_class():
    for i in range(N):
        Custom.path

custom = Custom()
def test_existing_instance():
    for i in range(N):
        custom.path

def test_new_instance():
    for i in range(N):
        Custom().path

try:
    from param.parameters import _resolve_path
    def test_no_instance():
        for i in range(N):
            _resolve_path("script.py")
except:
    pass

Results

N=100000

branch test_class test_existing_instance test_new_instance test_no_instance
main 11.5s 11.5s 15.8s
this - pathlib 0.7s 0.7s 4.9s 0.3s
this - os 0.6s 0.7s 4.9s 0.3s

It makes you think about the cost of using Parameterized classes and ParameterizedFunctions! You can compare this to FastApi and Pydantic. Pydantic 2.0 was a reimplementation in Rust to make applications like FastApi fast.

MarcSkovMadsen avatar Nov 12 '23 06:11 MarcSkovMadsen

Additional Speed up if using faster attribute lookup.

If the implementation of Path would not use costly self._resolve_path and self.search_paths attribute look ups then it would again be much faster.

The below was run on this branch and shows this.

image

The implementation would look like

image

Its a matter of whether you want an 8% speed up over code readability for humans and static analysis tools in your editor.

Additional speed up by not looking up cwd.

Path._resolve look up the cwd on each run because the default value of search_paths is None.

At least in Panel we should probably provide the search_paths once on instantiation of pn.config to avoid having this cost over and over again.

Maybe we should even change the default behaviour of Path.search_paths from None to the cwd when instantiated?

MarcSkovMadsen avatar Nov 12 '23 07:11 MarcSkovMadsen

There are broken tests so I assume this is still work in progress?

A test suite is not a benchmark so I'm not surprised we don't see performance penalties or regressions from the test suite itself. We now leverage asv to run some benchmark, feel free to contribute to it https://github.com/holoviz/param/tree/main/benchmarks.

maximlt avatar Nov 12 '23 11:11 maximlt

I hoped to get some feedback first. For example its seems Param does not utilize caching today and the focus is minimum of code. This is a change in favor of speed.

Thus the first step is really to spark the discussion with this PR. And I would like to know if you would like to accept this kind of change before I spend time finalizing.

The tests are breaking on macOS and Windows. To me it looks like the path strings are resolved in different But equivalent ways by os and pathlib.

Having thought about it, I assumed pathlib was better as its "modern" Python. But now I'm not sure. At least I should profile that too.

MarcSkovMadsen avatar Nov 12 '23 11:11 MarcSkovMadsen

Ok. Using pathlib is slow. The image below slows the new implementation without caching on left and old implementation without caching on right.

image

I will revert to using os methods and keep the cache.

MarcSkovMadsen avatar Nov 12 '23 17:11 MarcSkovMadsen

I need to spend more time to review this (not sure I'll have time this week!). My feedback from working on Param 2.0:

  • I'm not surprised there are performance improvements to gain, performance hasn't been a major focus, even if with the addition of the asv benchmark we've been able to catch some regressions and mitigate them.
  • don't trust the test suite! It has many gaps, I can't count how many times I broke Panel or HoloViews without breaking the test suite.
  • Param is so old that many of its bugs have turned into features, changing it requires great care. It's often worth adding tests before making any change, to capture the previous behavior and see how it's affected.

This piece of code breaks with the suggested changes, the last line used to raise an error and it no longer does. It shows that the difficulty with resources like file/directory paths is that they can be removed without Param knowing about it, which is why I think it always resolves. Maybe the way forward is to overload the newly added check_exists attribute (https://github.com/holoviz/param/pull/800) with a new 'once' mode (better name welcome) that only checks whether the resource exists when the Parameter is instantiated.

import os
import param
import pathlib

fp = pathlib.Path('exists.txt')
fp.write_text('')

class P(param.Parameterized):
    path = param.Path('exists.txt')

p = P()
print(p.path)

fp.unlink()

print(p.path)

maximlt avatar Nov 12 '23 23:11 maximlt

Thanks for the feedback. I plan to give this a second iteration when I have gotten panel-chemistry working for Panel 1.x. First thing would be adding the use case you describe to the tests.

MarcSkovMadsen avatar Nov 13 '23 16:11 MarcSkovMadsen

I quickly played around with another implementation (compared to main, not to this branch), (ab)using check_exists.

diff --git a/param/parameters.py b/param/parameters.py
index ed11b53b..8852b79b 100644
--- a/param/parameters.py
+++ b/param/parameters.py
@@ -2706,9 +2706,11 @@ class Path(Parameter):
     ----------
     search_paths : list, default=[os.getcwd()]
         List of paths to search the path from
-    check_exists: boolean, default=True
-        If True (default) the path must exist on instantiation and set,
-        otherwise the path can optionally exist.
+    check_exists: boolean or 'once', default=True
+        If True (default) the path must exist on instantiation and whenever
+        get/set is called, if 'once' the path must only exist on instantiation
+        and on get the first resolved path is returned, if False the path can
+        optionally exist.
     """
 
     __slots__ = ['search_paths', 'check_exists']
@@ -2733,8 +2735,8 @@ class Path(Parameter):
             search_paths = []
 
         self.search_paths = search_paths
-        if check_exists is not Undefined and not isinstance(check_exists, bool):
-            raise ValueError("'check_exists' attribute value must be a boolean")
+        if check_exists is not Undefined and check_exists not in (True, False, 'once'):
+            raise ValueError("'check_exists' attribute value must be a boolean or 'once'")
         self.check_exists = check_exists
         super().__init__(default,**params)
         self._validate(self.default)
@@ -2749,16 +2751,23 @@ class Path(Parameter):
         else:
             if not isinstance(val, (str, pathlib.Path)):
                 raise ValueError(f'{_validate_error_prefix(self)} only take str or pathlib.Path types')
+            path = None
             try:
-                self._resolve(val)
:+                path = self._resolve(val)
             except OSError as e:
-                if self.check_exists:
+                if self.check_exists is True:
                     raise OSError(e.args[0]) from None
+            if path is not None and self.check_exists == 'once':
+                # Abusing check_exists to store the resolved path on parameter
+                # instantiation.
+                self.check_exists = ('once', path)
 
     def __get__(self, obj, objtype):
         """
         Return an absolute, normalized path (see resolve_path).
         """
+        if isinstance(self.check_exists, tuple):
+            return self.check_exists[1]
         raw_path = super().__get__(obj,objtype)
         if raw_path is None:
             path = None

With these benchmarks:


class ParameterPathResolveSuite:

    def setup(self):
        import pathlib
        p = pathlib.Path('dummy')
        p.write_text('dummy')

        class P(param.Parameterized):
            x0 = param.Path(str(p))

        self.P = P
        self.p = P()

    def time_resolve_new(self):
        p = self.P()
        p.x0

    def time_resolve_same(self):
        self.p.x0


class ParameterPathResolveOnceSuite:

    def setup(self):
        import pathlib
        p = pathlib.Path('dummy')
        p.write_text('dummy')

        class P(param.Parameterized):
            x0 = param.Path(str(p), check_exists='once')

        self.P = P
        self.p = P()


    def time_resolve_new(self):
        p = self.P()
        p.x0

    def time_resolve_same(self):
        self.p.x0

Then I started to think about how this might be related to readonly and constant, to Parameter attribute inheritance and to __set__. To finally decide this could wait for January 🙃

maximlt avatar Dec 22 '23 01:12 maximlt