cloudpathlib icon indicating copy to clipboard operation
cloudpathlib copied to clipboard

Raise exception when open with write mode in call stack

Open jayqi opened this issue 4 years ago • 13 comments

A pass at trying to detect when __fspath__ is called by open with a write mode. IMO it doesn't work all that well. It appears open is a weird function that doesn't have its own frame in the call stack. It jumps from __fspath__'s frame directly to the frame that calls open. I don't think there's a straightforward way to get object references from the frames (it's bytecode or source), so also hard to reliably detect if open is being called on the cloud path instance, or to pick up on what mode open is being called with.

jayqi avatar Apr 03 '21 01:04 jayqi

🚀 Deployed on https://deploy-preview-140--gallant-agnesi-5f7bb3.netlify.app

github-actions[bot] avatar Apr 03 '21 01:04 github-actions[bot]

Codecov Report

Merging #140 (e6c2bae) into master (ecf787e) will decrease coverage by 0.0%. The diff coverage is 91.8%.

@@           Coverage Diff            @@
##           master    #140     +/-   ##
========================================
- Coverage    93.5%   93.5%   -0.1%     
========================================
  Files          21      21             
  Lines        1119    1154     +35     
========================================
+ Hits         1047    1079     +32     
- Misses         72      75      +3     
Impacted Files Coverage Δ
cloudpathlib/cloudpath.py 90.6% <91.1%> (+<0.1%) :arrow_up:
cloudpathlib/client.py 87.5% <100.0%> (ø)
cloudpathlib/exceptions.py 100.0% <100.0%> (ø)
cloudpathlib/s3/s3client.py 92.8% <0.0%> (ø)

codecov[bot] avatar Apr 03 '21 01:04 codecov[bot]

Ok @jayqi, this really sent me down a 🐰 🕳️ ...

Here's a notebook showing a few more options: https://gist.github.com/pjbull/1b8f92b84a40cd6fe2033468f98594a4

Thoughts on those?

pjbull avatar Apr 03 '21 18:04 pjbull

Wow, well done. 👏 It looks to me like you're capturing the main cases. I think grabbing the entire source of the frame and parsing the AST is probably the only reliable way to do this. One concern I have is how much overhead this adds to each call, and how it scales with the frame's size.

jayqi avatar Apr 03 '21 18:04 jayqi

Yeah, that's exactly my concern too. I'll see if I can add some benchmarking. No clue how long those things take....

pjbull avatar Apr 03 '21 18:04 pjbull

OK, I updated that gist above with some benchmarking. Here are my takeaways:

  • Adding the check makes the __fspath__ take ~10x the time of having no check
  • On my machine, the slowdown is in the same ballpark as writing a 48KB file or reading a 1.6MB on. As file sizes grow larger than those, the slowdown will start to be overwhelmed by the cost of reading/writing.
  • inspect.getsource is the most expensive call. I didn't track down a faster call for this in some poking around.
  • Neither AST parsing nor regex change execution time appreciably, even for reasonably large code contexts, so choosing one of those should not be based on performance.

So, I think if we go with Option 1 as outlined in #128 we should: (1) add the AST version based on getcurrentframe which is fast and robust as compared to the other options, and (2) implement an environment variable to allow opting out of the performance hit if you are sure you don't use open with cloudpaths.

For me, options 2-4 as outlined in #128 are also still on the table given this research.

pjbull avatar Apr 04 '21 07:04 pjbull

Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)

6 ms wall time per run isn't untenably slow but I guess not trivial. Adds a minute to 10,000 files.

jayqi avatar Apr 04 '21 20:04 jayqi

Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)

I think since the frame we want to look at is from the caller's code, it will depend on how their code is installed so not sure that we can speed it up.

I did just poke at bytecode disassembly with dis as well, and that gave me the same order of magnitude as getsource (~5ms).

Agreed that 6ms isn't large enough to rule this approach out.

I think to improve the user friendliness of working with paths, I'm inclined towards this change.

pjbull avatar Apr 04 '21 20:04 pjbull

@jayqi Check out the latest commits. This seems to work, but it does add a ton of complex code for a narrow-ish use case.

pjbull avatar Apr 12 '21 06:04 pjbull

Another rev here. I thought that with the AST we lost the line numbers for matching writeable opens we detected and the entry point from calls, but it turns out that with a little bit of tracking, we can use both the AST approach and raise the error on the right line.

For example, here is the right trace in a notebook: image

Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8

Small bonus fix: silence warnings in tests by checking for attributes in __del__ methods

pjbull avatar Apr 14 '21 17:04 pjbull

Uh oh, did we end up using a Python 3.8-only feature? It looks like the 3.8 build passed but 3.6 and 3.7 failed.

jayqi avatar Apr 14 '21 23:04 jayqi

Yeah, see above:

Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8

I'm seeing differences in what line inspect reports for a given frame across versions. Need to dig in more to find a fix.

pjbull avatar Apr 14 '21 23:04 pjbull

OK, I think this is ready for review. It is passing all the tests.

I do think that this is a pretty substantial amount of complexity to add for maitenance, so we may want to turn this off by default if we get lots of bug reports that come up in this code block.

I wouldn't be surprised if there are cases where the inspect calls don't do what we expect given all the different ways one can run Python.

Performance update


Here's info on timing for this implementation:

Script:

from cloudpathlib import CloudPath
import statistics
import time


if __name__ == "__main__":
    p = CloudPath('s3://cloudpathlib-test-bucket/5bmg9iTs74wmewasLTP44W-test_cloudpath_file_io-test_file_discovery/dir_0/file0_0.txt')

    it = 100
    res = []

    for _ in range(it):
        start = time.time()

        with open(p, "r"):
            pass

        res.append(time.time() - start)

    print(f"Median time {it} iterations: {statistics.median(res)}")

Results:

Around +0.05s on average for each call to open

❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=True python timing.py
Median time 100 iterations: 0.2683759927749634

❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=False python timing.py
Median time 100 iterations: 0.2952016592025757

pjbull avatar May 14 '21 20:05 pjbull