cloudpathlib
cloudpathlib copied to clipboard
Raise exception when open with write mode in call stack
A pass at trying to detect when __fspath__ is called by open with a write mode. IMO it doesn't work all that well. It appears open is a weird function that doesn't have its own frame in the call stack. It jumps from __fspath__'s frame directly to the frame that calls open. I don't think there's a straightforward way to get object references from the frames (it's bytecode or source), so also hard to reliably detect if open is being called on the cloud path instance, or to pick up on what mode open is being called with.
🚀 Deployed on https://deploy-preview-140--gallant-agnesi-5f7bb3.netlify.app
Codecov Report
Merging #140 (e6c2bae) into master (ecf787e) will decrease coverage by
0.0%. The diff coverage is91.8%.
@@ Coverage Diff @@
## master #140 +/- ##
========================================
- Coverage 93.5% 93.5% -0.1%
========================================
Files 21 21
Lines 1119 1154 +35
========================================
+ Hits 1047 1079 +32
- Misses 72 75 +3
| Impacted Files | Coverage Δ | |
|---|---|---|
| cloudpathlib/cloudpath.py | 90.6% <91.1%> (+<0.1%) |
:arrow_up: |
| cloudpathlib/client.py | 87.5% <100.0%> (ø) |
|
| cloudpathlib/exceptions.py | 100.0% <100.0%> (ø) |
|
| cloudpathlib/s3/s3client.py | 92.8% <0.0%> (ø) |
Ok @jayqi, this really sent me down a 🐰 🕳️ ...
Here's a notebook showing a few more options: https://gist.github.com/pjbull/1b8f92b84a40cd6fe2033468f98594a4
Thoughts on those?
Wow, well done. 👏 It looks to me like you're capturing the main cases. I think grabbing the entire source of the frame and parsing the AST is probably the only reliable way to do this. One concern I have is how much overhead this adds to each call, and how it scales with the frame's size.
Yeah, that's exactly my concern too. I'll see if I can add some benchmarking. No clue how long those things take....
OK, I updated that gist above with some benchmarking. Here are my takeaways:
- Adding the check makes the
__fspath__take ~10x the time of having no check - On my machine, the slowdown is in the same ballpark as writing a 48KB file or reading a 1.6MB on. As file sizes grow larger than those, the slowdown will start to be overwhelmed by the cost of reading/writing.
inspect.getsourceis the most expensive call. I didn't track down a faster call for this in some poking around.- Neither AST parsing nor regex change execution time appreciably, even for reasonably large code contexts, so choosing one of those should not be based on performance.
So, I think if we go with Option 1 as outlined in #128 we should:
(1) add the AST version based on getcurrentframe which is fast and robust as compared to the other options, and (2) implement an environment variable to allow opting out of the performance hit if you are sure you don't use open with cloudpaths.
For me, options 2-4 as outlined in #128 are also still on the table given this research.
Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)
6 ms wall time per run isn't untenably slow but I guess not trivial. Adds a minute to 10,000 files.
Do you think inspect.getsource might have a different speed if it's getting the source from an installed version of cloudpathlib? (Editable and normal)
I think since the frame we want to look at is from the caller's code, it will depend on how their code is installed so not sure that we can speed it up.
I did just poke at bytecode disassembly with dis as well, and that gave me the same order of magnitude as getsource (~5ms).
Agreed that 6ms isn't large enough to rule this approach out.
I think to improve the user friendliness of working with paths, I'm inclined towards this change.
@jayqi Check out the latest commits. This seems to work, but it does add a ton of complex code for a narrow-ish use case.
Another rev here. I thought that with the AST we lost the line numbers for matching writeable opens we detected and the entry point from calls, but it turns out that with a little bit of tracking, we can use both the AST approach and raise the error on the right line.
For example, here is the right trace in a notebook:

Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8
Small bonus fix: silence warnings in tests by checking for attributes in __del__ methods
Uh oh, did we end up using a Python 3.8-only feature? It looks like the 3.8 build passed but 3.6 and 3.7 failed.
Yeah, see above:
Tests path locally on 3.8 on linux as well, so there is likely some fix needed for Python<3.8
I'm seeing differences in what line inspect reports for a given frame across versions. Need to dig in more to find a fix.
OK, I think this is ready for review. It is passing all the tests.
I do think that this is a pretty substantial amount of complexity to add for maitenance, so we may want to turn this off by default if we get lots of bug reports that come up in this code block.
I wouldn't be surprised if there are cases where the inspect calls don't do what we expect given all the different ways one can run Python.
Performance update
Here's info on timing for this implementation:
Script:
from cloudpathlib import CloudPath
import statistics
import time
if __name__ == "__main__":
p = CloudPath('s3://cloudpathlib-test-bucket/5bmg9iTs74wmewasLTP44W-test_cloudpath_file_io-test_file_discovery/dir_0/file0_0.txt')
it = 100
res = []
for _ in range(it):
start = time.time()
with open(p, "r"):
pass
res.append(time.time() - start)
print(f"Median time {it} iterations: {statistics.median(res)}")
Results:
Around +0.05s on average for each call to open
❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=True python timing.py
Median time 100 iterations: 0.2683759927749634
❯ CLOUDPATHLIB_CHECK_UNSAFE_OPEN=False python timing.py
Median time 100 iterations: 0.2952016592025757