#1622: Fix issues with nested context manager calls
## Summary
- fixes #1622
## Problem statement
The `Resource` class is also a context manager. That is, it implements the `__enter__()` and `__exit__()` methods to allow the use of `with Resource(...)` statements.
Prior to this PR, there was no limit on nesting `with` statements on the same `Resource`. This caused problems: the second `__enter__()` allowed the `Resource` to already be open, but the first `__exit__()` would `close()` the `Resource` while the higher-level context still expected it to be open.
This would cause errors like `ValueError: I/O operation on closed file`, or the iterator would appear to start partway through a file rather than at the start, with other similar behaviour depending on the exact locations of the nested functions.
This was made more complex because these `with` statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They could also behave differently depending on the number of rows read, the type of `Resource` (local file vs inline, etc.), the different steps in a pipeline, and so on.
All this meant that the problem was rare, hard to reduce down to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.
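The failure mode can be reduced to a minimal, self-contained sketch. `NaiveResource` below is a hypothetical stand-in, not the real frictionless `Resource`; it reproduces the pre-PR behaviour where a nested `__enter__()` reuses the open handle and the inner `__exit__()` closes it out from under the outer context:

```python
class NaiveResource:
    """Hypothetical stand-in (not the real frictionless Resource) showing why
    un-guarded nesting breaks: the inner __exit__ closes the shared handle."""

    def __init__(self, lines):
        self.lines = lines
        self.handle = None

    def __enter__(self):
        if self.handle is None:  # a second enter silently reuses the handle
            self.handle = iter(self.lines)
        return self

    def __exit__(self, *exc):
        self.handle = None  # closes even if an outer context is still active

    def read_row(self):
        if self.handle is None:
            raise ValueError("I/O operation on closed file")
        return next(self.handle)


res = NaiveResource(["a", "b", "c"])
with res:
    with res:              # nested use of the same object
        res.read_row()     # consumes "a" from the shared iterator
    # the inner __exit__ has closed the handle; the outer context is broken
    try:
        res.read_row()
        outer_still_open = True
    except ValueError:
        outer_still_open = False
print(outer_still_open)  # False: the outer context sees a closed resource
```

This is the same shape as the real bug: the error surfaces far from its cause, at whatever point the outer context next touches the resource.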
## Solution
This PR prevents nested contexts from being created by raising an exception when the second, nested `with` is attempted. This means that code that risks these issues can be quickly identified and fixed during development.
The best way to resolve it is to use `Resource.to_copy()` so that the nested `with` acts on an independent view of the same `Resource`, which is likely what was intended in most cases anyway.
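As an illustrative sketch only (hypothetical class and message, not the actual frictionless implementation), the guard plus the `to_copy()` resolution look roughly like this:

```python
class GuardedResource:
    """Hypothetical sketch of the guard -- not the real frictionless code."""

    def __init__(self, lines):
        self.lines = lines
        self.handle = None

    def __enter__(self):
        if self.handle is not None:
            raise RuntimeError(
                "Resource is already open: use to_copy() for an independent view"
            )
        self.handle = iter(self.lines)
        return self

    def __exit__(self, *exc):
        self.handle = None

    def to_copy(self):
        return GuardedResource(self.lines)  # independent view, own handle


res = GuardedResource(["a", "b"])
with res:
    try:
        with res:            # nested 'with' on the same object
            pass
        nested_allowed = True
    except RuntimeError:
        nested_allowed = False
    with res.to_copy():      # a copy nests safely
        pass
print(nested_allowed)  # False: the nested 'with' was rejected immediately
```

The point of raising eagerly is that the mistake is reported at the nested `__enter__()`, where the developer can see both contexts, rather than later as a confusing closed-file error.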
This PR also updates a number of the internal uses of `with` to work on a copy of the `Resource` they are passed, so that they are independent of any external code and whatever it might have done with the `Resource` before the library methods were called.
## Breaking Change
This is technically a breaking change, as any external code that was developed using nested `with` statements (possibly deliberately, but more likely unknowingly not falling into the error cases) will have to be updated to use `to_copy()` or similar.
However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and the unit tests added for the specific updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side-effect behaviours are different than before, and any code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other).
So it is likely that very few real-world impacts will occur, and the exception thrown does its best to explain the issue and suggest resolutions.
## Tests
- All existing unit tests run and pass unchanged
- New unit tests were added to cover the updated behaviour:
  - These unit tests were confirmed to fail without the updates in this PR (where appropriate).
  - These unit tests now pass with the updated code.
- The original script that identified the issue in #1622 was run and now gives the correct result (all rows appropriately converted and saved to file)
Thanks @richardt-engineb !
It's a complex change, so I will ask for a peer review in the chat.
Thanks for the comments @augusto-herrmann (and sorry about the delay in replying).
> I have never used nested `with` statements with `Resource` objects, so I do not think this would directly affect our own existing code.
It's good to hear that it wouldn't affect your own use of Frictionless. The main cases that would be affected are where things in the library happen to be called in such a way that nested `with` statements occur. E.g. when using a pipeline with steps, if the steps reuse the same `Resource` object the whole way through (rather than copies) you can end up with nested `with` statements (usually related to `to_petl()`-based generators, etc.).
> However, by looking at the code in this PR I can notice that a lot of internal code of the Frictionless Framework was changed to use `Resource.to_copy()` when using context managers. Does copying a `Resource` object entail copying the data as well, or just metadata? If this does indeed create an in-memory copy of the whole resource, including data, this is likely to increase memory usage, which might be an issue if the `Resource` includes a lot of data. It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.
`to_copy()` is:

```python
def to_copy(self, **options: Any) -> Self:
    """Create a copy from the resource"""
    return super().to_copy(
        data=self.data,
        basepath=self.basepath,
        detector=self.detector,
        package=self.package,
        **options,
    )
```
The main thing in there that could take up space is indeed the `data`, which could noticeably increase memory use. However, the mitigation is that with very large datasets the data is likely to come from loading a file, which uses only generators/iterators rather than caching the whole file in memory. Conversely, anywhere that full in-memory data is being used, the dataset is likely small enough that the memory increase is not a concern overall.
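To make the mitigation concrete, here is a small sketch (the `rows` loader is a hypothetical stand-in, not the frictionless loader): file-backed reading stays lazy, so each independent view just gets its own iterator and memory use does not scale with the number of copies:

```python
import io


def rows(source: str):
    """Hypothetical stand-in loader: yields rows lazily from a text stream
    instead of caching the whole file in memory."""
    for line in io.StringIO(source):
        yield line.rstrip("\n")


source = "\n".join(str(i) for i in range(1000))

# Two independent "copies" of the same file-backed data: each is just a
# fresh lazy iterator over the source, not a duplicated in-memory dataset.
first = rows(source)
second = rows(source)
print(next(first), next(second))  # 0 0 -- both start at the beginning
```

Each iterator also starts at the beginning of the data, which is exactly the independence the `to_copy()` changes are meant to guarantee.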
Are there any tests you would specifically like to see to increase your confidence, or use cases where huge in-memory datasets are used in ways that this change would make harder?
Even with this change, someone with huge in-memory datasets could use `open()` and `close()` explicitly to avoid copies. That wouldn't affect the internals, however, so it does become a question of correctness vs memory use. I did consider doing reference counting instead of copying, but that still had the problem that the state of the generators/iterators was inconsistent depending on what else had been called (e.g. pipeline ordering, etc.). For example, the `tests/table/test_to_petl.py` tests that I added do not pass without the `to_copy()` changes I made (and would not pass with a reference-counting approach).
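A minimal sketch (hypothetical class, not the real code) of why reference counting was rejected: the enter/exit calls pair up correctly, so nothing is closed early, but the single shared iterator still leaks position between the nested contexts:

```python
class RefCountedResource:
    """Hypothetical sketch of the rejected reference-counting approach:
    enter/exit pair up, but the one shared iterator still leaks state."""

    def __init__(self, lines):
        self.lines = lines
        self.refs = 0
        self.handle = None

    def __enter__(self):
        if self.refs == 0:
            self.handle = iter(self.lines)
        self.refs += 1
        return self

    def __exit__(self, *exc):
        self.refs -= 1
        if self.refs == 0:  # only the outermost exit actually closes
            self.handle = None

    def read_row(self):
        return next(self.handle)


res = RefCountedResource(["a", "b", "c"])
with res:
    with res:
        res.read_row()      # inner context consumes "a"
    row = res.read_row()    # outer context resumes mid-stream
print(row)  # "b", not "a": state leaked between the nested contexts
```

So reference counting fixes the closed-file errors but not the mid-stream iterators, which is why copying was the approach taken.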
I also considered adding a `SAFE_ITERATORS` or similar flag that would either copy or not depending on its value, but that also seemed like asking for trouble in the future, as when these issues happen they are very subtle and hard to understand. It would almost need to be a `DANGEROUS_THINGS_WILL_EVENTUALLY_GO_WRONG` flag or something!
As I said, thanks for the feedback, and I'm happy to look at any additional tests etc. that you think would conclusively show a meaningful difference and convince you either way.