frictionless transform unhandled exception on 200 line csv, but not 197 line csv
Overview
I am seeing a very strange error when doing a transform (either in Python code or via the command line tool). Depending on the size of the input file, the transform either succeeds or throws an "I/O operation on closed file" exception. The number of lines required to trigger it seems to vary, even by execution environment.
On an M1 Mac Mini, 198 lines currently crashes while 197 lines passes. On a gitpod instance (Ubuntu) it was around the same yesterday, but today it is more like 150. In our own code it can take 10k+ lines. But there is always a size above which this fails (and that size is far short of e.g. settings.FIELD_SIZE_LIMIT).
Example Command Line
% frictionless transform data/crash-transform/data.csv --pipeline data/crash-transform/pipeline.json
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [step-error] Step is not valid: "cell_replace" raises "[step-error] Step is not valid: "table_normalize" raises "[source-error] The data source has not supported or has │
│ inconsistent contents: I/O operation on closed file. " " │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Pipeline.json
{
"steps": [
{ "type": "table-normalize" },
{ "type": "cell-replace", "pattern": "BLANK", "replace": "" },
{ "type": "cell-replace", "pattern": "blank", "replace": "" },
{ "type": "cell-replace", "pattern": "NULL", "replace": "" },
{ "type": "cell-replace", "pattern": "null", "replace": "" },
{
"name": "NewSumField",
"type": "field-add",
"formula": "Field1 + Field2"
},
{ "name": "NewConstantField", "type": "field-add", "value": "NewValue" }
]
}
data.csv
Field1,Field2,Random1,Random2,Random3,Random4,Random5,Random6,Random7,Random8,Random9
0,0,BLANK,blank,NULL,null,5val0,6val0,7val0,8val0,9val0
1,10,1val1,2val1,3val1,4val1,5val1,6val1,7val1,8val1,9val1
2,20,1val2,2val2,3val2,4val2,5val2,6val2,7val2,8val2,9val2
3,30,BLANK,2val3,3val3,4val3,5val3,6val3,7val3,8val3,9val3
... extend as needed ...
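For reproduction, a small throwaway script can generate a data.csv of arbitrary length following the same pattern (the 300-row default is an arbitrary value above the thresholds I've seen; adjust as needed):
# generate_data.py: build data.csv with enough rows to cross the failure threshold
N_ROWS = 300  # arbitrary; anything above the threshold for your environment

header = "Field1,Field2,Random1,Random2,Random3,Random4,Random5,Random6,Random7,Random8,Random9"
with open("data.csv", "w") as f:
    f.write(header + "\n")
    f.write("0,0,BLANK,blank,NULL,null,5val0,6val0,7val0,8val0,9val0\n")
    for i in range(1, N_ROWS):
        f.write(f"{i},{i * 10}," + ",".join(f"{c}val{i}" for c in range(1, 10)) + "\n")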
Sample files
I've been doing some debugging of this and have some ideas about what it might be related to, but I'm relatively new to Frictionless so I may be confused about the architecture or the specifics.
resource.py enter and exit not re-entrant
There seems to be an issue where __enter__ and __exit__ are not re-entrant but are called multiple times.
__enter__ is:
# TODO: shall we guarantee here that it's at the beggining for the file?
# TODO: maybe it's possible to do type narrowing here?
def __enter__(self):
    if self.closed:
        self.open()
    return self
__exit__ is:
def __exit__(self, type, value, traceback): # type: ignore
    self.close()
So if you do something like (pseudo-code):
resource = Resource()
with resource as r1:
    # A1
    with resource as r2:
        # B1
    # A2
Then __enter__ is called twice. The first time does self.open(); the second time just returns. When the inner with ends, __exit__ is called and so self.close() runs immediately. So at point A2 you are still inside the outer with, but the resource is now closed.
If you start consuming a generator on the resource (e.g. row_stream()) at A1 it works, but if you consume more of it at A2 the resource is closed and the iteration will (eventually) fail.
Why it eventually fails (and not immediately)
I believe the number of rows matters because the CSV parser takes a sample, and then iterates through chain(sample, self.loader.text_stream). So my belief is that if the file fits within the sample then the self.close() doesn't matter, because the sample has already been read into memory. But if you read beyond the size of the sample then you are into self.loader.text_stream, which has been closed as above.
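The same effect can be shown with plain Python (nothing Frictionless-specific here, just a file, an in-memory sample and itertools.chain):
from itertools import chain

f = open("data.csv")
sample = [next(f) for _ in range(3)]  # rows already buffered in memory
stream = chain(sample, f)             # remaining rows still come from the open file

f.close()  # what the premature __exit__ effectively does

next(stream)  # OK: still consuming the in-memory sample
next(stream)  # OK
next(stream)  # OK: last buffered row
next(stream)  # ValueError: I/O operation on closed file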
Other concerns
The comment # TODO: shall we guarantee here that it's at the beggining for the file? seems to suggest the expectation that every with should start at the beginning of the file, whereas the current code just returns self, which will be in whatever state the file (or rather the parser -> loader, I think) happens to be in at the time.
Proposed solutions
I can see a number of ways to potentially fix this, but my limited understanding of the architecture and the central nature of resource.py makes me cautious about going too far down one route without advice from more knowledgeable people. I'm happy to work on the options and produce a PR given some advice on the best way to go.
Option 1: Make enter and exit reference-counted
We could add a reference count so that __exit__ only calls close() when the number of __exit__ calls matches the number of __enter__ calls. This would mean that the Resource was still open at A2 above. I have tested this and it seems to work (i.e. it can process large files without crashing).
However, I think I am then seeing other weird behaviour, like our code only starting to output from line 297 of the input rather than line 1. I suspect this is due to the "re-open in the current state (e.g. first 100 lines already read into the sample) rather than starting at the beginning" behaviour, but I haven't done enough debugging to prove it.
Option 2: Make enter return a deep clone of the Resource with a reset parser/loader as needed
This would mean that in the above pseudo-code r1 and r2 are actually different objects with independent views on the same file, and closing r2 would have no effect on r1. This seems broadly the safest, and would agree with the comment in the code about being at the beginning of the file. However, it's also the most significant change and I'm not sure if it is in line with the intended architecture.
With enough nesting or chaining, you could also presumably reach a point where you can't open any more connections to the file (particularly remote files).
Option 3: something else
Perhaps we are using it wrong, or I'm entirely misunderstanding what is going wrong, so there may be a different and better way to fix it.
Further Investigation: Option 2 seems correct, but confirmation of design goals needed
Summary of proposal(s)
See details below, but from further investigation I believe that Option 2 - returning a clone/copy - is the correct solution, though there are still some decisions to make about the right way to do it. Essentially we could:
- Option 2a: always RETURN A COPY from Resource::__enter__(). This is easy, but requires everyone to use the full with resource as target: form (and not just with resource:, which doesn't use the returned copy and instead expects the resource to be mutated in place). The code uses both styles of with at present.
- Option 2b: make Resource::__enter__() single-use by throwing an exception if it's called multiple times (due to nested contexts). Callers would explicitly have to to_copy() before nesting withs. This is like 2a but potentially more explicit.
- Option 2c: make Resource be some kind of FACADE WITH INTERNAL CONTEXT STACK/CACHE, with a new context being pushed in __enter__() and popped in __exit__(). This is hard, but would allow both types of with to work (with the state of the Resource being the state at the top of the context stack).
I believe that Option 2a is the simplest and best solution, with Option 2b a close second, but I'm keen to get thoughts from more experienced users before making a PR.
Simplified error demo for discussion
As discussed in the previous comment, there is a relatively simple reproduction case. The pseudo-code for it is below, but I'm also attaching a working demo (remove the .txt extension, which was added so GitHub would accept the upload): crash.py.txt
resource = TableResource(file="data.csv")
with resource as r1:
    # A1
    # iterate r1.row_stream() 100 times and print out the row
    with resource as r2:
        # B1
        # iterate r2.row_stream() 50 times and print out the row
    # A2
    # iterate r1.row_stream() another 100 times and print out the row
With no changes this crashes at # A2 with ValueError: I/O operation on closed file., as the end of the with resource as r2: block closes the resource. While this might seem like a case of "don't do that", this kind of nesting does actually happen in pipelines within the step functions etc.
Why not Option 1 (reference counting)?
By reference counting, I mean: increment a counter whenever __enter__() is called, decrement it in __exit__(), and only call close() when the count reaches 0 in __exit__().
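A minimal sketch of what that could look like in resource.py (the _enter_count attribute is a name I've made up for illustration, not existing frictionless code):
def __enter__(self):
    # hypothetical sketch: count nested entries instead of only checking self.closed
    self._enter_count = getattr(self, "_enter_count", 0) + 1
    if self.closed:
        self.open()
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    # only the outermost exit actually closes the resource
    self._enter_count -= 1
    if self._enter_count == 0:
        self.close()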
This prevents the crash, but the output of the above code would be:
- At A1 it outputs rows 1...100
- At B1 it outputs rows 101...150 (i.e. continues from the state of r1)
- At A2 it outputs rows 151...251 (i.e. again continues, from the state of r2)
While this might be understandable in the example above, it often happens deep in the pipeline nesting, in ways that wouldn't be expected and that give the wrong result. For example, it happens when cell-replace for all cells reads the file to get the header row (via the petl fieldnames call in return convert(table, fieldnames(table), *args, **kwargs) inside convertall()).
This means that if we implement reference counting and run the pipeline above, it does run to completion but skips the first 297 rows of the output (those rows are consumed while setting up the pipeline, and are no longer available for yielding by the time the whole generator chain gets row_stream-ed to write the file).
Option 2 - cloning the resource - works as expected
Option 2a - simply return a copy
Change the __enter__() in resource.py to:
def __enter__(self):
    copy = self.to_copy()
    copy.open()
    return copy
This also prevents the crash, and the output of the above code would be:
- At A1 it outputs rows 1...100
- At B1 it outputs rows 1...50 (i.e. r2 is its own context that starts again at the start of the file)
- At A2 it outputs rows 101...201 (i.e. r1 continues where it left off in A1, and is not affected by B1)
However, more globally this works if and only if you go through and change all usages of the with resource: style to the with resource as copy: style. This is necessary so that (a) you actually use the copy in the subsequent code, and (b) you close the copy. If you use the basic with resource: you are expecting the resource itself to be opened, and this code doesn't do that (for the same reason the existing code doing that can be an issue).
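To illustrate the difference (hypothetical caller code, not a specific place in the library):
# Broken under Option 2a: the opened copy returned by __enter__ is discarded,
# and the original resource is never opened.
with resource:
    ...  # code here still refers to the (unopened) original resource

# Correct under Option 2a: bind and use the opened copy.
with resource as target:
    ...  # code here uses target, the opened copy, which is closed on exit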
This can be resolved by going through the library and fixing all the cases to use the with ... as ... style, but it also requires all future code to follow that pattern. Note that as far as I know, there is no way for __enter__() to know whether it's being used in the with ... as form or just with, so there's no way to throw an exception or similar to guarantee the correct usage. Throwing a descriptive exception on closing an already-closed Resource might be a partial help, but more likely the code crashes in various "resource not open" places before that (though those could also be updated to give descriptive messages explaining the issue).
This would also be a breaking change for any external code that uses with without as, though I would argue that such code is already either subtly broken or working by chance rather than by design, given the issues discussed above.
I have implemented this fix in a private copy and it does work for the pipeline given at the start, and does return the correct rows (starting at row 1 as expected). So I am confident this can be made to work in a relatively neat PR.
Option 2b - make Resources explicitly single-use
We could make Resources explicitly single-use by throwing an exception if __enter__() is called on a resource that has already been entered. This is similar to the above, but instead of handling it internally and forcing the use of with ... as, it would make the requirement external and force the use of resource.to_copy() before any nested with.
The benefit is that you get an explicit exception when you do the wrong thing.
The downside is that you have to update every occurrence of a nested with, not just those that don't use with ... as ..., and it makes the code somewhat more verbose.
I have not implemented this, but it is really the same as 2a with the code changes in a different place.
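A rough sketch of this variant (the _entered flag and the choice of exception are illustrative, not existing frictionless API):
def __enter__(self):
    # hypothetical sketch: refuse to re-enter a resource that is already entered
    if getattr(self, "_entered", False):
        raise RuntimeError(
            "Resource is already entered; call to_copy() before nesting 'with' blocks"
        )
    self._entered = True
    if self.closed:
        self.open()
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    self._entered = False
    self.close()
Callers would then nest using with resource.to_copy() as r2: rather than with resource as r2:.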
Option 2c - caching context stack
You could in theory do something similar to 2a, but keep the state on an internal stack. Every time __enter__() is called you push a new context onto the stack, and every __exit__() pops it. The Resource class would use the latest state, and on exit would have its state restored to whatever it was before the matching enter.
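Very roughly (the _capture_state/_restore_state helpers are purely hypothetical and hide all the real complexity of parser/loader/sample state):
def __enter__(self):
    # hypothetical sketch: remember the current reading state, then start afresh
    self._context_stack = getattr(self, "_context_stack", [])
    self._context_stack.append(self._capture_state())
    self.open()
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    self.close()
    # restore whatever reading state the enclosing 'with' (if any) was using
    self._restore_state(self._context_stack.pop())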
This would make either form of with work, but is much more complicated, particularly with the various specialisations of Resource (TableResource, JsonResource, etc), the dynamic metadata, etc.
I have not implemented this, and I'm somewhat concerned about the scale and complexity of trying to implement an appropriate facade and context stack.
Option 3 - something I've missed?
I'd also be very happy to hear that there's something obvious I've missed, and that there's a different and simpler way to fix it!
Discussion in relation to PEP 343 – The “with” Statement
These issues and possible solutions are somewhat discussed in the PEP that defined the with statement: https://peps.python.org/pep-0343/#caching-context-managers
Many context managers (such as files and generator-based contexts) will be single-use objects. Once the __exit__() method has been called, the context manager will no longer be in a usable state (e.g. the file has been closed, or the underlying generator has finished execution).
Requiring a fresh manager object for each with statement is the easiest way to avoid problems with multi-threaded code and nested with statements trying to use the same context manager. It isn’t coincidental that all of the standard library context managers that support reuse come from the threading module - they’re all already designed to deal with the problems created by threaded and nested usage.
We could enforce Resources to be single-use objects, since they are essentially generators, which the PEP suggests should be single-use. But that seems to go against the architecture, as there are numerous places where with statements are nested inside other with statements (though disguised by the composable nature of pipeline steps etc.). Options 2a and 2b are both really a way to make resources single-use in practice - by using a new single-use copy for each nested with - either explicitly or implicitly.
Equally, the PEP does suggest the use of caching context managers, as per Option 2c, and it seems something like this is implemented for Decimal precision (see example 8 in https://peps.python.org/pep-0343/#examples). That does however seem a lot harder in the Resource case due to the complexity of Resources.