ion-python
ion-python copied to clipboard
SimpleIon READ ERROR, 10 unit tests failed consistently.
There are some read issues when using the simpleion module.
Example failed CI/CD workflow - https://github.com/amazon-ion/ion-python/actions/runs/4410728439/jobs/7728500690
1 failed unit test within ion-python is:
test_loads_unicode_utf8_conversion():
9 failed tests from ion-tests are:
bad/clobWithNonAsciiCharacter_ion
bad/utf8/shortUtf8Sequence_3_ion
bad/utf8/outOfUnicodeBounds_1_ion
bad/utf8/outOfUnicodeBounds_2_ion
bad/utf8/wrongUtf8LeadingBits_1_ion
bad/utf8/wrongUtf8LeadingBits_3_ion
bad/utf8/wrongUtf8LeadingBits_2_ion
bad/utf8/shortUtf8Sequence_2_ion
bad/utf8/shortUtf8Sequence_1_ion
Took a look, the root seems to be the recent ion-c change. Need further investigation.
One example repro:
# data = ['-', test, test, test, test, test, ... ,test]
data = "[ '''\u2013''', "
data += 'test, ' * 3000
data += "test ]"
# below throws IERR_INVALID_SYNTAX
simpleion.loads(data, parse_eagerly=True, text_buffer_size_limit=10000)
Detailed traceback:
ionc.ionc_read() --- simpleion.py#L516 ion_reader_next() in simpleIon --- ioncmodule.c#L1345 ion_reader_next() in ion-c --- ion_reader.c#L535 _ion_reader_next_helper --- ion_reader.c#L549 _ion_reader_text_next --- ion_reader_text.c#L199 _ion_reader_text_check_follow_token --- ion_reader_text.c#L234 IERR_INVALID_SYNTAX --- ion_reader_text.c#L701
https://github.com/amazon-ion/ion-python/pull/250 mitigate this issue by building C extension utilizing ion-c 1.1.0. The commit should be reverted once this issue is solved.
After some digging, and applying an ion-c fix that is about to be submitted, I was able to fix the single ion-python unit test (non-ion-tests unit tests). The ion-tests failures took some digging to figure out what was going on.
I'm not sure yet why the change to ion-c v1.1.1 triggered these tests to break, but looking at the failures, this is what I found.
I updated ion_read_file_stream_handler
in order to add some logging so I could see what was happening:
--- a/amazon/ion/ioncmodule.c
+++ b/amazon/ion/ioncmodule.c
@@ -1361,6 +1361,11 @@ iERR ion_read_file_stream_handler(struct _ion_user_stream *pstream) {
PyObject *py_buffer = PyObject_CallMethod(stream_handle->py_file, "read", "O", IONC_STREAM_BYTES_READ_SIZE);
if (py_buffer == NULL) {
+ printf("[ion_read_file_stream_handler] py_buffer == NULL, py_file: \n");
+ PyObject_Print(stream_handle->py_file, stdout, 0);
+ printf("\nException:\n");
+ PyObject_Print(PyErr_Occurred(), stdout, 0);
+ printf("\n");
pstream->limit = NULL;
FAILWITH(IERR_READ_ERROR);
}
This resulted in the following output:
[ion_read_file_stream_handler] py_buffer == NULL, py_file:
<_io.TextIOWrapper name='/Users/glitch/Code/ion-python/vectors/iontestdata/bad/utf8/shortUtf8Sequence_1.ion' mode='r' encoding='utf-8'>
Exception:
<class 'UnicodeDecodeError'>
So, right from the start we're not able to read from the file because the underlying TextIOWrapper
has raised a UnicodeDecodeError
exception. With this error, the unit test immediately fails, all before ion-c even saw any of the data.
The thing to note here, is that the test data that we're trying to give to ion-c is known to be invalid UTF-8, and the reason for the test is to ensure the parser (ion-c in our case) identifies it as such.
This led me to look at how the files were being opened, and I found the _open
function defined in test_vectors.py. So it looks like we were intentionally specifying the encoding. This could be intended for the pure python implementation, but should probably not be used for ion-c. Partly as stated above, but also there are UTF-16 and UTF-32 files within ion-tests, and ion-c does not support anything other than UTF-8. The open function here will open other encodings, and unless I'm mistaken, convert it to UTF-8 as it parses it into python strings causing double validation (python decoding, and ion-c parsing).
I believe not too long ago we discussed only supporting UTF-8, and I believe that was the consensus, but I'm not sure.
I have a PR in the works to update the _open
function, and add the UTF-16 & UTF-32 files to the skip list, but we might want to have some more discussion around that. I'm going to dig a little more in order to see if I can identify why the tests pass prior to v1.1.1, then I'll post the PR.
I spent more time looking into why the ion-c v1.1.1 release would have changed the behavior of the unit test. A bugfix that was included in v1.1.1 changed how user managed stream handlers were called and did not bubble up read errors. This resulted in a bug where an error reading from the handler would be detected as EOF but only after calling next, which resulted in ion_reader_open_stream
returning successfully even if the handler did not.
The python decoder exception described above occurs with both v1.1.0 and v1.1.1, however prior to v1.1.1, if the stream handler returned an error on first call, it would result in ion_reader_open_stream
returning an error, which would cause the ion-c python extension to raise an IonException
which would satisfy the unit test. This would overwrite the decoder's exception and is why, I assume, it wasn't noticed before.
With the v1.1.1 change, that initial error from ion_reader_open_stream
was not returned, which led the ion-c extension to return an iterator. As soon as ionc_read
returned python would pick up on the raised exception from the TextIOWrapper's decoder and bubble it up. The raised UnicodeDecodeError
was not expected by the unit test causing it to fail.
I have a PR that I'll be submitting shortly for ion-c that addresses the very first bug fix (that addressed the non-ion-test unit test failure), along with an update to address the read error bug. That alone will fix the unit test failures, but I also have an ion-python PR that I'll submit to remove the use of text decoding when using the C extension.
It would probably also be a good idea to check for exceptions after read attempts in the C extension. If an underlying IO exception occurs, it would be nice to make sure that gets bubbled to the user, rather than just an ion related exception. I might include that in the PR as well.