jena icon indicating copy to clipboard operation
jena copied to clipboard

GH-2402: RDF Patch Binary handles malformed inputs better

Open rvesse opened this issue 10 months ago • 3 comments

Discovered in $dayjob work that RDFPatchReaderBinary would silently accept malformed inputs in some cases. Started by adding several test cases, some of which failed, to demonstrate that the RDF Patch Binary reader can silently accept invalid streams. Then debugged from there are introduced additional logic which makes a best effort attempt to distinguish between genuine EOF and EOF due to malformed input that should produce an error.

Unfortunately the Thrift API doesn't have a clean way to detect that you've genuinely reached the EOF so we just have to attempt to read the next row from the patch, and detect the obvious cases of malformed input:

  1. TProtocolUtil.skip() getting called, this indicates that Thrift recognised a Field ID but that the Type ID was not the expected type for that field. So if we hit a EOF while skipping over such a field we're definitely reading malformed input.
  2. TUnion.read() getting called multiple times. This means that we're partway through reading an otherwise valid data structure at the point where we encounter the EOF so again is a clear indicator of malformed input.

I appreciate that this solution is somewhat hacky, if anyone has a more robust solution please feel free to suggest it

GitHub issue resolved #2402


  • [x] Tests are included.
  • [x] Documentation change and updates are provided for the Apache Jena website
  • [x] Commits have been squashed to remove intermediate development commit messages.
  • [x] Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

rvesse avatar Apr 09 '24 08:04 rvesse

Asked about better solutions on the Thrift dev list and the response was basically "your Thrift protocol design is flawed" - https://lists.apache.org/thread/vc4odxroj0l2kx58zh9wk7j8o5487lcl

rvesse avatar Jul 03 '24 10:07 rvesse

The RDF patch protocol has begin and end markers to connect to a transaction. - TX and TC/TA in the text encoding.

An abrupt end will mean no "commit" and the received should abort the change.

For receiving where a transaction isn't available a BufferingDatasetGraph can be useful.

afs avatar Jul 03 '24 21:07 afs

@afs What do we want to do with this PR, it's clear from the Thrift developers response there isn't anything better we can do given how the Thrift is defined for RDF Patch Binary. While I still don't like the hackiness of this solution it at least improves the situation from the current one where some malformed/truncated patches are silently accepted.

The same issue likely also applies to RDF Thrift so may be worth me generalising the hack into one of the Thrift utility classes and augmenting both the RDF Patch Binary and RDF Thrift readers with the change

rvesse avatar Aug 29 '24 09:08 rvesse

A TX-TC pair would give the framing mentioned by the Thrift developers. Comment: https://lists.apache.org/thread/yd8wnvrg7ltg6szl09kk7fpoglxtvkv1

If there is any fault in the stream, then TC isn't seen and the processor should abort. This is robust to any violation of the expected input. This is also good for detecting data after the TC e.g. concatenated legal syntax but an unexpected condition. There are other conditions like data before TX, or no TX ("TX. junk" has a different error message), as is the case for non-patch input.

Non-patch input leads to an empty patch. One possibility is to add RDFPatch.isEmpty which would cover the case of no TX

Inspecting the stack is cause the exception to materialize the stack.

There are other sources of information - if the transport is from a string, it's not truncated. A string is either good or bad. An input stream from a network can be truncated.

The Protobuf form may be more robust. It is slightly slower than Thrift. Thrift parses RDF at up to 1 million triples/s, Protobuf is slower by ~5% because it adds an object wrapper around the unit in the input stream which may give better error visibility. This one extra Java object looks like it is the cause of the 5% but we're into the sub-microsecond area here so quite possible.

afs avatar Aug 30 '24 10:08 afs

So handle via documentation instead i.e. document the potential issue and recommended that users always include transaction markers in patches?

Certainly better than poking around the stack trace which I would really rather avoid

rvesse avatar Aug 30 '24 13:08 rvesse

Yes, I think documentation is the way to go. Stack trace inspection is not great for a long term solution.

Truncation on a patch line boundary isn't caught by Thrift and only TX/TC can detect that as a error. That means documentation is desirable.

The thrift code is used in other places - TDB2 uses it for node encoding (a small unit) so Thrift overhead could show up there.

afs avatar Sep 02 '24 15:09 afs

As discussed closing in favour of documentation instead

rvesse avatar Sep 03 '24 10:09 rvesse