spring-hateoas
spring-hateoas copied to clipboard
Link parsing only works for basic cases
We've decided to use your library to parse Link headers, hoping that you'll correctly implement the specification intricacies, because doing parsing correctly is tricky. However, it's implemented completely naively, working only in the basic cases. It uses regex for parsing, even though it can't be used for non-context-free grammars.
Here are a bunch of examples:
// should have failed
Link.valueOf("foo bar <url>;rel=\"next\"");
// greedy capture until the last `>`
System.out.println(Link.valueOf("<url>;customparam=foo>;rel=\"next\"").getHref());
// multiple rels not supported, should have been parsed to a collection
System.out.println(Link.valueOf("<url>;rel=\"next last\"").getRel().value());
// incorrect unquoting of values - missing end quote must be ignored, but currently "nex" is returned (missing last "t")
System.out.println(Link.valueOf("<url>;rel=\"next").getRel().value());
// param value not unescaped
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo\\\"bar\"").getTitle());
// incorrect semicolon handling - a semicolon within a quoted value is not treated literally
System.out.println(Link.valueOf("<url>;rel=\"next\";title=\"foo;bar\"").getRel().value());
This is the ABNF for the Link header:
Link = #link-value
link-value = "<" URI-Reference ">" *( OWS ";" OWS link-param )
link-param = token BWS [ "=" BWS ( token / quoted-string ) ]
And here for the content of the rel param:
relation-type *( 1*SP relation-type )
where:
relation-type = reg-rel-type / ext-rel-type
reg-rel-type = LOALPHA *( LOALPHA / DIGIT / "." / "-" )
ext-rel-type = URI ; Section 3 of [RFC3986]
See https://httpwg.org/specs/rfc8288.html
So there can be multiple values in the rel param, even URIs and quoted URIs.
Some of the errors can't be fixed without breaking b-w compatibility, e.g. the decoding of href, unescaping of params and multiple rel values. Also, I'm not sure why custom parameters aren't supported - I didn't read the whole spec, but I don't think it prohibits custom params (we don't use them, just wondering).
Looking at the RFC more deeply, looks that I'm missing a lot. For example, missing > or missing final quote " MUST be ignored. There's nothing about URL-decoding the href, at least I can't find it. Seems that the particular service I'm working with now got this incorrectly since they are double-escaping their URL, contrary to the standard. There also can be multiple links within a single header, separated by ,, if I understand it correctly.
But many of the points above still apply. The RFC contains precise algorithm for parsing, I would expect a high quality library to strictly follow it, including the edge cases.
Would you mind listing the cases that really do break and are valid link headers according to the specification?
@odrotbohm I edited the original post of this issue. The list isn't complete, to implement correctly, one should follow the specification very closely. There's exact algorithm for the parsing an Appendix B, from which handling of all (most?) edge cases follows (e.g. the ignoring of the missing final quote, if I understand it correctly).
Would you mind turning them into test cases and submitting a PR? I'll see what I can do then. If you fancy taking a stab at a better parsing implementation, feel free to submit that, too.
This is fixed by merging and polishing your PR. Also back-ported to 2.3.x and 2.2.x.
How you renamed the issue made my day: so it wasn't a bug, but a new, advanced feature?? I renamed it back, unfortunatelly, I can't rename the commit...
We needed to parse a Link. My junior colleague wanted to do a quick-and-dirty parser. I told to him that there's an RFC, rather look for a lib because you won't make it correctly. We found this lib, but realized that it had exactly the same issues I worried a junior would make. And now you renamed it to "support for advanced expressions". Is a semicolon in a String literal advanced expression? Just tell me, would you really say so?