zinc
zinc copied to clipboard
#pathSegments returns segments that are either Character or ByteString
In the current version of Pharo 9 on MacOS (Pharo9.0.0
Build information: Pharo-9.0.0+build.1444.sha.a22a97ca8a50d8c0d470c3d1dde29af0e49a85f3 (64 Bit)), when pathSegments
is called on a url with an empty path a ByteSymbol
is returned.
znUrl := 'https://github.com/feenkcom/gtoolkit/issues/' asZnUrl.
znUrl pathSegments fourth
data:image/s3,"s3://crabby-images/ca39c/ca39c62fab57df8526ac7697aab6577016a72108" alt="Screenshot 2021-06-01 at 17 49 14"
When updating to the latest version of Zinc from this repository, this behaviour changes. A character is returned instead.
We are not sure if this change was on purpose?
Also if there is a path segment then we get a ByteString
:
So one cannot assume pathSegments
returns a collection of Strings.
It was intentional
https://github.com/svenvc/zinc/commit/d962746ae9688550d0c826b9103018a87789afb1#diff-47063f74b25c99af0ddf4f0840769d07aca3728dabb14c78d1f741f685d75235
but I can't immediately remember why we did this, maybe @noha can ...
Yes, I'm a mail archiver. The point was that / is valid as trailing slash and that had the problem that
'http://foo.bar.com/p/' asZnUrl / #otherSegment
gave
http://foo.bar.com/p//otherSegment
(not the double slash). You can see this up to pharo7. Now when you do it you get only a single slash how it should be
Also, Andrei, a URL that ends with a slash is fundamentally different from one that does not. See ZnUrl>>#isDirectoryPath
Maybe #pathSegments exposes a bit too much about the internal implementation. From this perspective, #path might be a better accessor.
What is it that you want to do ?
thanks @svenvc and @noha for the details. I though there was a reason but I did not find that commit.
The code that failed in our case was something like below. On a url that has four segments we were checking if the fourth segment was a number, to validate a url like https://github.com/user/repo/issues/1234
and invalidate https://github.com/user/repo/issues/
.
(NumberParser on: anUrl pathSegments fourth)
failBlock: [ isValid := false. ^ self ];
nextNumber
Now I'll check explicitly for /
since I see it can be a valid path segment.
Hello @svenvc and @noha. I was just randomly reading a URI RFC and accidentally stumbled across a path segment grammar. https://datatracker.ietf.org/doc/html/rfc3986#section-3.3
According to the grammar, a segment can not contain slash /
but can be an empty string.
path-absolute = "/" [ segment-nz *( "/" segment ) ]
...
segment = *pchar
According to RFC, the url from your example should be parsed so that it contains two path segments: p
and a trailing empty segment.
Here are test cases that exemplify this behavior: https://github.com/uriparser/uriparser/blob/1762d5ff025fb07b4b8ccd1a8a9635009b2e9e34/test/test.cpp#L112
Have you considered implementing a separate RFC conform URL parser (with AST nodes), having that there is a grammar available?
### References ABNF for URI: https://datatracker.ietf.org/doc/html/rfc3986#appendix-A
Hi Andrei,
I would like to keep this issue open.
I am not saying that I want to change the current implementation, but you do have a point. I reread the RFC you mentioned.
It strikes me that maybe we could replace the current $/ marker by an empty segment: this would keep the current design, but might result in a less surprising output of the path segments accessor. But I have to think about this more and hopefully I find the time to test it out.
Thanks for the feedback.
Sven
PS: Yes, a parser based on the official RFC ABNF might also be a good idea, but that you be a separate project first.
Hi Sven. I opened the issue again. Thanks for looking into this!
About the idea of “a parser based on the official RFC ABNF”: the PetitParser grammar I gave in issue 100 might serve as a starting point.