zinc #pathSegments returns segments that are either Character or ByteString

In the current version of Pharo 9 on MacOS (Pharo9.0.0 Build information: Pharo-9.0.0+build.1444.sha.a22a97ca8a50d8c0d470c3d1dde29af0e49a85f3 (64 Bit)), when pathSegments is called on a url with an empty path a ByteSymbol is returned.

znUrl := 'https://github.com/feenkcom/gtoolkit/issues/' asZnUrl.
znUrl pathSegments fourth

When updating to the latest version of Zinc from this repository, this behaviour changes. A character is returned instead. Screenshot 2021-06-01 at 17 49 54

We are not sure if this change was on purpose?

Also if there is a path segment then we get a ByteString: Screenshot 2021-06-01 at 17 54 08

So one cannot assume pathSegments returns a collection of Strings.

Jun 01 '21 15:06 chisandrei

It was intentional

https://github.com/svenvc/zinc/commit/d962746ae9688550d0c826b9103018a87789afb1#diff-47063f74b25c99af0ddf4f0840769d07aca3728dabb14c78d1f741f685d75235

but I can't immediately remember why we did this, maybe @noha can ...

Jun 01 '21 16:06 svenvc

Yes, I'm a mail archiver. The point was that / is valid as trailing slash and that had the problem that

'http://foo.bar.com/p/' asZnUrl / #otherSegment

gave

http://foo.bar.com/p//otherSegment

(not the double slash). You can see this up to pharo7. Now when you do it you get only a single slash how it should be

Jun 01 '21 16:06 noha

Also, Andrei, a URL that ends with a slash is fundamentally different from one that does not. See ZnUrl>>#isDirectoryPath

Maybe #pathSegments exposes a bit too much about the internal implementation. From this perspective, #path might be a better accessor.

What is it that you want to do ?

Jun 01 '21 18:06 svenvc

thanks @svenvc and @noha for the details. I though there was a reason but I did not find that commit.

The code that failed in our case was something like below. On a url that has four segments we were checking if the fourth segment was a number, to validate a url like https://github.com/user/repo/issues/1234 and invalidate https://github.com/user/repo/issues/.

(NumberParser on: anUrl pathSegments fourth) 
		failBlock: [ isValid := false. ^ self ];
		nextNumber

Now I'll check explicitly for / since I see it can be a valid path segment.

Jun 01 '21 18:06 chisandrei

Hello @svenvc and @noha. I was just randomly reading a URI RFC and accidentally stumbled across a path segment grammar. https://datatracker.ietf.org/doc/html/rfc3986#section-3.3

According to the grammar, a segment can not contain slash / but can be an empty string.

path-absolute = "/" [ segment-nz *( "/" segment ) ]
...
segment       = *pchar

According to RFC, the url from your example should be parsed so that it contains two path segments: p and a trailing empty segment.

Here are test cases that exemplify this behavior: https://github.com/uriparser/uriparser/blob/1762d5ff025fb07b4b8ccd1a8a9635009b2e9e34/test/test.cpp#L112

Have you considered implementing a separate RFC conform URL parser (with AST nodes), having that there is a grammar available?

### References ABNF for URI: https://datatracker.ietf.org/doc/html/rfc3986#appendix-A

Jun 01 '21 19:06 syrel

Hi Andrei,

I would like to keep this issue open.

I am not saying that I want to change the current implementation, but you do have a point. I reread the RFC you mentioned.

It strikes me that maybe we could replace the current $/ marker by an empty segment: this would keep the current design, but might result in a less surprising output of the path segments accessor. But I have to think about this more and hopefully I find the time to test it out.

Thanks for the feedback.

Sven

PS: Yes, a parser based on the official RFC ABNF might also be a good idea, but that you be a separate project first.

Jun 02 '21 12:06 svenvc

Hi Sven. I opened the issue again. Thanks for looking into this!

Jun 02 '21 12:06 chisandrei

About the idea of “a parser based on the official RFC ABNF”: the PetitParser grammar I gave in issue 100 might serve as a starting point.

May 05 '23 06:05 Rinzwind

zinc zinc copied to clipboard

#pathSegments returns segments that are either Character or ByteString

zinc
zinc copied to clipboard