
path_segments/2 does not recognize slash in Windows

Open rotu opened this issue 8 months ago • 23 comments

The path_segments/2 predicate only recognizes the backslash \ in Windows. The forward slash / is also a legal delimiter in Windows.

?- use_module(library(files)).
   true.
?- path_segments("D:\\hello\\there",E).
   E = ["D:","hello","there"].
?- path_segments("D:/hello/there",E).
   E = ["D:/hello/there"].
?- path_segments("D:\\hello/mixed\\path",E).
   E = ["D:","hello/mixed","path"].

May 02 '25 22:05 rotu

The documentation assumes exactly one:

To obtain the platform-specific directory separator, you can use:

?- path_segments(Separator, ["",""]).
   Separator = "/".

May 03 '25 04:05 UWN

The documentation assumes exactly one

Yes, this is documented, but I found it awkward in practice. It means that path parsing diverges substantially from how OS calls will interpret the path, which seems ripe for abuse!

May 03 '25 10:05 rotu

I cannot comment on OS aspects, seems you know more about this.

My only point is that once you admit several separators, you are now questioning the relational nature of path_segments/2. Do you want more than one solution given a list of segments? In fact, there would be 2^(n-1) solutions for n segments if two separators are admitted.
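
To make the count concrete, here is a minimal sketch that enumerates every path string for a given list of segments when two separators are admitted. The names separator/1 and segments_path/2 are hypothetical and not part of library(files). With n segments there are n-1 joints and two choices at each joint, hence 2^(n-1) solutions.

```prolog
:- use_module(library(lists)).

% Assumption: both characters are accepted as separators (as on Windows).
separator('/').
separator('\\').

% segments_path(+Segments, ?Path): Path is the list of characters obtained
% by joining Segments, using any accepted separator at each joint.
segments_path([S], S).
segments_path([S0,S1|Ss], Path) :-
    separator(C),
    segments_path([S1|Ss], Rest),
    append(S0, [C|Rest], Path).
```

For example, ?- segments_path(["a","b","c"], P). should yield four solutions on backtracking: "a/b/c", "a/b\\c", "a\\b/c" and "a\\b\\c".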

Either the interface has to be reconsidered (or multiple definitions offered), or you opt for a non-relational definition. A good example for such a definition which as a whole is not a relation but still in parts, is number_chars/2.

Then, does it make sense to have empty segments? Under Linux, I believe, this is not possible:

ulrich@gupu:~$ touch ''
touch: cannot touch '': No such file or directory

May 03 '25 12:05 UWN

My only point is that once you admit several separators, you are now questioning the relational nature

Hoo boy... I don't quite understand what you mean by "questioning the relational nature". Please clarify if this does not answer:

The most natural resolution is that, while path_segments/2 should split on either slash, it should only ever join using the system-specific slash. That is an asymmetry, but it seems benign to me.
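
A minimal sketch of that asymmetric design, assuming Windows conventions: splitting accepts either slash, while joining only ever emits one. The names separator/1, main_separator/1, segments//1 and joined/2 are made up for illustration and do not reflect how library(files) is implemented.

```prolog
:- use_module(library(dcgs)).
:- use_module(library(lists)).

% Assumption: both characters are accepted when splitting.
separator('/').
separator('\\').

% Assumption for this sketch: the platform's main separator is the backslash.
main_separator('\\').

% Splitting: every accepted separator ends a segment.
segments([S|Ss]) --> segment(S), more_segments(Ss).

more_segments([]) --> [].
more_segments([S|Ss]) --> [C], { separator(C) }, segment(S), more_segments(Ss).

% A segment is a (possibly empty) run of non-separator characters.
segment([]) --> [].
segment([C|Cs]) --> [C], { \+ separator(C) }, segment(Cs).

% Joining: only the main separator is ever produced.
joined([S], S).
joined([S0,S1|Ss], Path) :-
    main_separator(C),
    joined([S1|Ss], Rest),
    append(S0, [C|Rest], Path).
```

With this, ?- phrase(segments(Ss), "D:\\hello/mixed\\path"). should give Ss = ["D:","hello","mixed","path"], and joining those segments again gives "D:\\hello\\mixed\\path", so a round trip can change the string.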

It might make sense instead to use an internal representation of a path with a list of segments, irrespective of delimiter, and provide functions to convert these to/from OS paths. It also might make sense to hash out these details first for URL manipulation (which is harder to fit into a "list of segments" model since you have details like protocol, query parameters, etc).

Then, does it make sense to have empty segments?

The path API must distinguish rooted versus unrooted paths (/tmp vs tmp) which currently are distinguished by an initial empty segment. So right now, initial empty segments are a "load-bearing implementation detail".

Consecutive slashes are generally treated as if they were a single slash (foo//bar is the same as foo/bar) in most settings. So either empty internal segments can be ignored, normalized to ., or left as-is. I don't know what the right design choice is.
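
One of these options, dropping empty internal segments while keeping a leading empty segment as the root marker, could be sketched as follows. segments_normalized/2 is a hypothetical helper, not part of library(files). Note that it also drops a trailing empty segment (that is, a trailing slash), which may or may not be desirable.

```prolog
% segments_normalized(+Segments0, -Segments): Segments is Segments0 without
% empty segments, except that the first segment is always kept, so a leading
% empty segment still marks a rooted path.
segments_normalized([], []).
segments_normalized([S|Ss0], [S|Ss]) :-
    nonempty_segments(Ss0, Ss).

% nonempty_segments(+Segments0, -Segments): drop every empty segment.
nonempty_segments([], []).
nonempty_segments([S|Ss0], Ss) :-
    (   S == [] -> Ss = Ss1
    ;   Ss = [S|Ss1]
    ),
    nonempty_segments(Ss0, Ss1).
```

Under this sketch, the segments of /tmp//x and /tmp/x normalize to the same list, while tmp/x stays distinguishable from /tmp/x.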

May 04 '25 07:05 rotu

That is an asymmetry, but it seems benign to me.

Benign... Given both as you said, it is non-relational. Assuming the preferred is the backslash:

?- P = 'a/b', path_segments(P, [a,b]).
   P = 'a/b'.
?-             path_segments(P, [a,b]), P = 'a/b'.
   false.
?-             path_segments(P, [a,b]).
   P = 'a\\b'.

So this is not in the pure, monotonic part. Not a relation.

May 05 '25 16:05 UWN

@triska, the files library seems like it should either be operating-system independent and operate purely on platform-independent paths (requiring / and maybe even throwing an error if \ is found) or it should be operating-system-dependent and split paths into their logical components.

Neither approach seems obviously better. Do you have an opinion on which would be the better way for me to proceed?

May 08 '25 17:05 rotu

How commonly is a slash used on Windows as directory separator? This is the first time I hear about this.

library(files) uses Rust functionality at its base, and Rust mentions only \ for Windows: https://doc.rust-lang.org/std/path/constant.MAIN_SEPARATOR.html

library(files) should work well together with for example library(os) where paths are also used, for instance in shell/N.

One way out may be a single predicate that normalizes file paths so that exactly one directory separator character is used.

May 08 '25 18:05 triska

Both slash and backslash are explicitly supported in Windows paths. They are synonymous as directory separators, but of course, paths on Windows can be a bit tricky.

library(files) uses Rust functionality at its base, and Rust mentions only \ for Windows: https://doc.rust-lang.org/std/path/constant.MAIN_SEPARATOR.html

Well, that's why it calls it MAIN_SEPARATOR, not simply SEPARATOR.

It's very common on Windows for APIs to accept but not produce /. I would be inclined to use forward slash especially if writing my logic in a platform-agnostic way. It turns out, using quoted lists in Prolog is a pain when you need to use \ so often.

One way out may be a single predicate that normalizes file paths so that exactly one directory separator character is used.

That's a good idea, but I'm not sure if you're talking about normalizing to the host platform or to a platform-independent representation. (Note that some Windows features have no corresponding Linux concept, e.g. UNC paths on file shares and drive-relative paths. It may be okay to raise an error on these, at least for now.)

It's going to be a problem if OS-specific and OS-independent paths are easily confused (because they're both just lists of atoms), or if you're checking if a path is inside a certain public directory but a malicious user crafts some/path/../like\../this.

May 08 '25 20:05 rotu

not sure if you're talking about normalizing to the host platform

That's what I meant. Normalize to use what Rust calls the main separator, on the assumption that Rust has thought this through already and this concept appears sensible.

May 08 '25 20:05 triska

not sure if you're talking about normalizing to the host platform

That's what I meant. Normalize to use what Rust calls the main separator, on the assumption that Rust has thought this through already and this concept appears sensible.

Rust doesn't normalize paths? It keeps them in a string-like Struct Path and exposes some platform-specific functionality like has_root and components that produce the platform-correct result.

There is a canonicalize function which, I think, throws an error if the path doesn't exist.

There is also a crate to normalize paths: https://docs.rs/normalize-path/latest/normalize_path/

May 08 '25 21:05 rotu

There is no good concept of "platform independent path string", that's why Rust uses an "opaque struct" to represent this. This is also the case in other languages like Python. Also, notice that paths can contain non UTF-8 bytes (that's why Rust has OsString), so strings are really the wrong thing to use here fundamentally.

I think we could use some specialized term to represent a path (analogous to the opaque struct in Rust, but not exactly because terms can't be opaque in Scryer Prolog), and then have some predicates to manipulate that or turn it from and into strings if possible, in a design like Python's pathlib. However that may be a bit too opinionated and we may want to look at how other Prolog systems deal with it instead.

May 08 '25 22:05 bakaq

Also, it seems you use Windows @rotu, is that right? Scryer Prolog has a real lack of support on Windows because we have basically no recurrent contributor that uses Windows to actually recognize, let alone fix these compatibility problems. Any help with issues like this and especially PRs addressing these problems is very appreciated.

May 08 '25 22:05 bakaq

Yes, I sometimes use Windows, but I prefer macOS and use it more often.

Paths don't need to be completely opaque terms; most of the time, they can indeed be spelled as strings. But I still don't like lists as strings, even though the syntax is convenient. (As far as I understand - I'm still at the beginning of my Prolog journey) the list-of-chars has a "rootward bias" which is logically limiting. E.g. the glob "/var/**" is ['/',v,a,r,'/'|X], but the glob "**/file" can't be spelled as a term.

I like using Python's pathlib. The / operator is really nice and idiomatic in practice, especially because relative paths have a nice monoidal structure with '.'. Maybe this warrants a hybrid approach: have a path be either a list of chars or a term with principal functor (/)/2 and two paths as arguments.

If you need to flatten a sufficiently-instantiated path for whatever reason, it could be done with a dedicated procedure like path_to_chars/2 or, better yet, done by the consuming function. When you need to actually make a system call, convert it just in time. That allows things like FullPath = (BasePath / "path/to" / Filename), file_exists(FullPath). It neatly allows you to not worry about the path separator, just how path fragments logically compose.

May 09 '25 00:05 rotu

[...] most of the time, they can indeed be spelled as strings.

This ignores the important point I brought up that paths are in general not UTF-8 (see Rust's OsString for explanation), while Scryer's strings are always UTF-8. The actual natural internal representation here is a list of bytes, but Scryer doesn't have good support for that like UTF-8 strings. We could just not support non UTF-8 paths, and many systems do that, but that should be documented and be an intentional choice.

[...] E.g. the glob "/var/**" is ['/',v,a,r,'/'|X], but the glob "**/file" can't be spelled as a term.

This is true, it can't be written as a (single) list term, but it can still be described by DCGs or append/3. You could do a "double ended list" (like Prefix-"/file") to reify this to a term if you like, and only convert it to a partial list with append(Prefix, Suffix, List) if the prefix is ground or you want all the alternatives. Even better for this case, you could just do a list of segments or some other more advanced description of lists of segments with DCGs. I don't think using variables to represent globs is a good idea though.
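
A minimal sketch of the Prefix-Suffix idea, using only append/3 from library(lists). The predicate name pattern_path/2 is hypothetical; the pair Prefix-Suffix stands for "any path that consists of Prefix followed by Suffix".

```prolog
:- use_module(library(lists)).

% pattern_path(?Prefix-?Suffix, ?Path): Path is the list of characters
% consisting of Prefix followed by Suffix.
pattern_path(Prefix-Suffix, Path) :-
    append(Prefix, Suffix, Path).
```

For example, ?- pattern_path(Prefix-"/file", "/var/log/file"). should give Prefix = "/var/log", and the same goal fails for a path that does not end in "/file".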

I completely agree with your last paragraph, but I'd like to point out that FullPath = (BasePath / "path/to" / Filename) assumes that the internal representation is something that uses (/)/2 as a separator and strings as segments. I don't think using (/)/2 instead of lists is a good idea, for the same reasons that using (',')/2 instead of lists isn't a good idea. I also think it would be better to keep this mostly hidden, and provide predicates to work on it instead, in this case something like path_append_segments(Path, ["path", "to", Filename], FullPath).
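
As a rough sketch of what such a predicate could look like: the wrapper path/1 around a list of segments is an assumption made here for illustration, since the thread does not settle on an internal representation, and path_append_segments/3 does not exist in library(files).

```prolog
:- use_module(library(lists)).

% Assumption: a path is represented as path(Segments), where Segments is a
% list of segments and each segment is a list of characters.

% path_append_segments(+Path0, +Segments, -Path): Path is Path0 extended
% by Segments at the end.
path_append_segments(path(Ss0), Ss1, path(Ss)) :-
    append(Ss0, Ss1, Ss).
```

A query such as ?- path_append_segments(path(["","home"]), ["path","to","f.txt"], Full). should give Full = path(["","home","path","to","f.txt"]); how the result is turned into an OS path is left to a separate conversion step.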

May 09 '25 00:05 bakaq

[...] most of the time, they can indeed be spelled as strings.

This ignores the important point I brought up that paths are in general not UTF-8 (see Rust's OsString for explanation), while Scryer's strings are always UTF-8. The actual natural internal representation here is a list of bytes, but Scryer doesn't have good support for that like UTF-8 strings. We could just not support non UTF-8 paths, and many systems do that, but that should be documented and be an intentional choice.

No, I'm not ignoring it; I'm abstracting over it! If you treat path fragments as an opaque building block and know that they can be joined if need be, you don't need to worry about that. User logic can be generic over that detail.

[...] E.g. the glob "/var/**" is ['/',v,a,r,'/'|X], but the glob "**/file" can't be spelled as a term.

This is true, it can't be written as a (single) list term, but it can still be described by DCGs or append/3. You could do a "double ended list" (like Prefix-"/file") to reify this to a term if you like, and only convert it to a partial list with append(Prefix, Suffix, List) if the prefix is ground or you want all the alternatives. Even better for this case, you could just do a list of segments or some other more advanced description of lists of segments with DCGs. I don't think using variables to represent globs is a good idea though.

I don't mean that actually using variables for globs is a good idea! I was using glob syntax to refer to the concept of a path fragment in an intuitive way. Maybe using the / operator is unwise and a little too magical, but I think it's better to use a different functor here for appending path segments versus string-building.

I don't understand double-ended-lists well enough yet to understand how that would look. It might well be a better construct here. What I think I'm describing is preferring a tree representation on the user side and turning it into a list only when necessary. This might be DCGs "held from the wrong end"; when I understand DCGs, I'll know!

I completely agree with your last paragraph, but I'd like to point out that FullPath = (BasePath / "path/to" / Filename) assumes that the internal representation is something that uses (/)/2 as a separator and strings as segments. I don't think using (/)/2 instead of lists is a good idea, for the same reasons that using (',')/2 instead of lists isn't a good idea. I also think it would be better to keep this mostly hidden, and provide predicates to work on it instead, in this case something like path_append_segments(Path, ["path", "to", Filename], FullPath).

You misunderstand me. In Python, pathlib uses / as an OS-agnostic path-joining operator. Note too that internal forward slashes get eventually normalized:

>>> import pathlib
>>> print(pathlib.PureWindowsPath("C:/") / "path/to" / "filename")
C:\path\to\filename

As a note, you also might not even notice which slashes pathlib uses if you don't look carefully. Because backslashes are unwieldy and must be escaped, Python prefers to represent them as forward slashes with repr and the platform-specific divider in str:

>>> pathlib.PureWindowsPath("C:/") / "path/to" / "filename"
PureWindowsPath('C:/path/to/filename')

May 09 '25 02:05 rotu

You misunderstand me. In Python, pathlib uses / as an OS-agnostic path-joining operator. Note too that internal forward slashes get eventually normalized:

I do understand that, I've used pathlib a lot and I think it's a great approach! I was criticizing this as an interface here in Prolog. Remember that Thing = (A / B) unifies Thing with the term /(A, B). It isn't a function that "returns" something and then gets "assigned" to Thing. Therefore this only actually works if the internal representation of the joining of two paths is literally /(A, B); no extra metadata, no conversion of the strings to a better internal thing, no rebalancing of the tree, etc...

We could use / as a predicate, like /(A, B, Thing) instead, but that seems very wonky. I also wasn't criticizing using / because it's "not OS-agnostic", but just because using a term like that to implement what is best expressed as a list is bad (I could explain why if you want, it's the same reason that [a,b,c] is better than (a,b,c)).

May 09 '25 04:05 bakaq

Therefore this only actually works if the internal representation of the joining of two paths is literally /(A,B).

Yes, the representation is redundant. So for instance (A / B) / C would be different than A / (B / C) but they'd represent the same logical path. This is a little ugly but it also allows you to avoid unnecessary copies and to instantiate parts of the path in any convenient order.

This is similar to building up a numeric expression, with the understanding that you still have to evaluate it at the end if you want a numeric answer. (One difference being, evaluation here requires platform-specific knowledge and maybe even filesystem access, if you want to resolve symlinks or tell whether a path exists).

Even if you want to normalize the path, doing so early is awkward. You can't even know the number of segments in the normalized path, since some path fragments may be . or .. or even contain a directory separator.

Am I right that you think a list is more natural because / is associative so the tree grouping is redundant? Or does it have to do with the list terminator?

May 09 '25 06:05 rotu

have a path be either a list of chars or a term with principal functor (/)/2

I understand the temptation of defaulty structures, along the lines of "so easy to compose". That's true. The cost is disproportionate though: Reasoning tasks that ought to be simple (such as: what is the next segment in the path?) become complex. A defaulty representation spreads this cost across all predicates that reason about it.
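
To illustrate the cost, here is a small sketch that flattens such a (/)/2 tree into a list of its leaves. The names tree_segments/2 and leaves//1 are hypothetical. Because a leaf can be any term, the code needs a cut to separate the cases and an explicit check for variables, which is typical for defaulty representations.

```prolog
:- use_module(library(dcgs)).

% tree_segments(+PathTerm, -Leaves): Leaves are the leaves of a path term
% built with (/)/2, from left to right.
tree_segments(T, Ls) :-
    phrase(leaves(T), Ls).

% Without the var/1 check, an unbound leaf would unify with A/B and the
% second clause would recurse forever.
leaves(T) --> { var(T) }, !, { throw(error(instantiation_error, leaves//1)) }.
leaves(A/B) --> !, leaves(A), leaves(B).
leaves(Leaf) --> [Leaf].
```

Both ("a"/"b")/"c" and "a"/("b"/"c") flatten to ["a","b","c"], but a leaf such as "path/to" stays a single element, so the result is still not a list of segments.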

library(files) is designed to use clean data structures, and avoid defaulty structures.

May 09 '25 16:05 triska

have a path be either a list of chars or a term with principal functor (/)/2

I understand the temptation of defaulty structures, along the lines of "so easy to compose". That's true. The cost is disproportionate though: Reasoning tasks that ought to be simple (such as: what is the next segment in the path?) become complex. A defaulty representation spreads this cost across all predicates that reason about it.

The question "what is the next segment in the path" is extra-logical - it CANNOT be written in an environment-agnostic way until the path is processed in an environment-sensitive manner.

In shell scripting and most programs I've written, usually you don't reason about paths; you assemble paths out of chunks, which may contain environment variables or ~ which are deferred to be expanded by the shell.

Any predicate that tries to convert the path without using something like a well-written path_to_string_lossy/2 or path_to_segments/2 probably contains mistakes or will wind up being tied to operating-system-specific assumptions (e.g. "the path is a string", "the path contains a directory name", or "the path is valid for this system").

library(files) is designed to use clean data structures, and avoid defaulty structures.

I definitely take your objection that, while this seems a tempting design, perhaps it is too cutesy by half, and a DSL may not be called for.

I have a lot to learn about Prolog, let alone writing good Prolog.

May 09 '25 18:05 rotu

The question "what is the next segment in the path" is extra-logical - it CANNOT be written in an environment-agnostic way until the path is processed in an environment-sensitive manner.

My statement pertained to the representation you suggested where this processing was already presumed or not necessary and we have a representation that makes the structure clear. A defaulty representation such as A / B where it is not clear what A and B are is not a good choice for this.

May 09 '25 18:05 triska

The question "what is the next segment in the path" is extra-logical - it CANNOT be written in an environment-agnostic way until the path is processed in an environment-sensitive manner.

My statement pertained to the representation you suggested where this processing was already presumed or not necessary and we have a representation that makes the structure clear. A defaulty representation such as A / B where it is not clear what A and B are is not a good choice for this.

I was thinking of constraining the structure. Something along the lines of:

  1. If X is a list of characters, then X is a path.
  2. If Y is (??? not sure what should go here), then os_string(Y) is a path.
  3. If X and Y are paths, then X / Y is a path.

I gather that 1 should maybe have a different principal functor, e.g. string/1 or path/1, to be less defaulty.
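
A sketch of how these three rules might look once every case has its own functor. Everything here is an assumption for illustration: the wrapper chars/1 stands in for the string/1 or path/1 idea, and the argument of os_string/1 is taken to be a list of bytes only because the thread leaves that case open.

```prolog
% is_path(+Term): Term is a path according to the three rules, with each
% case identified by its principal functor. Intended for instantiated terms.
is_path(chars(Cs)) :- char_list(Cs).
is_path(os_string(Bs)) :- byte_list(Bs).   % assumption: a list of bytes
is_path(X/Y) :- is_path(X), is_path(Y).

% char_list(+Cs): Cs is a list of single-character atoms.
char_list([]).
char_list([C|Cs]) :- atom(C), atom_length(C, 1), char_list(Cs).

% byte_list(+Bs): Bs is a list of integers between 0 and 255.
byte_list([]).
byte_list([B|Bs]) :- integer(B), B >= 0, B =< 255, byte_list(Bs).
```

With explicit wrappers, chars("tmp")/os_string([102,111,111]) is a path, while a bare "tmp"/"x" is rejected because its leaves carry no tag.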

May 09 '25 19:05 rotu

Currently, path_segments/2 uses a list where each element is a path segment. To me, a list seems an ideal representation to represent a sequence of segments. A list can be easily processed and reasoned about, combined, transformed etc. with built-in predicates. Questions such as "What is the last segment?" or "How many segments are there?" can be easily answered.
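
For instance, both questions can be answered with predicates from library(lists) alone. The helper names segments_last/2 and segments_count/2 are made up for this sketch and are not part of library(files).

```prolog
:- use_module(library(lists)).

% segments_last(+Segments, ?Last): Last is the final segment.
segments_last(Segments, Last) :-
    append(_, [Last], Segments).

% segments_count(+Segments, ?N): N is the number of segments.
segments_count(Segments, N) :-
    length(Segments, N).
```

Given the segments ["","usr","lib"], the last segment is "lib" and the count is 3, where the leading empty segment counts as a segment.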

May 09 '25 19:05 triska

Currently, path_segments/2 uses a list where each element is a path segment. To me, a list seems an ideal representation to represent a sequence of segments. A list can be easily processed and reasoned about, combined, transformed etc. with built-in predicates. Questions such as "What is the last segment?" or "How many segments are there?" can be easily answered.

I have a few issues with this. (1) The initial path segment has a different meaning than the rest. An empty initial segment is weird (maybe the solution is to admit /, C:\, etc. as initial path segments). (2) The final path segment may end with / or not. That's important in some Linux shell commands; I'm not sure whether to normalize it away or not. (3) A path segment may be "." or "..". (4) A path segment may contain a slash or backslash. Depending on interpretation, that might be an obviously illegal filename or a path segment that obviously still needs to be expanded.

May 09 '25 19:05 rotu