filepath icon indicating copy to clipboard operation
filepath copied to clipboard

Improve normalise

Open ndmitchell opened this issue 10 years ago • 17 comments

When reviewing the Shake normalise function, I realised the implementation I had was all wrong. I should backport the appropriate Shake tests to the FilePath library and see if that also needs improvements.

ndmitchell avatar Oct 19 '14 14:10 ndmitchell

I had a brief look at the tests in https://github.com/ndmitchell/shake/blob/master/Test/FilePath.hs#L33-L62

Notable discrepancy: Shake.FilePath: norm "a/." === "a" System.Filepath: normalise "a/." == "a/"

Special handling of trailing dots was added to System.FilePath in 3815668a4ad7f10f864ad39a6231674a7faf8521.

thomie avatar Oct 27 '14 19:10 thomie

Looking at the trailing slash behaviour of normalise, I'm not certain what the right answer should be. Is going from trailing slash to not too much change for normalise?

ndmitchell avatar Oct 27 '14 21:10 ndmitchell

Since normalise "a/" == "a/", I think normalise "a/." should also be "a/", not "a". I don't quite understand the Shake version, maybe it should be changed to keep the trailing slash.

thomie avatar Oct 27 '14 21:10 thomie

Quite possibly, I'd like the Shake and non-Shake versions to end up equivalent where they overlap - and some changes to the Shake version is fine.

ndmitchell avatar Oct 27 '14 21:10 ndmitchell

So I think my underlying uneasiness all comes down to what a/. should be represented as. Essentially, you can chose to preserve the presence/absence of a trailing slash (and go for a), or preserve knowledge that it's a directory (and go for a/). What worries me is we don't have a function to test if something appears to be a directory or not, but we do have functions to talk about trailing slashes - perhaps that was a design flaw at the very beginning.

In Shake I jumped for preserving the presence/absence of a trailing slash, in FilePath we seem to be preserving directory information. I'm still trying to figure out in my head which is better.

ndmitchell avatar Nov 10 '14 21:11 ndmitchell

There are lots of things here that aren't quite right, comparing to the Shake version (which also has a few minor trailing slash and drive issues). A few examples on Windows.normalise:

  • ///// is \\\\\ (literally, not escaped, so 5 instances of \) - should probably be \
  • /a: is a:, should probably be unchanged, but this breaks the idempotence - since a: renormalises to A:.
  • /a:/ is a:/ - should probably be \a:\.

I think the "right" solution is probably to rethink the whole library to be based around:

data Lexeme = Drive Char | UNC String | Separator Char | Path String

parse :: String -> [Lexeme]

display :: [Lexeme] -> String
display = concatMap $ \x -> case x of
    Drive x -> [x,':'] -- x will be an ASCII character
    UNC x -> '\\':'\\':x -- x will be a UNC name, not containing any pathSeparators
    Separator x -> x -- x will be a member of pathSeparators
    Path x -> x -- x will not contain any pathSeparators

Where we rely on the property display . parse == id. Writing normalise is then pretty trivial, and things like changing combine as per #8 would become feasible.

Does anyone (e.g. @thomie) think that seem sensible? Or a bad idea? Or too disruptive?

ndmitchell avatar Nov 21 '14 16:11 ndmitchell

Those are all invalid filepaths:

*System.FilePath.Windows> isValid "/////"
False
*System.FilePath.Windows> isValid "/a:"
False
*System.FilePath.Windows> isValid "/a:/"
False

I also get a different result for this one:

*System.FilePath.Windows> normalise "/////"
"\\\\\\\\\\"

But I like your idea. splitDrive is particularly brittle at the moment. We could take it even further: reject invalid filepaths from being parsed. Not sure what the implication would be, but can't we make filepath a wrapper around system-filepath?

thomie avatar Nov 21 '14 17:11 thomie

By \\\\\ I meant it printed \\\\\\\\\\ - I was removing the escape characters to print it literally.

Since the filepath library is one of the GHC Core libraries it's very important that we don't break reverse compatibility other than for bugs, so we'll have to support invalid paths forever (I personally think that's a good thing, in certain circumstances things which would otherwise be invalid can be useful to do some subset of operations on). We also can't depend on system-filepath, since that isn't a Core library (it also does a few things I'm not a fan of). I'm not suggesting we expose the parse/display functions or Lexeme data type.

ndmitchell avatar Nov 21 '14 17:11 ndmitchell

By \\ I meant it printed \\\ - I was removing the escape characters to print it literally.

Ok, I was just confused for some reason. Still, it's not a valid filepath, so normalise can do whatever.

Even core libraries can be changed, after an accepted proposal on the libraries list, but I see what you're saying.

It would be nice if the section "Should FilePath by an abstract data type?" from the old readme would come back, with an explanation why invalid paths can sometimes be a good thing.

thomie avatar Nov 21 '14 17:11 thomie

Agreed, I should resurrect that section, but make it up to date (before it was all about proposals, and before system-filepath existed).

ndmitchell avatar Nov 21 '14 18:11 ndmitchell

I've updated the README with an abstract filepath section. Work still needed on normalise.

ndmitchell avatar Dec 22 '15 10:12 ndmitchell

Idea is to do this after #53.

ndmitchell avatar Dec 23 '15 13:12 ndmitchell

Should C:\\\\ be normalised to C:\? Windows API (e.g. CreateFile) seems treat them the same.

Rufflewind avatar Mar 05 '17 05:03 Rufflewind

@Rufflewind - yes, probably. That said, there are plenty of holes in normalise that could do with being addressed...

ndmitchell avatar Mar 06 '17 21:03 ndmitchell

> normalise "\\??????????????\\D:\\lol"
"D:\\lol"
> normalise "\\abc\\D:\\lol"
"D:\\lol"

both paths are invalid, yet normalise makes them valid.

Maybe normalise should be a no-op for invalid paths.

hasufell avatar Nov 04 '21 18:11 hasufell

Should C:\\\\ be normalised to C:\? Windows API (e.g. CreateFile) seems treat them the same.

Just hit this issue today. Could we possibly fix at least this inconsistency, if a larger rewrite is not on cards?

Another helpful option would be at least to document known caveats with existing normalise.

Bodigrim avatar Jan 05 '23 20:01 Bodigrim

Normalise is a mess. I'm not even sure what are all the expected properties.

Rewriting it in-place isn't possible. We'll have to rename the function and deprecate the old.

hasufell avatar Jan 06 '23 00:01 hasufell