Add scheme-dependent and file-system dependent URI normalization to URIResolverRegistry
Is your feature request related to a problem? Please describe.
There are many reasons for aliases in source locations:
- soft and hard links (symlinks)
- repeated mounts of the same filesystem
- case insensitivity
- etc.
These are semantic properties of file systems, not syntactic ones. You need an actively running file system with the file on it to know what the aliases are and how they might be normalized.
Loc aliases are detrimental to downstream analysis in Rascal because locs are almost always used
as identities.
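To illustrate why this cannot be decided from the text of a location alone, here is a minimal Java sketch. The class and method names (`AliasDemo`, `areAliases`, `symlinkExample`) are made up for this example and are not part of Rascal. It assumes a file system and user account that allow creating symbolic links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Rascal.
class AliasDemo {
    /** True when both paths resolve to the same real file on the running file system. */
    static boolean areAliases(Path a, Path b) {
        try {
            // toRealPath consults the file system and resolves symbolic links.
            return a.toRealPath().equals(b.toRealPath());
        } catch (IOException e) {
            // One of the paths does not exist, so the question cannot be answered.
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a file and a symlink to it; returns {same URI?, same file?}. */
    static boolean[] symlinkExample() {
        try {
            Path dir = Files.createTempDirectory("alias-demo");
            Path target = Files.createFile(dir.resolve("A.txt"));
            // May fail on platforms or accounts without symlink support.
            Path link = Files.createSymbolicLink(dir.resolve("link.txt"), target);
            boolean sameUri = target.toUri().equals(link.toUri());
            return new boolean[] { sameUri, areAliases(target, link) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = symlinkExample();
        System.out.println("same URI: " + r[0] + ", same file: " + r[1]);
    }
}
```

The two URIs differ as strings, yet both name the same file. Only a query against the live file system reveals that.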
Describe the solution you'd like
I'd like an additional method on URIResolverRegistry: normalize(ISourceLocation x),
implemented by dispatching via the scheme to ISourceLocationInput::normalize(ISourceLocation x),
and made available to Rascal users as loc Location::normalize(loc l).
This way users can fix possible aliasing issues easily, without having to consider every way in which files could be aliases. They are also not forced to use it.
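A rough sketch of the dispatch, in plain Java so that it runs stand-alone. `SchemeNormalizer` and `MiniRegistry` are hypothetical stand-ins for ISourceLocationInput and URIResolverRegistry, and `java.net.URI` stands in for ISourceLocation. None of this is the actual Rascal API.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the per-scheme resolver interface.
interface SchemeNormalizer {
    URI normalize(URI loc);
}

// Hypothetical stand-in for the registry, reduced to the proposed dispatch.
class MiniRegistry {
    private final Map<String, SchemeNormalizer> byScheme = new ConcurrentHashMap<>();

    void register(String scheme, SchemeNormalizer normalizer) {
        byScheme.put(scheme, normalizer);
    }

    /** Dispatches on the scheme; locations of unregistered schemes are returned unchanged. */
    URI normalize(URI loc) {
        String scheme = loc.getScheme();
        SchemeNormalizer n = scheme == null ? null : byScheme.get(scheme);
        return n == null ? loc : n.normalize(loc);
    }
}
```

For example, a toy "demo" scheme (an assumption for illustration only) that treats names as case-insensitive could be plugged in like this:

```java
MiniRegistry reg = new MiniRegistry();
reg.register("demo", loc -> URI.create(loc.toString().toLowerCase(java.util.Locale.ROOT)));
URI a = reg.normalize(URI.create("demo:///Foo/BAR.txt"));
URI b = reg.normalize(URI.create("demo:///foo/bar.txt"));
// a and b are now equal, so they can safely be used as identities.
```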
Maybe normalize should also replace logical schemes by physical schemes (since that is also a source of aliases). But the jury is still out on this.
Describe alternatives you've considered
There is something to be said for normalizing at location creation time. However, there is not always
a file system available to normalize against, so this is not possible in general. It's better to let source locations remain
purely syntactic, and leave it to a normalize function to deal with the semantics of aliases.
Additional context
Typically people run into these things with case-insensitive file systems, but there are many ways to alias files. The more we use Rascal for IO, on different systems with different OSes and file systems, the more often we run into these issues.
Bad news
Implementing normalization is a lot of detailed research work for each scheme.
Good news
We might implement a default that does nothing, and then add normalization incrementally, scheme by scheme. If we start with the file scheme, we quickly saturate at about 80% of all scheme usage.
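As a starting point for the file scheme, something along these lines might work. This is only a sketch: `FileSchemeNormalizer` is a hypothetical name, `java.net.URI` stands in for ISourceLocation, and it relies on `Path.toRealPath()`, which resolves symbolic links. Whether that call also canonicalizes letter case depends on the platform and JDK.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical normalizer for the file scheme only.
class FileSchemeNormalizer {
    /** Returns the real location for existing files; anything else is returned unchanged. */
    static URI normalize(URI loc) {
        if (!"file".equals(loc.getScheme())) {
            return loc; // other schemes: do nothing
        }
        try {
            Path p = Paths.get(loc);
            if (!Files.exists(p)) {
                return loc; // nothing on disk to normalize against
            }
            return p.toRealPath().toUri();
        } catch (IOException | IllegalArgumentException | FileSystemNotFoundException e) {
            // The location cannot be mapped to a path or cannot be read: keep it as-is.
            return loc;
        }
    }
}
```

Falling back to the unchanged location keeps the function total, so callers can apply it to any loc, including ones that do not (yet) exist on disk.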