rascal
rascal copied to clipboard
Extend `ISourceLocationInput::list` with information if it's a directory or a file
Is your feature request related to a problem? Please describe.
Currently the list
feature only returns a array of strings (the entries), however if we would return a list of [string, filetype]
the REPL would be quicker in autocompleting locations. see: BaseRascalREPL.java:450. It's also relevent for the recursive watch feature.
Describe the solution you'd like
Change list()
to return both the string and the file type.
Note that most api's to read a directory tend to have this information availalbe.
Is there any evidence that simply calling isDirectory is not just as fast? the Java file system API needs a separate call as well (we call String[] File.list()
at the moment). Which other providers would be able to optimize on this?
Jar and zip come to mind. Also the vscode JSON interface would benefit. We have had to build a specific cache for zips and jars just to make this a bit fast.
File::listFiles() can used, that has both information in there as well (it returns a array of File
s).
Performance wise, the native API gives you that information when asking the list entry of a directory. So in essence, some features (like crawl/recursive watch/repl auto complete) can get a more optimal implementation by reducing roundtrips to the resolver. For VSCode I'm now adding again a cache around this, since it's just noticeably slow.
hmm, reading some source code of the java standard library, it doesn't appear to offer that performance benefit for the regular File
class. it's all separate calls to the kernel.
- If this has benefits, then it means we have a slow resolver anyway (let's say in an unindexed jar file), and we'd better resolve that then hide it behind this new tupling mechanism.
- we can not/should not change the API of
list
, but rather have to add a core API method - but I'd rather keep the core API for the resolvers and the registry as small as possible.
If this has benefits, then it means we have a slow resolver anyway (let's say in an unindexed jar file), and we'd better resolve that then hide it behind this new tupling mechanism.
Can you explain what you mean? There are a bit too many negations here.
Most cases of our slow resolvers are fixed by caches (I just had to build another one for vscode VFS). Looking at the traces, in our code make certain assumptions about crawling directories, that an isDirectory
is cheap for example. Or similarly, instead of having a single method that gives you the stat
of a file/directory, we have 4 functions that return: size/modification/access/filetype. If we want to reduce IO operations, these are candidates.
I logged it as an improvement. Not necessarily a blocker right now, since another cache solved it, but the API has some limitations that we could explore.e
- what I mean is that if we have missed implementing a cache then we should fix that instead of tupling list and isDirectory.
- if stat and list need to be optimized, then this is a different project with a big impact on the source code in Rascal and in Java, so definitely something not for this PR.
- stat was flattened because we did not see a good way to implement the entire stat function for all providers always and "right now". Merging all features into one stat function would have generated a data-structure with optionals or nulls to represent the missing data. So that's the reason we flattened it out. It was never an issue before, most people need only one function at a time apparantly.
I just want to point out, that this is not a PR, this is an issue, aimed for this discussion.
:-) good point! got confused.
Unless we have measured this to be a real performance issue, I'd rather close it and focus on other things. Makes sense?
Well, it is a real performance issue, but I solved it with another cache. I'm fine with that, but if we want to have less caches, this interface has to change. But if you rather not go that direction, I'm okay with closing it.
Personally I would prefer to drop all those caches and make the IO stuff a bit richer, such that crawl and friends can be faster. But if that's not a current focus, we can come back later.
I think we have to think about this slowly and carefully. Out loud:
- The JSON-bridged filesystems are a thing on their own but they should not complicate all the other filesystems. Java
File.list
does not offer the tupling of stat features, neither does IResource.list in Eclipse. So the most bare-boned file interfaces don't have it. On that level, when you have the file handle already, it does not seem to be an issue. - On the other hand, we always have
resolve
everywhere before we have a hard handle on a file and before we can do something useful, and that is a problem everywhere in our system, not just between calls tolist
andisDirectory
. - URI resolution is inside of the black boxes of
ISourceLocationInputResolver
implementations.
So my proposal would be to look at how we can keep the results of resolution
between functions called on URIResolverRegistry
and ISourceLocationInputResolver
on the same location quickly after each other. That would service more API methods than just list and crawl, and it might prevent changing very old and stable API.
We'd have to think how to do this without breaking the function
abstraction in Rascal IO.