rascal icon indicating copy to clipboard operation
rascal copied to clipboard

Extend `ISourceLocationInput::list` with information if it's a directory or a file

Open DavyLandman opened this issue 2 years ago • 14 comments

Is your feature request related to a problem? Please describe. Currently the list feature only returns a array of strings (the entries), however if we would return a list of [string, filetype] the REPL would be quicker in autocompleting locations. see: BaseRascalREPL.java:450. It's also relevent for the recursive watch feature.

Describe the solution you'd like Change list() to return both the string and the file type.

Note that most api's to read a directory tend to have this information availalbe.

DavyLandman avatar May 16 '22 13:05 DavyLandman

Is there any evidence that simply calling isDirectory is not just as fast? the Java file system API needs a separate call as well (we call String[] File.list() at the moment). Which other providers would be able to optimize on this?

jurgenvinju avatar May 16 '22 15:05 jurgenvinju

Jar and zip come to mind. Also the vscode JSON interface would benefit. We have had to build a specific cache for zips and jars just to make this a bit fast.

DavyLandman avatar May 16 '22 17:05 DavyLandman

File::listFiles() can used, that has both information in there as well (it returns a array of Files).

DavyLandman avatar May 16 '22 19:05 DavyLandman

Performance wise, the native API gives you that information when asking the list entry of a directory. So in essence, some features (like crawl/recursive watch/repl auto complete) can get a more optimal implementation by reducing roundtrips to the resolver. For VSCode I'm now adding again a cache around this, since it's just noticeably slow.

DavyLandman avatar May 16 '22 19:05 DavyLandman

hmm, reading some source code of the java standard library, it doesn't appear to offer that performance benefit for the regular File class. it's all separate calls to the kernel.

DavyLandman avatar May 16 '22 20:05 DavyLandman

  • If this has benefits, then it means we have a slow resolver anyway (let's say in an unindexed jar file), and we'd better resolve that then hide it behind this new tupling mechanism.
  • we can not/should not change the API of list, but rather have to add a core API method
  • but I'd rather keep the core API for the resolvers and the registry as small as possible.

jurgenvinju avatar May 17 '22 08:05 jurgenvinju

If this has benefits, then it means we have a slow resolver anyway (let's say in an unindexed jar file), and we'd better resolve that then hide it behind this new tupling mechanism.

Can you explain what you mean? There are a bit too many negations here.

Most cases of our slow resolvers are fixed by caches (I just had to build another one for vscode VFS). Looking at the traces, in our code make certain assumptions about crawling directories, that an isDirectory is cheap for example. Or similarly, instead of having a single method that gives you the stat of a file/directory, we have 4 functions that return: size/modification/access/filetype. If we want to reduce IO operations, these are candidates.

I logged it as an improvement. Not necessarily a blocker right now, since another cache solved it, but the API has some limitations that we could explore.e

DavyLandman avatar May 17 '22 08:05 DavyLandman

  • what I mean is that if we have missed implementing a cache then we should fix that instead of tupling list and isDirectory.
  • if stat and list need to be optimized, then this is a different project with a big impact on the source code in Rascal and in Java, so definitely something not for this PR.
  • stat was flattened because we did not see a good way to implement the entire stat function for all providers always and "right now". Merging all features into one stat function would have generated a data-structure with optionals or nulls to represent the missing data. So that's the reason we flattened it out. It was never an issue before, most people need only one function at a time apparantly.

jurgenvinju avatar May 17 '22 09:05 jurgenvinju

I just want to point out, that this is not a PR, this is an issue, aimed for this discussion.

DavyLandman avatar May 17 '22 09:05 DavyLandman

:-) good point! got confused.

jurgenvinju avatar May 17 '22 09:05 jurgenvinju

Unless we have measured this to be a real performance issue, I'd rather close it and focus on other things. Makes sense?

jurgenvinju avatar May 17 '22 09:05 jurgenvinju

Well, it is a real performance issue, but I solved it with another cache. I'm fine with that, but if we want to have less caches, this interface has to change. But if you rather not go that direction, I'm okay with closing it.

Personally I would prefer to drop all those caches and make the IO stuff a bit richer, such that crawl and friends can be faster. But if that's not a current focus, we can come back later.

DavyLandman avatar May 17 '22 09:05 DavyLandman

I think we have to think about this slowly and carefully. Out loud:

  • The JSON-bridged filesystems are a thing on their own but they should not complicate all the other filesystems. Java File.list does not offer the tupling of stat features, neither does IResource.list in Eclipse. So the most bare-boned file interfaces don't have it. On that level, when you have the file handle already, it does not seem to be an issue.
  • On the other hand, we always have resolve everywhere before we have a hard handle on a file and before we can do something useful, and that is a problem everywhere in our system, not just between calls to list and isDirectory.
  • URI resolution is inside of the black boxes of ISourceLocationInputResolver implementations.

So my proposal would be to look at how we can keep the results of resolution between functions called on URIResolverRegistry and ISourceLocationInputResolver on the same location quickly after each other. That would service more API methods than just list and crawl, and it might prevent changing very old and stable API.

jurgenvinju avatar May 17 '22 09:05 jurgenvinju

We'd have to think how to do this without breaking the function abstraction in Rascal IO.

jurgenvinju avatar May 17 '22 09:05 jurgenvinju