github-api Add symlink support

Describe the bug

If a symlink exists in the repository, it does not seem possible to resolve it using the current API methods.

To Reproduce

Steps to reproduce the behavior:

1. Create a repo.
2. Create a directory named dir.
3. Create a file named file in dir with the contents hello world.
4. Create a symlink dir-link pointing to dir.
5. Create a symlink file-link pointing to dir/file.
6. Use the API to try to resolve dir-link or file-link.

Expected behavior

There is an API that can be used to resolve symbolic links to the actual files.

Desktop (please complete the following information):

OS: N/A
Browser: N/A
Version: N/A

Additional context

https://issues.jenkins-ci.org/browse/JENKINS-62922

Jul 02 '20 11:07 jtnord

target is the underlying location... https://developer.github.com/v3/repos/contents/#response-if-content-is-a-symlink

Jul 02 '20 11:07 jtnord

Note: if the symlink points to a file and is in the same directory, then getContent works. If the path contains a symlinked directory, it does not.

Jul 02 '20 15:07 jtnord

The problem here is that the path ghcontent-ro/a-symlink-to-a-dir/entry-one does not actually exist in the repo. It can be reached by traversing the symlink, but as far as git and github are concerned, the path is not there.

The behavior of the API for getting repository contents is kind of all over the place:

If the path you request points to a file, you get that file's record.
If the path you request points to a directory, you get an ARRAY of file records for the files in that directory. It would be much better to get the directory's record with a "children" field containing an array.
If the path you request points to a symlink AND the target is a file, you get the target file's record.
If the path you request points to a symlink AND the target is not a file, you get the symlink's record.

The only way to use directory symlinks would be to take the path and traverse it one element at a time looking for symlinks and changing the path to request the targeted path. That would result in one request per path element which is painfully costly.

Jul 02 '20 22:07 bitwiseman

Thinking about the least costly way to do this:

1. Request a path. If success, return.
2. If 404:
   a. If parent directory is \, return 404.
   b. Request the parent directory.
   c. If parent directory is symlink, replace with target and goto 1.
   d. If parent directory is directory, return 404.
   e. If 404, set parent directory to parent of current parent, goto 2.a.

This could be done in a bisecting fashion to keep the number of requests down. Even so, the cost for any 404 would go from one request to log(n) requests - there's no way to tell if the 404 is real or caused by a symlink. That would mean every file content 404 would suddenly start causing multiple requests.
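The fallback steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's implementation: the `repo` map stands in for the remote contents API (a missing key models a 404), `Entry`, `Kind`, and `request` are hypothetical names, symlink targets are assumed to be relative to the symlink's own directory with no `..` segments, and the walk up the parents is linear rather than bisecting.

```java
import java.util.HashMap;
import java.util.Map;

class SymlinkFallback {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** Hypothetical stand-in for one contents response. */
    static final class Entry {
        final Kind kind;
        final String target; // link target, only meaningful for SYMLINK
        Entry(Kind kind, String target) { this.kind = kind; this.target = target; }
    }

    /** In-memory stand-in for the remote repo; a missing key models a 404. */
    final Map<String, Entry> repo = new HashMap<>();
    int requests = 0; // counts simulated API calls

    Entry request(String path) {
        requests++;
        return repo.get(path);
    }

    /** Returns the real path, or null when the 404 is genuine. */
    String resolve(String path) {
        for (int hops = 0; hops < 16; hops++) {      // guard against symlink loops
            if (request(path) != null) return path;  // direct hit, no extra cost
            String remapped = remapViaSymlinkedParent(path);
            if (remapped == null) return null;
            path = remapped;                         // retry with the rewritten path
        }
        return null;
    }

    private String remapViaSymlinkedParent(String path) {
        String parent = path;
        while (true) {
            int slash = parent.lastIndexOf('/');
            if (slash < 0) return null;              // ran out of parents: real 404
            parent = parent.substring(0, slash);
            Entry e = request(parent);
            if (e == null) continue;                 // parent is missing too: go one level up
            if (e.kind != Kind.SYMLINK) return null; // parent really exists: real 404
            int dirEnd = parent.lastIndexOf('/');
            String base = dirEnd < 0 ? e.target : parent.substring(0, dirEnd) + "/" + e.target;
            return base + path.substring(parent.length()); // swap the symlink for its target
        }
    }

    public static void main(String[] args) {
        SymlinkFallback gh = new SymlinkFallback();
        gh.repo.put("dir", new Entry(Kind.DIRECTORY, null));
        gh.repo.put("dir/file", new Entry(Kind.FILE, null));
        gh.repo.put("dir-link", new Entry(Kind.SYMLINK, "dir"));
        // prints "dir/file after 3 requests"
        System.out.println(gh.resolve("dir-link/file") + " after " + gh.requests + " requests");
    }
}
```

In this model a plain miss such as dir/missing costs two requests instead of one, which is the extra cost per 404 described above.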

We could make it an optional behavior, maybe a new API method.

Jul 02 '20 23:07 bitwiseman

@jtnord

Hm, doing a bit more searching, I see there's a "trees API". It doesn't traverse directory symlinks either, but it could be used to find symlinks with fewer requests. It has a recursive option that would let us quickly get a flat list of the directory tree.

https://api.github.com/repos/hub4j-test-org/GHContentIntegrationTest/git/trees/cc7e26f850339a8e8427fa2d983ca6006ad1a78c?recursive=1

That query can return a large number of records which might make it slow. It may truncate if there are too many. Truncation could be handled by querying again inside a subtree. Looks like symlinks are blobs just like files, but they have a different mode.

This would reduce the added cost for general 404s to a much smaller number of queries, probably only 1 or 2 in most cases. That wouldn't be so bad. Traversing to symlinks would be the same cost - an initial 404 followed by a remapping. The tree could be cached on the GHRepository instance maybe?
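A rough sketch of the tree-based remapping, assuming the recursive tree listing has already been fetched and flattened into a path-to-mode map. Git records symbolic links with mode 120000. The class name, the `modes` and `targets` maps, and the idea that each symlink's target has already been read from its blob are all assumptions made for illustration; targets are again taken to be relative to the symlink's directory with no `..` segments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TreeSymlinkResolver {
    static final String SYMLINK_MODE = "120000"; // git's file mode for symbolic links

    /** path -> mode, as in a flattened recursive tree listing. */
    final Map<String, String> modes = new LinkedHashMap<>();
    /** symlink path -> target; assumed to have been read from each symlink's blob. */
    final Map<String, String> targets = new LinkedHashMap<>();

    /** Returns the symlink-free path if it exists in the tree, else null. */
    String resolve(String path) {
        for (int hops = 0; hops < 16; hops++) {      // guard against symlink loops
            String next = followFirstSymlink(path);
            if (next == null) return modes.containsKey(path) ? path : null;
            path = next;
        }
        return null;
    }

    /** Rewrites the shortest prefix of path that is a symlink; null if there is none. */
    private String followFirstSymlink(String path) {
        int end = path.indexOf('/');
        while (true) {
            String prefix = end < 0 ? path : path.substring(0, end);
            if (SYMLINK_MODE.equals(modes.get(prefix))) {
                String target = targets.get(prefix);
                if (target == null) return null;     // target not loaded: leave the path alone
                int slash = prefix.lastIndexOf('/');
                String base = slash < 0 ? target : prefix.substring(0, slash) + "/" + target;
                return end < 0 ? base : base + path.substring(end);
            }
            if (end < 0) return null;                // checked every prefix, no symlink found
            end = path.indexOf('/', end + 1);
        }
    }

    public static void main(String[] args) {
        TreeSymlinkResolver tree = new TreeSymlinkResolver();
        tree.modes.put("dir", "040000");
        tree.modes.put("dir/file", "100644");
        tree.modes.put("dir-link", SYMLINK_MODE);
        tree.targets.put("dir-link", "dir");
        System.out.println(tree.resolve("dir-link/file")); // prints "dir/file"
    }
}
```

Once the listing is in memory, the remapping itself needs no further requests; the remaining cost is fetching the tree and reading symlink targets.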

Still probably not something we'd want turned on by default, but I'm open to discussion.

Jul 02 '20 23:07 bitwiseman

> If the path you request points to a symlink AND the target is a file, you get the target file's record.

iff the target file exists within the bounds of the repo, otherwise the symlink record :)

I concur that doing this by default in GHRepository.getFileContent(String) is probably not the best use of API token calls due to rate limiting. But I think another method that callers can opt into, GHRepository.getFileContent(String path, boolean traverseSymLinks), could be useful. The former could call the latter with a default of false, maybe set by a global static/system property.
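One possible shape for that opt-in overload, as a sketch only. `ContentLookup`, the property name `example.traverseSymLinks`, and the two function fields are hypothetical stand-ins; returning null models a 404 here, and contents are plain strings rather than the library's content objects.

```java
import java.util.function.UnaryOperator;

class ContentLookup {
    // Hypothetical property name; the proposal only mentions "a global static/system property".
    static final String TRAVERSE_PROPERTY = "example.traverseSymLinks";

    private final UnaryOperator<String> plainFetch;      // stand-in for the existing single-request lookup
    private final UnaryOperator<String> symlinkResolver; // stand-in for any path remapping strategy

    ContentLookup(UnaryOperator<String> plainFetch, UnaryOperator<String> symlinkResolver) {
        this.plainFetch = plainFetch;
        this.symlinkResolver = symlinkResolver;
    }

    /** Existing entry point: behaves as before unless the property is set to "true". */
    String getFileContent(String path) {
        return getFileContent(path, Boolean.getBoolean(TRAVERSE_PROPERTY));
    }

    /** Opt-in variant: pays for extra lookups only after a miss. */
    String getFileContent(String path, boolean traverseSymLinks) {
        String content = plainFetch.apply(path);
        if (content != null || !traverseSymLinks) return content;
        String realPath = symlinkResolver.apply(path);   // null means the 404 is genuine
        return realPath == null ? null : plainFetch.apply(realPath);
    }
}
```

Callers that never pass true, and never set the property, keep the current one-request behavior.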

Jul 03 '20 10:07 jtnord

@jtnord That sounds reasonable.

Also, instead of caching at the object level, we could depend on okhttp caching to reduce rate limit usage while also accurately updating if the tree updates.

Jul 06 '20 19:07 bitwiseman

See #878 for some related discussion around GHTree interactions.

Jul 07 '20 19:07 bitwiseman

Sorry for the bother but perhaps you can help me out here. Do you guys know if symlinks can be made to work on GitHub?

That is, could requests to raw.githubusercontent.com be forwarded to the symlink target? For example, say I have a repo with a folder called images, a symlink called logos pointing to images, and some img tags with an unchangeable src:

<img src="https://raw.githubusercontent.com/<user>/<repo>/main/logos/<some-file>.jpg" />

Currently the URL returns "404 not found". Is it possible to return the symlinked file?

Mar 22 '21 14:03 janosh

@janosh Based on the discussion above, yes, it can be done, but it would need to be a separate method/option because the behavior requires multiple API calls.

PRs welcome. I'd be happy to answer any questions you have for how to implement the solution suggested above.

Mar 22 '21 17:03 bitwiseman

Sorry, I'm not a Java dev. I contacted GitHub support about the possibility of making this symlink forwarding native functionality. Will report back if anything comes of that.

Mar 23 '21 08:03 janosh