backstage
backstage copied to clipboard
Backstage doesn't recognize branches that contain forward slashes
Backstage relies on git-url-parse to parse URLs. This library has a bug that causes it to incorrectly parse URLs that point to branches with forward slashes in them.
Expected Behavior
I expect to be able to use branches with /
because it's a useful way to organized branches into groups.
Current Behavior
If I attempt to create a template from a branch that uses a /
in its name, the operation will fail because Backstage (via git-url-parse) will assume that I'm trying to get code from branch called tm
. It will ignore everything past first /
.
Possible Solution
🤷♀️
Steps to Reproduce
- Try to create a template from https://github.com/spotify/backstage/blob/dependabot/npm_and_yarn/testing-library/jest-dom-5.11.4/plugins/scaffolder-backend/sample-templates/create-react-app/template.yaml
Context
Some of the projects I'm involved in require that each team member prefixes their branch with their initials. It's a habit at this point to start every branch name with tm/
.
Oof - thanks for reporting, of course this should work and I always do the same.
That library has been a bit less than stellar, I wonder if there are alternatives. Then again, just looking at the url in your example, it looks hard to deduce where the ref ends and the path begins...
I created a runkit that has every GitHub URL parsing library that I could find. None of them support properly detecting the branch name. We might need a completely different strategy for doing this.
https://github.com/spotify/backstage/blob/blob/blob/blob/blob/blob/blob.txt
There is actually a bit of method to this madness though. Git itself has a limitation that you can only create branches that can be represented in the filesystem. For example, you can't have both the branch a/b
and a/b/c
, because when you create a/b
b
is a text file on disk, and therefore you can't create a/b/c
since b
would have to be a folder for that.
With that in mind it means there's at least a deterministic way to interpret the URL, the annoying bit is that it includes using the GitHub
API to discover the branches that are present in the repo.
the annoying bit is that it includes using the GitHub API to discover the branches that are present in the repo
I was thinking the same thing.
There could be another approach that could work but I don't know how universal it wold be.
GitHub website doesn't seem to care about extra forward slashes. We could for example use //
forward slash to represent beginning and end of a ref. It would also be annoying, but could be better than auto detecting because the user is expressing their intention by adding the extra forward slash.
For example, 'https://github.com/spotify/backstage/blob//dependabot/npm_and_yarn/testing-library/jest-dom-5.11.4//package.json' still leads to the same file on the GitHub website.
@taras tbh we want users to be able to just copy the address from the address bar. Otherwise we could just go with our own shorter format too
true true. I'm going to email a friend at Microsoft to see if he can connect us with someone at GitHub. I want to find out from them if there is a way that they'd recommend.
Like I commented in https://github.com/IonicaBizau/git-url-parse/pull/114#issuecomment-706091677, I don't think there is a way to split the url itself into branch name and filename... Pinging @izuzak here just in case he has any ideas how this can be improved. :sparkle:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@mtlewis and I spent a hot second looking at the code in question. As mentioned, there's not enough information in the URL to tell branch name versus path when the branch has a slash. I didn't find any alternative URL formats that work with GitHub where this might be possible.
This code could have some additional logic, though. I think we could either:
- List all branches for the repo to match a substring
- Catch the 404 and retry, expanding the branch name to the next
/
Since (1) could suffer on a repo with a ton of branches, and could have false positives, the second seems preferable to me. We could retry N times, expanding to the next slash each time, where N is a fairly low number.
One issue iirc is that there's a lot of code that just uses git-url-parse or things like resolveUrl that aren't even async in the first place so there's very little room for doing smart or retrying or remote-querying things. Feels to me like it unfortunately would take a big refactoring to make it good.
Evening, I'm not a developer but doing loops to attempt to work out if its a branch or a path seems inelegant.
In other projects I've seen the URL is normally constructed for git repos with parameters that get extracted to let the client libs know what the branch is;
For example;
Terraform
git::https://example.com/vpc.git?ref=v1.2.0
Helmfile
name: remote-repo url: git+https://github.com/repo/test@deploy/helm?ref=master
kustomize
github.com/kubernetes-sigs/kustomize/examples/multibases?ref=v1.0.6
@eperper thanks for those pointers - interesting. One thing to take into account here is convenience of copy+paste as mentioned by @Rugvip above. Having to form a custom url might be error prone. That being said, I think that approach can be worth considering nonetheless, especially if it's an "alternative form" that you can use only when necessary or if it's easily available somehow in the web interface besides the address bar of the browser.
Hi guys, what's a status on this one? Unfortunately it's a TechDocs show stopper for us.. 😢
Sorry but this is still a current issue. There's some discussion in the comments above. Having unencoded slashes in the URL makes it super problematic to parse and understand GitHub URLs - there's no surefire way to actually know where the branch ends and the path begins, without resorting to hammering the GitHub API with queries to resolve the situation - and this needs to be done in a lot of places where the code isn't even async ready, but rather expects to be able to do that parsing instantly and locally.
@freben I like the idea of "alternative form" which would solve the problem but didn't require any groundbreaking changes. I like format where ref
is supplied as a query string. That solution would make it extensible also for other information in future - if needed.
Do you guys at @backstage prefer to implement this yourself? Or is this okay to be done by a community? Do you have any requirements & tips how you would like this to be implemented (and where)?
@FilipPyrek Yeah that's probably the most viable way forward that we have seen so far, at least. We could make sure to implement support for the alternative form as well as the original native form - we do want to keep the convenience of copy+paste from the browser.
There's TONS of direct usages of git-url-parse
across the code base. If we went for this strategy, we would have to rewrite them all to use some wrapper function e.g. in @backstage/integration
instead. There's also probably a fair bit of code that does direct string manipulation, or does new URL(str).pathname.split('/')
and similar trickery. So yeah, there will be a bunch of chasing down code that isn't up to par with the new requirement.
And to answer your question, we'd appreciate help with it.
Understood and totally agree @freben.
I will see what available capacity we will have and if somebody from us will start working on that, I will let you guys know. 👍
Hey guys, dn if the URLPattern API can help.
Unfortunately I think it can't. The problem is that the branch part is unencoded, and there is no way to - even theoretically - be sure where the branch ends and the path begins without making API calls to the remote. I personally think that adopting something like the kustomize format as a second supported form of url would be the cleanest way forward.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
De-stale-inating. This is still a bug. We had someone else run into it again today.
Also ran into this recently while utilizing the validation endpoint. Feeding branches with forward slashes into the location field will break the validator (if the resulting branch doesn't exist).
E.G: Your branch name is test/something
You feed this branch into the location field, and it will try to find the location at the test
branch instead of test/something
.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi everyone. I faced with the same problem for GitLab. And I offer the foloowing solution
- Add to integration config option to use only root
catalog-info.yaml
placement - If that option is enabled we parse all path before the
catalog-info.yaml
as branch name - If not we have current behavior
For example we have following gitlab url https://gitlabcom/group/subgroup/project/-/blob/branch_a/branch_b/catalog-info.yaml
we need to fix that line to get in file path only file name
https://github.com/backstage/backstage/blob/30314172ed18b3c10f49a14a88d35270b69041c2/packages/integration/src/gitlab/core.ts#L80
Hi guys! I work on solution suggested me above. This solution is so bad ((( I implement another one solution Main idea is check by existing branches in repository. When we create url to get source we 1/ get branch name as is in current code 2/ get all branches from repo which names start with name got on step 1 3/ match path to getted branch names I test this solution in my company and it works I add code below. I need review for that idea. If it's ok I add this changes in integration backend
// Converts
// from: https://gitlab.com/groupA/teams/teamA/subgroupA/repoA/-/blob/branch/filepath
// to: https://gitlab.com/api/v4/projects/projectId/repository/files/filepath?ref=branch
export async function buildProjectUrl(
target: string,
projectID: Number,
config: GitLabIntegrationConfig,
): Promise<URL> {
try {
const url = new URL(target);
const branchAndFilePath = url.pathname
.split('/blob/')
.slice(1)
.join('/blob/');
let fileName: string;
let branch: string;
const [br, ...filePath] = branchAndFilePath.split('/');
let repoBranches = await getProjectBranches(target, "^" + br, projectID, config);
let findedBranch = checkExistingBranches(repoBranches, branchAndFilePath);
if (findedBranch != undefined) {
let path = branchAndFilePath.replace(findedBranch, "");
fileName = path.split('/').filter(p => p).join("/");
branch = findedBranch;
} else {
branch = br;
fileName = filePath.join('/');
}
const relativePath = getGitLabIntegrationRelativePath(config);
url.pathname = [
...(relativePath ? [relativePath] : []),
'api/v4/projects',
projectID,
'repository/files',
encodeURIComponent(decodeURIComponent(fileName)),
'raw',
].join('/');
url.search = `?ref=${branch}`;
return url;
} catch (e) {
throw new Error(`Incorrect url: ${target}, ${e}`);
}
}
// Get all branches started with some-name
export async function getProjectBranches(
target: string,
branchSearchTerm: string,
projectID: Number,
config: GitLabIntegrationConfig,
): Promise<Array<string>> {
const url = new URL(target);
if (!url.pathname.includes('/blob/')) {
throw new Error('Please provide full path to yaml file from GitLab');
}
try {
// Get gitlab relative path
const relativePath = getGitLabIntegrationRelativePath(config);
// Convert
// to: https://gitlab.com/api/v4/projects/groupA%2Fteams%2FsubgroupA%2FteamA%2Frepo
const repoIDLookup = new URL(
`${url.origin}${relativePath}/api/v4/projects/${projectID}/repository/branches?search=${encodeURIComponent(
branchSearchTerm,
)}`,
);
const response = await fetch(
repoIDLookup.toString(),
getGitLabRequestOptions(config),
);
const data = await response.json();
if (!response.ok) {
throw new Error(
`GitLab Error '${data.error}', ${data.error_description}`,
);
}
let branches = data as Array<any>;
return branches.map(b => b.name);
} catch (e) {
throw new Error(`Could not get GitLab project ID for: ${target}, ${e}`);
}
}
@ferben @Rugvip guys need a feedback
@abogdanov37 to be honest I think we want to avoid any solution that adds an extra request to the URL reading. This problem seems to still cause less trouble than rate limits, which we would start hitting a lot sooner if we add in another request. That's overall why this is such a tricky problem, there's no solution that doesn't add an extra request that we know of.
We discussed this a bit more today, and we're thinking that the best approach to this would be to parse the URL during registration in the catalog, and along with that have our own internal URL format that is able to carry the information. That's also a good opportunity to standardize URL parsing and handling across all different SCM integrations. It's a significant chunk of work though, and unless someone want to pick that up and drive it all the way, it's not something I feel is worth prioritizing over other work at this point.
I think that if you're going to be using a specific branch to reference a github entity I think that it would be entirely reasonable to just ask for and store the branch name and potential project path separately from the github project slug. e.g. just have those as fields on the catalog entity.
@abogdanov37 to be honest I think we want to avoid any solution that adds an extra request to the URL reading. This problem seems to still cause less trouble than rate limits, which we would start hitting a lot sooner if we add in another request. That's overall why this is such a tricky problem, there's no solution that doesn't add an extra request that we know of.
We discussed this a bit more today, and we're thinking that the best approach to this would be to parse the URL during registration in the catalog, and along with that have our own internal URL format that is able to carry the information. That's also a good opportunity to standardize URL parsing and handling across all different SCM integrations. It's a significant chunk of work though, and unless someone want to pick that up and drive it all the way, it's not something I feel is worth prioritizing over other work at this point.
Hi @Rugvip! Thanks for a feedback! I understand your concern. Idea with internal url format seems good but we have no time to to wait solution and have no resources to do it (((
We can implement the same behavior without addition request bu we should have in config list of branches containing /
. As for me it's not so good because we should manage this list by hand. But may be it helps you some how.
For next step I try to create and public package for Backstage with my solution. I hope it helps some one.