backstage icon indicating copy to clipboard operation
backstage copied to clipboard

Backstage doesn't recognize branches that contain forward slashes

Open taras opened this issue 4 years ago • 55 comments

Backstage relies on git-url-parse to parse URLs. This library has a bug that causes it to incorrectly parse URLs that point to branches with forward slashes in them.

Expected Behavior

I expect to be able to use branches with / because it's a useful way to organized branches into groups.

Current Behavior

If I attempt to create a template from a branch that uses a / in its name, the operation will fail because Backstage (via git-url-parse) will assume that I'm trying to get code from branch called tm. It will ignore everything past first /.

Possible Solution

🤷‍♀️

Steps to Reproduce

  1. Try to create a template from https://github.com/spotify/backstage/blob/dependabot/npm_and_yarn/testing-library/jest-dom-5.11.4/plugins/scaffolder-backend/sample-templates/create-react-app/template.yaml

Context

Some of the projects I'm involved in require that each team member prefixes their branch with their initials. It's a habit at this point to start every branch name with tm/.

taras avatar Oct 08 '20 20:10 taras

Oof - thanks for reporting, of course this should work and I always do the same.

That library has been a bit less than stellar, I wonder if there are alternatives. Then again, just looking at the url in your example, it looks hard to deduce where the ref ends and the path begins...

freben avatar Oct 08 '20 21:10 freben

I created a runkit that has every GitHub URL parsing library that I could find. None of them support properly detecting the branch name. We might need a completely different strategy for doing this.

taras avatar Oct 09 '20 10:10 taras

https://github.com/spotify/backstage/blob/blob/blob/blob/blob/blob/blob.txt

There is actually a bit of method to this madness though. Git itself has a limitation that you can only create branches that can be represented in the filesystem. For example, you can't have both the branch a/b and a/b/c, because when you create a/b b is a text file on disk, and therefore you can't create a/b/c since b would have to be a folder for that.

With that in mind it means there's at least a deterministic way to interpret the URL, the annoying bit is that it includes using the GitHub API to discover the branches that are present in the repo.

Rugvip avatar Oct 09 '20 10:10 Rugvip

the annoying bit is that it includes using the GitHub API to discover the branches that are present in the repo

I was thinking the same thing.

There could be another approach that could work but I don't know how universal it wold be.

GitHub website doesn't seem to care about extra forward slashes. We could for example use // forward slash to represent beginning and end of a ref. It would also be annoying, but could be better than auto detecting because the user is expressing their intention by adding the extra forward slash.

For example, 'https://github.com/spotify/backstage/blob//dependabot/npm_and_yarn/testing-library/jest-dom-5.11.4//package.json' still leads to the same file on the GitHub website.

taras avatar Oct 09 '20 10:10 taras

@taras tbh we want users to be able to just copy the address from the address bar. Otherwise we could just go with our own shorter format too

Rugvip avatar Oct 09 '20 11:10 Rugvip

true true. I'm going to email a friend at Microsoft to see if he can connect us with someone at GitHub. I want to find out from them if there is a way that they'd recommend.

taras avatar Oct 09 '20 11:10 taras

Like I commented in https://github.com/IonicaBizau/git-url-parse/pull/114#issuecomment-706091677, I don't think there is a way to split the url itself into branch name and filename... Pinging @izuzak here just in case he has any ideas how this can be improved. :sparkle:

IonicaBizau avatar Oct 09 '20 11:10 IonicaBizau

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 29 '21 12:03 stale[bot]

@mtlewis and I spent a hot second looking at the code in question. As mentioned, there's not enough information in the URL to tell branch name versus path when the branch has a slash. I didn't find any alternative URL formats that work with GitHub where this might be possible.

This code could have some additional logic, though. I think we could either:

  1. List all branches for the repo to match a substring
  2. Catch the 404 and retry, expanding the branch name to the next /

Since (1) could suffer on a repo with a ton of branches, and could have false positives, the second seems preferable to me. We could retry N times, expanding to the next slash each time, where N is a fairly low number.

timbonicus avatar Jul 08 '21 20:07 timbonicus

One issue iirc is that there's a lot of code that just uses git-url-parse or things like resolveUrl that aren't even async in the first place so there's very little room for doing smart or retrying or remote-querying things. Feels to me like it unfortunately would take a big refactoring to make it good.

freben avatar Jul 08 '21 21:07 freben

Evening, I'm not a developer but doing loops to attempt to work out if its a branch or a path seems inelegant.

In other projects I've seen the URL is normally constructed for git repos with parameters that get extracted to let the client libs know what the branch is;

For example;

Terraform git::https://example.com/vpc.git?ref=v1.2.0 Helmfile name: remote-repo url: git+https://github.com/repo/test@deploy/helm?ref=master

kustomize github.com/kubernetes-sigs/kustomize/examples/multibases?ref=v1.0.6

eperdeme avatar Jul 08 '21 21:07 eperdeme

@eperper thanks for those pointers - interesting. One thing to take into account here is convenience of copy+paste as mentioned by @Rugvip above. Having to form a custom url might be error prone. That being said, I think that approach can be worth considering nonetheless, especially if it's an "alternative form" that you can use only when necessary or if it's easily available somehow in the web interface besides the address bar of the browser.

freben avatar Jul 08 '21 21:07 freben

Hi guys, what's a status on this one? Unfortunately it's a TechDocs show stopper for us.. 😢

FilipPyrek avatar Sep 01 '21 15:09 FilipPyrek

Sorry but this is still a current issue. There's some discussion in the comments above. Having unencoded slashes in the URL makes it super problematic to parse and understand GitHub URLs - there's no surefire way to actually know where the branch ends and the path begins, without resorting to hammering the GitHub API with queries to resolve the situation - and this needs to be done in a lot of places where the code isn't even async ready, but rather expects to be able to do that parsing instantly and locally.

freben avatar Sep 01 '21 17:09 freben

@freben I like the idea of "alternative form" which would solve the problem but didn't require any groundbreaking changes. I like format where ref is supplied as a query string. That solution would make it extensible also for other information in future - if needed. Do you guys at @backstage prefer to implement this yourself? Or is this okay to be done by a community? Do you have any requirements & tips how you would like this to be implemented (and where)?

FilipPyrek avatar Sep 02 '21 06:09 FilipPyrek

@FilipPyrek Yeah that's probably the most viable way forward that we have seen so far, at least. We could make sure to implement support for the alternative form as well as the original native form - we do want to keep the convenience of copy+paste from the browser.

There's TONS of direct usages of git-url-parse across the code base. If we went for this strategy, we would have to rewrite them all to use some wrapper function e.g. in @backstage/integration instead. There's also probably a fair bit of code that does direct string manipulation, or does new URL(str).pathname.split('/') and similar trickery. So yeah, there will be a bunch of chasing down code that isn't up to par with the new requirement.

freben avatar Sep 02 '21 06:09 freben

And to answer your question, we'd appreciate help with it.

freben avatar Sep 02 '21 06:09 freben

Understood and totally agree @freben.

I will see what available capacity we will have and if somebody from us will start working on that, I will let you guys know. 👍

FilipPyrek avatar Sep 02 '21 10:09 FilipPyrek

Hey guys, dn if the URLPattern API can help.

SevenOutman avatar Oct 24 '21 04:10 SevenOutman

Unfortunately I think it can't. The problem is that the branch part is unencoded, and there is no way to - even theoretically - be sure where the branch ends and the path begins without making API calls to the remote. I personally think that adopting something like the kustomize format as a second supported form of url would be the cleanest way forward.

freben avatar Oct 24 '21 08:10 freben

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jul 15 '22 14:07 github-actions[bot]

De-stale-inating. This is still a bug. We had someone else run into it again today.

dallasgutauckis avatar Jul 20 '22 16:07 dallasgutauckis

Also ran into this recently while utilizing the validation endpoint. Feeding branches with forward slashes into the location field will break the validator (if the resulting branch doesn't exist).

E.G: Your branch name is test/something

You feed this branch into the location field, and it will try to find the location at the test branch instead of test/something.

dhhuynh2 avatar Aug 22 '22 20:08 dhhuynh2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 21 '22 20:10 github-actions[bot]

Hi everyone. I faced with the same problem for GitLab. And I offer the foloowing solution

  1. Add to integration config option to use only root catalog-info.yaml placement
  2. If that option is enabled we parse all path before the catalog-info.yaml as branch name
  3. If not we have current behavior

For example we have following gitlab url https://gitlabcom/group/subgroup/project/-/blob/branch_a/branch_b/catalog-info.yaml we need to fix that line to get in file path only file name https://github.com/backstage/backstage/blob/30314172ed18b3c10f49a14a88d35270b69041c2/packages/integration/src/gitlab/core.ts#L80

abogdanov37 avatar Nov 30 '22 14:11 abogdanov37

Hi guys! I work on solution suggested me above. This solution is so bad ((( I implement another one solution Main idea is check by existing branches in repository. When we create url to get source we 1/ get branch name as is in current code 2/ get all branches from repo which names start with name got on step 1 3/ match path to getted branch names I test this solution in my company and it works I add code below. I need review for that idea. If it's ok I add this changes in integration backend

// Converts
// from: https://gitlab.com/groupA/teams/teamA/subgroupA/repoA/-/blob/branch/filepath
// to:   https://gitlab.com/api/v4/projects/projectId/repository/files/filepath?ref=branch
export async function buildProjectUrl(
    target: string,
    projectID: Number,
    config: GitLabIntegrationConfig,
): Promise<URL> {
    try {
        const url = new URL(target);

        const branchAndFilePath = url.pathname
            .split('/blob/')
            .slice(1)
            .join('/blob/');

        let fileName: string;
        let branch: string;
        const [br, ...filePath] = branchAndFilePath.split('/');
        let repoBranches = await getProjectBranches(target, "^" + br, projectID, config);
        let findedBranch = checkExistingBranches(repoBranches, branchAndFilePath);
        if (findedBranch != undefined) {
            let path = branchAndFilePath.replace(findedBranch, "");
            fileName = path.split('/').filter(p => p).join("/");
            branch = findedBranch;
        } else {
            branch = br;
            fileName = filePath.join('/');
        }
        
        const relativePath = getGitLabIntegrationRelativePath(config);

        url.pathname = [
            ...(relativePath ? [relativePath] : []),
            'api/v4/projects',
            projectID,
            'repository/files',
            encodeURIComponent(decodeURIComponent(fileName)),
            'raw',
        ].join('/');

        url.search = `?ref=${branch}`;

        return url;
    } catch (e) {
        throw new Error(`Incorrect url: ${target}, ${e}`);
    }
}

// Get all branches started with some-name
export async function getProjectBranches(
    target: string,
    branchSearchTerm: string,
    projectID: Number,
    config: GitLabIntegrationConfig,
): Promise<Array<string>> {
    const url = new URL(target);

    if (!url.pathname.includes('/blob/')) {
        throw new Error('Please provide full path to yaml file from GitLab');
    }

    try {
        // Get gitlab relative path
        const relativePath = getGitLabIntegrationRelativePath(config);

        // Convert
        // to: https://gitlab.com/api/v4/projects/groupA%2Fteams%2FsubgroupA%2FteamA%2Frepo
        const repoIDLookup = new URL(
            `${url.origin}${relativePath}/api/v4/projects/${projectID}/repository/branches?search=${encodeURIComponent(
                branchSearchTerm,
            )}`,
        );

        const response = await fetch(
            repoIDLookup.toString(),
            getGitLabRequestOptions(config),
        );

        const data = await response.json();

        if (!response.ok) {
            throw new Error(
                `GitLab Error '${data.error}', ${data.error_description}`,
            );
        }
        let branches = data as Array<any>;
        return branches.map(b => b.name);
    } catch (e) {
        throw new Error(`Could not get GitLab project ID for: ${target}, ${e}`);
    }
}

abogdanov37 avatar Dec 06 '22 13:12 abogdanov37

@ferben @Rugvip guys need a feedback

abogdanov37 avatar Dec 07 '22 07:12 abogdanov37

@abogdanov37 to be honest I think we want to avoid any solution that adds an extra request to the URL reading. This problem seems to still cause less trouble than rate limits, which we would start hitting a lot sooner if we add in another request. That's overall why this is such a tricky problem, there's no solution that doesn't add an extra request that we know of.

We discussed this a bit more today, and we're thinking that the best approach to this would be to parse the URL during registration in the catalog, and along with that have our own internal URL format that is able to carry the information. That's also a good opportunity to standardize URL parsing and handling across all different SCM integrations. It's a significant chunk of work though, and unless someone want to pick that up and drive it all the way, it's not something I feel is worth prioritizing over other work at this point.

Rugvip avatar Dec 08 '22 12:12 Rugvip

I think that if you're going to be using a specific branch to reference a github entity I think that it would be entirely reasonable to just ask for and store the branch name and potential project path separately from the github project slug. e.g. just have those as fields on the catalog entity.

zcstellar avatar Dec 08 '22 18:12 zcstellar

@abogdanov37 to be honest I think we want to avoid any solution that adds an extra request to the URL reading. This problem seems to still cause less trouble than rate limits, which we would start hitting a lot sooner if we add in another request. That's overall why this is such a tricky problem, there's no solution that doesn't add an extra request that we know of.

We discussed this a bit more today, and we're thinking that the best approach to this would be to parse the URL during registration in the catalog, and along with that have our own internal URL format that is able to carry the information. That's also a good opportunity to standardize URL parsing and handling across all different SCM integrations. It's a significant chunk of work though, and unless someone want to pick that up and drive it all the way, it's not something I feel is worth prioritizing over other work at this point.

Hi @Rugvip! Thanks for a feedback! I understand your concern. Idea with internal url format seems good but we have no time to to wait solution and have no resources to do it (((
We can implement the same behavior without addition request bu we should have in config list of branches containing /. As for me it's not so good because we should manage this list by hand. But may be it helps you some how. For next step I try to create and public package for Backstage with my solution. I hope it helps some one.

abogdanov37 avatar Dec 08 '22 21:12 abogdanov37