SwiftPackageIndex-Server icon indicating copy to clipboard operation
SwiftPackageIndex-Server copied to clipboard

Show if a package is a fork

Open daveverwer opened this issue 4 years ago • 13 comments

The SwiftPM Library recorded if a repository was a fork of another repository and displayed that and I think the SPI should do the same.

Several parts to this:

  • [ ] Record a parent_url field in the repositories table and populate it from the GitHub API call.
  • [ ] Display whether a repository is a fork on the Package page and give a link back to the original repository.

daveverwer avatar Jul 22 '20 22:07 daveverwer

Here's what I found:

  • Unused forked_from_id: Repository? in repositories It seems like an overshot for a simple link to the original repository 🤔 Can it be replaced with the parent_url field?

  • Unused isFork flag in GraphQL request Github.Repository Probably should be also replaced with

    parent {
        url
    }
    

elfenlaid avatar Jun 19 '21 15:06 elfenlaid

Yea there's a bit of history to this feature. The original SwiftPM Library actually supported displaying when a repo was a fork of another repo and linked people out to the original repository as well. It's a useful indicator, even if the fork is the one being more actively maintained.

So, this field existed as we were originally planning to replicate all existing functionality for launch, but this feature didn't quite make it and then other things took priority.

I'd love to have it as a feature though, it's a great indicator for that top section. What I'd ideally like is to be able to put something like this in:

This package is a fork of [example/OriginalPackage].

Where [example/OriginalPackage] is a link to either:

  • The package page on the SPI for the original repository, if we have it.
  • The GitHub page for the forked repository, if we don't have the original in our package list.

daveverwer avatar Jun 20 '21 09:06 daveverwer

Worth noting (though doesn't stop this ticket going through) that the package list nightly scripts currently ignore all dependencies which are forks: https://github.com/SwiftPackageIndex/PackageList-Validator/blob/100a768c6c5d2840417576a85bc4d2bf8c9bab5a/Sources/ValidatorCore/Commands/CheckDependencies.swift#L187

Once we add support for it here, we may want to consider removing that restriction.

Sherlouk avatar Jun 20 '21 09:06 Sherlouk

I think that's still sensible - we don't necessarily want to greedily pull in all forks in the nightly. It's different if someone submits a fork for including, that's more deliberate than the crawler.

finestructure avatar Jun 20 '21 10:06 finestructure

Where [example/OriginalPackage] is a link to either:

* The package page on the SPI for the original repository, if we have it.

* The GitHub page for the forked repository, if we don't have the original in our package list.

Sure, that's a cool feature to have. I have only a vague idea of how to implement it, though. Things start to get blurry when it comes to SPI original repository link.

I can save fork_name and fork_owner as separate fields in the repositories, alongside forked_from_id. Afterward, somehow update forked_from_id with the help of the fork_* fields. Maybe with an SQL trigger or in the ingestion phase 🤔

Though, said approach duplicates the scheme's data. I wonder whether it's feasible to auto-create the SPI package for the parent repo (if it's not there) during the ingestion phase?

elfenlaid avatar Jun 21 '21 05:06 elfenlaid

I think the way we should model this on the Swift side is with an enum

enum Fork {
  case parentId(Package.Id)
  case parentURL(String)
}

This would be set on ingestion. We check the URL we get back from Github against packages.url and either store the url or the id, depending on that match. It's an extra query but it goes against a unique index so its fast. We'd need to make sure the url is normalised (https, lowercased, .git extension) before matching.

The db field would be a JSONB to hold the enum. This would benefit from waiting until we move to Swift 5.5 so we get automatic enum Codable with associated values.

There's a downside here in using JSONB in that it prevents joins (or at least simple joins) but I don't think we actually need to join on this field. It's technically a relationship but I don't think we'll be using it as such in practise. This is worth thinking about more than the 5mins I've spent on it though!

finestructure avatar Jun 21 '21 07:06 finestructure

There's a downside here in using JSONB in that it prevents joins (or at least simple joins) but I don't think we actually need to join on this field. It's technically a relationship but I don't think we'll be using it as such in practise. This is worth thinking about more than the 5mins I've spent on it though!

Yes, I don't see this data being used for anything outside of this feature in the future. I don't think we lose much here.

daveverwer avatar Jun 21 '21 08:06 daveverwer

I think the way we should model this on the Swift side is with an enum

Swift's side enum is totally reasonable. But I'm not so sure with JSONB. It seems that adding parent_url field is a more generic way of handling the feature. Even though it causes data denormalization.

elfenlaid avatar Jun 21 '21 08:06 elfenlaid

I'm not sure how that would work though. If it's just a parent_url we won't be able to tell if it's a parent known to the index or just a GitHub url unless we look it up against package.url on each read, or would we?

If it's a JSONB backed by and enum we only look up the url once on ingestion and save that resulting enum to the db.

Does that make sense?

finestructure avatar Jun 21 '21 08:06 finestructure

It makes sense to do this on ingestion as long as it's a field that is checked/updated on every ingest cycle rather than only when new. It could be the case that a parent repository is not in the index when it gets added (and I agree we shouldn't automate the discovery of parent forks), but that parent may get added later.

daveverwer avatar Jun 21 '21 09:06 daveverwer

I'm not sure how that would work though. If it's just a parent_url we won't be able to tell if it's a parent known to the index or just a GitHub url unless we look it up against package.url on each read, or would we?

Yeah, the model code will check it, like:

let parentUrl = repository.forkedFrom.flatMap { /* compose SPI url */ } ?? repository.parentUrl

We just preload forkedFrom association if it present

elfenlaid avatar Jun 21 '21 12:06 elfenlaid

It makes sense to do this on ingestion as long as it's a field that is checked/updated on every ingest cycle rather than only when new. It could be the case that a parent repository is not in the index when it gets added (and I agree we shouldn't automate the discovery of parent forks), but that parent may get added later.

That's right, it might not work from the first ingest. Plus parentUrl also can be altered (like the parent's owner changed its name) 🤔

elfenlaid avatar Jun 21 '21 12:06 elfenlaid

The db field would be a JSONB to hold the enum. This would benefit from waiting until we move to Swift 5.5 so we get automatic enum Codable with associated values.

As a second thought, JSONB will do just fine. Sorry for derailing the conversation.

The main point, as Dave has mentioned, that the field should be updated on every ingest. Regardless of the field's representation.

elfenlaid avatar Jun 21 '21 13:06 elfenlaid