
Validate that submitted server.json's are referencing valid source code registries

tadasant opened this issue 7 months ago

We want to avoid the following situations:

  • Someone submitting a server.json that has a reference to a package on npm which they do not have access to
  • Someone spamming the registry by submitting multiple instances of the same package

Because the only notion of identity we have is the user's GitHub OAuth plus optionally a DNS verification, we don't really have any solid data to tie all the way back to a package on e.g. npm or pypi.

I think we will need to implement auth flows for each external registry we intend to support (i.e. in order to publish a reference to an npm package, you need to "login with npm" or use an npm auth token and prove you have access to that package).

Proposal: for go-live, implement this for the community with npm, pypi, and ghcr. Allow the community to easily extend to support additional registries, so if they want to publish in less popular registries, they can do so by helping us with the long tail of possible extensions here.

Open to other ideas here; maybe there is a less heavy handed solution for this.

tadasant avatar May 27 '25 23:05 tadasant

We could run the server according to the instructions in server.json in a containerized environment to validate that it exists in the reported package managers (related to #93). If the servers exposed the server.json or .well-known with the reverse DNS name, we could compare what the server itself reports when run to the server.json that is being published. We'd need a way to expose that in a local server. We could also potentially grab the list_tools for the registry from this run during install if we wanted as well.

toby avatar Jun 02 '25 20:06 toby

@sridharavinash pointed the OpenSSF WG to this issue and it's how I found it. Thanks Avinash.

It seems like the main problem is establishing a trustworthy link between the server.json and the repo and/or the package artifact. It's not clear to me if this should be a bidirectional link or a link just FROM the MCP server entry TO the repo/underlying package. I will assume the latter for now.

One way this could be done is by embedding the server.json in the repo/package. This would be another benefit to my discussion item here: https://github.com/modelcontextprotocol/registry/discussions/118.

For example, the MCP registry could verify that the server.json at the root of the package ZIP/.tar.gz matches the submitted one. This could maybe be generic code for git repositories, but it would need to be registry-type-specific when validating the package artifact. The MCP registry name property (reverse DNS name?) could be used for the comparison.

Another way could be using the Trusted Publishing flow. For example, a GitHub Actions OIDC token could be accepted for pushing the server.json to the MCP registry and the source GitHub Action repo embedded in the OIDC token could be checked. https://repos.openssf.org/trusted-publishers-for-all-package-repositories
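For illustration, checking the repository claim of such a token might look roughly like this in Go (assuming the github.com/coreos/go-oidc/v3 library and a placeholder audience; a sketch, not an existing implementation):

package main

import (
    "context"
    "fmt"

    "github.com/coreos/go-oidc/v3/oidc"
)

// verifyActionsToken checks that a GitHub Actions OIDC token was issued for the
// expected repository before accepting a server.json publish.
func verifyActionsToken(ctx context.Context, rawToken, expectedRepo string) error {
    // GitHub Actions tokens are issued by this well-known issuer.
    provider, err := oidc.NewProvider(ctx, "https://token.actions.githubusercontent.com")
    if err != nil {
        return err
    }
    // "mcp-registry" is a placeholder audience the registry would ask publishers to request.
    idToken, err := provider.Verifier(&oidc.Config{ClientID: "mcp-registry"}).Verify(ctx, rawToken)
    if err != nil {
        return err
    }
    // GitHub embeds the source repo in a "repository" claim, e.g. "octo-org/octo-repo".
    var claims struct {
        Repository string `json:"repository"`
    }
    if err := idToken.Claims(&claims); err != nil {
        return err
    }
    if claims.Repository != expectedRepo {
        return fmt.Errorf("token issued for %q, expected %q", claims.Repository, expectedRepo)
    }
    return nil
}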

joelverhagen avatar Jun 03 '25 17:06 joelverhagen

If you haven't considered this already, I would suggest looking at https://docs.github.com/en/actions/security-for-github-actions/using-artifact-attestations/using-artifact-attestations-to-establish-provenance-for-builds.

It allows you to tie your package version not only to provenance information (the repository, what environment built it, and so forth), but also to extend it with whatever custom metadata one decides to add.

rdimitrov avatar Jun 05 '25 09:06 rdimitrov

One way this could be done is by embedding the server.json in the repo/package. This would be another benefit to my discussion item here: #118.

I like this idea, but wasn't sure if we could reliably embed / extract server.json in a package. So I asked Gemini 2.5 Pro about popular package file formats and including arbitrary files:

Package file formats and arbitrary files ...
| Programming Language | Package Manager(s) | Package File Format(s) | How to Include Arbitrary File(s) in Package |
|---|---|---|---|
| Python | pip | .whl (wheel, ZIP-based); .tar.gz or .zip (sdist, source distribution) | Via MANIFEST.in (for sdist), or package_data / include_package_data in setup.py/pyproject.toml (for wheels). |
| JavaScript | npm, yarn | .tgz (gzipped TAR archive) | Via the files array in package.json, or by simply including them in the project directory (respecting .npmignore / .gitignore). |
| Java | Maven, Gradle | .jar (Java Archive), .war, .ear (all ZIP-based) | Place files in src/main/resources (convention). Can be configured in pom.xml (Maven <resources>) or build.gradle[.kts] (Gradle sourceSets). |
| C# | NuGet | .nupkg (NuGet package, ZIP-based) | Via the <files> element in .nuspec, or <Content Include="path/to/file" Pack="true" PackagePath="contentFiles/any/any/yourfile"/> or <None Include="..."/> in .csproj (SDK-style). |
| PHP | Composer | VCS checkout (e.g., Git clone) or ZIP/TAR archives (often .zip if archived by the registry) | Typically by committing the file to the VCS. For archives, ensure it's not excluded by archive.exclude in composer.json. |
| Ruby | RubyGems | .gem (TAR-based, often gzipped) | Add to the files array in the .gemspec file (often populated using git ls-files -z.split("\x0")). |
| Go | Go Modules | VCS checkout (e.g., Git) or .zip (from module proxies) | Commit the file to the VCS. For embedding in binaries, use the //go:embed directive. |
| Rust | Cargo | .crate (gzipped TAR archive, i.e. .tar.gz) | Via the include key in Cargo.toml, or ensure it's tracked by VCS (default). build.rs can also copy files into $OUT_DIR. |
| Swift | Swift Package Manager (SPM) | VCS checkout (e.g., Git clone) | Define in Package.swift within a target's resources parameter (e.g., .process("path/to/file") or .copy("path/to/folder")). |
| Kotlin | Gradle, Maven | .jar (ZIP-based, same as Java for JVM targets) | Same as Java (e.g., src/main/resources). For Kotlin Multiplatform, configure in build.gradle.kts source sets for the relevant targets. |
| TypeScript | npm, yarn | .tgz (gzipped TAR archive, same as JavaScript) | Same as JavaScript. Ensure non-TS files are copied to the build output directory (e.g., dist/) by your build process if not directly included by npm/yarn. |
| Scala | sbt, Maven, Gradle | .jar (ZIP-based, same as Java) | Place in src/main/resources (convention for sbt, Maven, Gradle). Configurable in build.sbt or other build files (e.g., Compile / resourceDirectories in sbt). |
| R | install.packages() (via CRAN) | .tar.gz (source), .zip (Windows binary), .tgz (macOS binary) | Place in the inst/ directory (files are copied to the top level of the installed package) or data/ for R data objects. |
| Perl | CPAN (clients like cpanm) | .tar.gz | List in the MANIFEST file. Build scripts (Makefile.PL or Build.PL) can specify installation using directives like SHARE_DIR (e.g., with Module::Install or Dist::Zilla). |
| Dart | pub | .tar.gz | Include in the project directory (not ignored by .pubignore / .gitignore). For Flutter assets, list under flutter: assets: in pubspec.yaml. |
| Objective-C | CocoaPods, Carthage | VCS checkout or direct download; pre-built binaries (e.g., .framework.zip) | CocoaPods: s.resources or s.resource_bundles in the .podspec. Carthage: include in the built framework/XCFramework project structure. |
| Haskell | Cabal, Stack | .tar.gz | List in the data-files field of the .cabal file. |
| Elixir | Hex (via Mix) | .tar (internally gzipped by Hex before storage/transfer) | Add to the package.files list in mix.exs (defaults to lib, priv, mix.exs, README*, LICENSE*). Typically place in priv/. |
| Lua | LuaRocks | .rock or .src.rock (ZIP-based) | Specify in the .rockspec file, often by including them in directories listed under build.copy_directories or referenced by build steps. |
| Julia | Pkg | VCS checkout (e.g., Git); downloadable artifacts often as .tar.gz | Commit the file to the VCS. For large/binary data, use Artifacts.toml to declare downloadable artifacts, which Pkg then manages. |

The TL;DR is that all the popular package file formats are either ZIP or TAR/GZIP (i.e. easily extractable), and there is some native way to ensure that an arbitrary file is included in them.

However, it's worth mentioning that there can be nesting inside a package archive. In particular, a Ruby .gem file is a TAR file that contains a data.tar.gz file, which in turn would contain server.json (see Gem::Package for more info). So we would need package-specific extraction processes, but it seems doable.

For example, the MCP registry could verify that the server.json at the root of the package ZIP/.tar.gz matches the submitted one. This could maybe be generic code for git repositories, but it would need to be registry-type-specific when validating the package artifact. The MCP registry name property (reverse DNS name?) could be used for the comparison.

I don't think we can mandate that server.json be at the root. I think most package managers don't offer that level of control. But, instead, we could say that exactly one **/.mcp/server.json file must exist somewhere in the archive.
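As a rough illustration (not existing registry code), scanning an npm .tgz for an embedded .mcp/server.json and reading its name could look something like the Go sketch below; other tarball-based formats would need similar but format-specific handling:

package main

import (
    "archive/tar"
    "compress/gzip"
    "encoding/json"
    "fmt"
    "io"
    "os"
    "strings"
)

// findEmbeddedServerName scans an already-downloaded npm tarball for a file
// ending in .mcp/server.json and returns its "name" field for comparison with
// the submitted server.json. Names here are illustrative assumptions.
func findEmbeddedServerName(tgzPath string) (string, error) {
    f, err := os.Open(tgzPath)
    if err != nil {
        return "", err
    }
    defer f.Close()

    gz, err := gzip.NewReader(f)
    if err != nil {
        return "", err
    }
    tr := tar.NewReader(gz)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            return "", err
        }
        // npm tarballs prefix entries with "package/", so match on the suffix.
        if strings.HasSuffix(hdr.Name, "/.mcp/server.json") {
            var manifest struct {
                Name string `json:"name"`
            }
            if err := json.NewDecoder(tr).Decode(&manifest); err != nil {
                return "", err
            }
            return manifest.Name, nil
        }
    }
    return "", fmt.Errorf("no .mcp/server.json found in %s", tgzPath)
}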

jonathanhefner avatar Jun 13 '25 15:06 jonathanhefner

We could be a little crazy and use **/.well-known/server.json file path in the respective archive format (just aligning with the HTTP shape). .mcp is good too, and more natural for a file system.

I really like the idea of some defined subdirectory. TBH, we can just try something with NuGet (what I'm working on now) and see how it works out. Anyone on this thread, LMK what convention you think is good and I can try applying it to NuGet!

However, it's worth mentioning that there can be nesting inside a package archive. In particular, a Ruby .gem file is a TAR file that contains a data.tar.gz file, which in turn would contain server.json (see Gem::Package for more info). So we would need package-specific extraction processes, but it seems doable.

Yup, definitely package specific.

I am not sure how this would work for docker. Perhaps we can scan the file system layers from last to first...

joelverhagen avatar Jun 13 '25 16:06 joelverhagen

I think there are a few things being discussed in this issue that I'm hoping to peel apart, so that we can identify which are truly go-live blockers and then gain consensus on those quickly.

  1. Validate that submitted server.json's are referencing an existing package

(or formulated as a problem: avoid user disappointment due to them finding an interesting MCP, trying to install it, and then finding the underlying package has been deleted so it doesn't work)

I think maybe we should move this discussion to #93?

  2. Someone submitting a server.json that has a reference to a package on npm which they do not have access to

I'm not sure what underlying problem this is tackling exactly? This might be the "claiming credit for someone else's package", although I think this is probably unlikely to be too bad.

I think the proposal to embed a server.json in the package feels nice/satisfying to me somehow. Although I have two reservations:

  • It might create a bit of a weird recursive loop that breaks in practice sometimes. E.g. it's hard (~impossible?) to know the sha of your Docker image before you've built it, but might want to reference that in your server.json to specify the version. This then doesn't work if your server.json is inside the image you are building.
  • I think adding even more extra requirements on server owners is a little painful, and I think one of the MCP principles is 'make server dev easy', so adding extra stuff of having to ensure you're embedding a file just adds overhead.

I might again lean on moderation to solve this for go-live, and we see if this is a problem.

  3. Someone spamming the registry by submitting multiple instances of the same package

I think this might be quite hard to catch programmatically.

I think we can't rely on the underlying package being unique, because there are some 'multipack' MCPs, where you might want to publish something like:

  • com.example/airtable -> npx -y example-mcp-multipack serve airtable
  • com.example/bittorrent -> npx -y example-mcp-multipack serve bittorrent
  • com.example/cassandra -> npx -y example-mcp-multipack serve cassandra

Maybe we could flag cases where two server entries have different names but the exact same packages array, since that would be suspicious. But this would be very easy to circumvent, e.g. by adding a parameter that is ignored by your server.

So I think we might just have to rely on a moderation policy, and publishing limits (#21).

domdomegg avatar Aug 21 '25 02:08 domdomegg

I think maybe we should move this discussion to https://github.com/modelcontextprotocol/registry/issues/93?

Could be the same solution, though I think the scenarios are different, in that #93 is likely to occur after initial publication, versus this issue was about it happening at publication-time.

Someone submitting a server.json that has a reference to a package on npm which they do not have access to

I'm not sure what underlying problem this is tackling exactly? This might be the "claiming credit for someone else's package", although I think this is probably unlikely to be too bad.

This happens a lot. Probably like 5-10% of packages I've reviewed on GitHub are referencing an incorrect package. It's usually not intentional, but vibe-coded like package_name: bigquery when you're building a community BigQuery server. It's a terrible experience in that you never really know whether the package is the actual MCP server or if someone just messed up the metadata.

Someone spamming the registry by submitting multiple instances of the same package

Good point on the multipack problem, however I haven't seen this much (ever?) in practice, and I've reviewed many thousands of MCP server configs. So I don't mind that sacrifice.

I guess someone could just publish lots of packages as a workaround, but that adds more layers of friction and disincentives (bypassing package registry anti-spam measures, decentralization of their stats and code, etc).


Sorry I don't have a proposal for a great solution here yet and am tight on time; just wanted to get the data points out there

tadasant avatar Aug 21 '25 15:08 tadasant

This happens a lot. Probably like 5-10% of packages I've reviewed on GitHub are referencing an incorrect package. It's usually not intentional, but vibe-coded like package_name: bigquery when you're building a community BigQuery server. It's a terrible experience in that you never really know whether the package is the actual MCP server or if someone just messed up the metadata.

Ah that is quite a useful/important insight into the use case. And damn that is quite annoying... because it's a fairly difficult thing to check reliably. I also don't have any great solutions off the top of my head... @claude have you got any ideas?

domdomegg avatar Aug 22 '25 03:08 domdomegg

Claude finished @domdomegg's task


Package Validation Ideas for MCP Registry

Based on my analysis of the codebase and the problem described, here are several practical approaches to address the 5-10% incorrect package reference issue:

Analysis and Solutions

Todo List

  • [x] Analyze current codebase to understand existing validation
  • [x] Research the problem space and constraints
  • [x] Propose practical validation solutions

Current State

The registry currently has:

  • JWT-based authentication via GitHub OAuth/OIDC
  • Basic permission checking for repository ownership
  • JSON schema validation for server.json format
  • Support for multiple package registries (npm, pypi, docker, etc.)

Problem Analysis

The core issue is trust linkage - ensuring the submitted server.json references packages the submitter actually controls. As @tadasant noted, this happens 5-10% of the time, often unintentionally (e.g., using package_name: bigquery instead of the actual package name).

Proposed Solutions

1. Registry-Specific Validation APIs (Recommended for MVP)

Implement lightweight validation for each supported package registry:

For npm:

  • Query https://registry.npmjs.org/{package-name} to verify package exists
  • Check the repository.url field in package.json matches the GitHub repo from the JWT token
  • Most npm packages already link back to their GitHub repos (see the sketch below)

For PyPI:

  • Query https://pypi.org/pypi/{package-name}/json
  • Check project_urls.Homepage or project_urls.Repository for GitHub URL match

For Docker Hub:

  • Query https://hub.docker.com/v2/repositories/{namespace}/{repo}/
  • Check linked repository or description for GitHub URL
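The npm check above, for example, could be little more than one metadata request; a minimal sketch (assumed function and field names, not existing registry code):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strings"
)

// checkNPMRepository confirms the package exists on registry.npmjs.org and that
// its repository URL points at the publisher's GitHub owner/org.
func checkNPMRepository(packageName, githubOwner string) error {
    resp, err := http.Get("https://registry.npmjs.org/" + url.PathEscape(packageName))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode == http.StatusNotFound {
        return fmt.Errorf("npm package %q does not exist", packageName)
    }
    var meta struct {
        Repository struct {
            URL string `json:"url"`
        } `json:"repository"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&meta); err != nil {
        return err
    }
    // Repository URLs commonly look like "git+https://github.com/owner/repo.git".
    want := "github.com/" + strings.ToLower(githubOwner) + "/"
    if !strings.Contains(strings.ToLower(meta.Repository.URL), want) {
        return fmt.Errorf("npm package %q does not reference github.com/%s", packageName, githubOwner)
    }
    return nil
}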

Benefits:

  • Works with existing infrastructure
  • No additional auth flows needed
  • Catches most accidental misconfigurations
  • Lightweight API calls during publish

2. Embedded server.json Verification (Future Enhancement)

As discussed in the thread, require server.json (or .mcp/server.json) to be embedded in the actual package:

  • Download and extract the referenced package
  • Compare embedded server.json with submitted one
  • Ensures bidirectional verification

Benefits:

  • Strongest validation possible
  • Prevents impersonation
  • Self-documenting packages

Challenges:

  • Requires package-specific extraction logic
  • Circular dependency issues (Docker SHA problem)
  • Additional overhead for publishers

3. Publisher Verification Tokens (OAuth Extension)

Extend the current GitHub OAuth to include package registry verification:

  • During token creation, optionally verify npm/PyPI accounts
  • Store verified package registry accounts in JWT claims
  • Only allow publishing packages from verified accounts

Benefits:

  • Definitive ownership proof
  • Prevents malicious impersonation
  • Scales to new registries

Challenges:

  • More complex auth flow
  • Requires integration with multiple OAuth providers
  • Higher barrier to entry

4. Hybrid Approach (Recommended Implementation Path)

Phase 1 (MVP): Registry API validation (#1)

  • Quick to implement
  • Catches most accidental errors
  • Low barrier to entry

Phase 2: Add embedded server.json verification (#2)

  • Stronger validation for willing publishers
  • Optional initially, required later

Phase 3: Extended OAuth validation (#3)

  • For high-trust use cases
  • Optional premium verification

Implementation Priority

Start with Registry API validation as it provides the best ROI:

  • Solves the 5-10% accidental misconfiguration problem
  • Minimal implementation effort
  • No breaking changes to existing flow
  • Can be added to the existing publish endpoint validation

The validation could be added to internal/api/handlers/v0/publish.go after the JWT validation but before calling registry.Publish().

Would you like me to elaborate on any of these approaches or start implementing the registry API validation solution?


claude[bot] avatar Aug 22 '25 03:08 claude[bot]

(I think Claude's ideas are probably too heavyweight/restrictive. E.g. requiring repository.id to match github auth, or oauthing with package registries. I think the least-worst idea is including server.json in packages but I still think that adds quite a bit of maintainer burden which is painful).

domdomegg avatar Aug 22 '25 03:08 domdomegg

Maybe we do just try to run the servers e.g. in docker somewhere to check they work? Possibly this is beyond the scope of what we want to be doing at official registry level.

Plausibly a subregistry could do this though, and maybe pingback with #292 - with a signal that can nudge us to remove the package if that scan suggests it's almost certainly not a valid MCP server. Bit delayed though.

domdomegg avatar Aug 22 '25 03:08 domdomegg

I really think it's worth discussing how we can leverage GitHub's Attestations (or just Sigstore) to help solve these problems since it provides exactly this (a tamper-proof way of linking/adding metadata to artefacts, be it container images, packages, regular files, etc.).

I have a few folks that I can pull in from the Github side where we worked together on related supply chain projects if needed. @toby - I know you are also from Github so perhaps we can schedule something together.

rdimitrov avatar Aug 22 '25 08:08 rdimitrov

Maybe we do just try to run the servers e.g. in docker somewhere to check they work?

I haven't personally tried to do this across the ecosystem, but have read about several folks trying this and it's a rabbit hole (to successfully create docker setups from the long tail of possible ways people set up their servers). I think it's probably more than we want to bite off here.

I really think it's worth discussing how we can leverage GitHub's Attestations (or just Sigstore) to help solve these problems since it provides exactly this (a tamper-proof way of linking/adding metadata to artefacts, be it container images, packages, regular files, etc.).

Attesting to package contents and provenance are interesting ideas, but feel potentially heavy handed for the particular problem here? We're not really looking for proof that the code/package matches up; we just want to know whether the person executing the publish command also has the right to publish to the referenced npm/pypi/ghcr package name.

Of course, a full attestation/provenance solution would implicitly solve this too, but maybe there is a less heavy handed solution here. I think someone mentioned previously (I forget where) the idea of, rather than embedding the whole server.json in the source code, maybe we do something like mcp.name.txt that is a single line text file with the server name? Feels hacky, but I think it would work and not create that kind of self-referential problem with server.json that @domdomegg mentioned earlier. Maybe there's a cleaner way to design that.

tadasant avatar Aug 22 '25 14:08 tadasant

  • It might create a bit of a weird recursive loop that breaks in practice sometimes. E.g. it's hard (~impossible?) to know the sha of your Docker image before you've built it, but might want to reference that in your server.json to specify the version. This then doesn't work if your server.json is inside the image you are building.

The Docker situation is a good callout. However in practice do we think people will point to a SHA in their Docker package pointers? Or would they use a tag? This would allow the workflow to roughly be:

  1. Compute a desired tag (such as v1.2.0) in the build pipeline
  2. Embed this version into the server.json, as a Docker image tag
  3. Embed the server.json into the image
  4. Push the image to DockerHub
  5. Update the tag v1.2.0 to point to the SHA
  6. Push the server.json to the MCP registry

(CI/CD for a Docker based MCP server)

  • I think adding even more extra requirements on server owners is a little painful, and I think one of the MCP principles is 'make server dev easy', so adding extra stuff of having to ensure you're embedding a file just adds overhead.

Yes, this is a good point. I think this could be improved with build tooling that sits on top of the existing ecosystem-specific infra.

In NuGet, we're putting the server.json in the .nupkg. This has the added benefit of allowing an "install NuGet MCP" flow not dependent on the MCP Registry (useful in cases when, say, an enterprise has an existing NuGet package feed but not an MCP Registry). It also adds determinism to discovering the inputs (such as required CLI args) for the tool package, instead of scraping the README.

We're also thinking about an MSBuild target that fills in a template server.json and puts it into the .nupkg (our package artifact). There's a prototype here: https://github.com/Azure/azure-mcp/pull/432/files#diff-72de5e8f5eec18ccb844cb21186877e04dbb79e31b1a56c392cf0ffbe0056b91 for the Azure MCP server.

I'm happy to update the NuGet guidance once this is decided, but for now the embedded server.json is providing a portable, vendor agnostic way of declaring inputs for a local MCP server.

joelverhagen avatar Aug 22 '25 14:08 joelverhagen

I think someone mentioned previously (I forget where) the idea of, rather than embedding the whole server.json in the source code, maybe we do something like mcp.name.txt that is a single line text file with the server name?

I'd support this, although worry it does create yet another file format for server devs to be confused by.

The Docker situation is a good callout. However in practice do we think people will point to a SHA in their Docker package pointers? Or would they use a tag?

I think the SHA digest is pretty important for downstream consumers being able to trust your server, because Docker tags are not immutable (e.g. you can't do a security scan and associate it with a Docker tag reliably, but you can with a SHA digest).

Plausibly we should require it when submitting servers, or maybe server-side we transform the server.json to stamp in the latest digest for that tag at the time of publish.
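As a rough sketch of that server-side stamping (assuming a go-containerregistry dependency, which is not something the registry necessarily uses today; names are illustrative):

package main

import (
    "fmt"

    "github.com/google/go-containerregistry/pkg/crane"
)

// resolveTagToDigest resolves a mutable tag (e.g. "v1.2.0") to its current
// immutable digest, which could then be stored in the published server.json.
func resolveTagToDigest(image, tag string) (string, error) {
    // e.g. image = "ghcr.io/example/server" (hypothetical)
    digest, err := crane.Digest(fmt.Sprintf("%s:%s", image, tag))
    if err != nil {
        return "", err
    }
    // digest looks like "sha256:abc123..."
    return digest, nil
}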


Zooming out back to the issue though, maybe a server.json in the package somewhere is sufficient - even if just the name matches up. And we ignore whether all the other data matches up. I think this would effectively stop vibe coded packages being linked to the wrong package (ref).

domdomegg avatar Aug 23 '25 08:08 domdomegg

@claude can you suggest how we can implement this for all the different registry types we support? Ideally in a simple, maintainable way, without putting too much burden on server publishers. This might just be making some basic API requests, but if certain registry types are fiddly, highlight potential go libraries or other approaches we could use.

Read all the comments on this issue first. I think the idea behind most proposals was checking something in the package itself, e.g. looking for a server.json with at least the same name in the package.

domdomegg avatar Sep 01 '25 21:09 domdomegg

Claude finished @domdomegg's task


Package Validation Implementation Analysis

Comprehensive research on implementing package validation across all supported registry types to prevent the 5-10% incorrect package reference issue identified by @tadasant.

Complete Implementation Analysis & Recommendations

Todo List

  • [x] Analyze current codebase validation structure
  • [x] Research supported registry types and their package formats
  • [x] Propose implementation approach for each registry type
  • [x] Identify Go libraries and tools for package extraction
  • [x] Provide recommendations balancing validation vs. publisher burden

Current State Analysis

Supported Registry Types:

  • npm (Node.js packages) - .tgz format
  • pypi (Python packages) - .whl/.tar.gz formats
  • oci (Docker/OCI containers) - Multi-layer images
  • nuget (.NET packages) - .nupkg ZIP format
  • mcpb (MCP Bundles) - Direct downloads from GitHub/GitLab

Current Validation:

  • JWT authentication + namespace permission checking
  • Schema validation via validators.ValidatePublishRequest()
  • Limited package validation (only MCPB host allowlisting)
  • Gap: No ownership validation of referenced packages

Recommended Implementation Strategy

Phase 1: Low-Burden Validation (Quick Win)

Target: Solve the 5-10% accidental misconfiguration problem with minimal publisher burden.

Approach: Registry API + Repository URL validation

  • Check if package exists via registry APIs
  • Validate repository URLs in package metadata match GitHub namespace
  • No package downloading required
// internal/validators/package_ownership.go
type PackageOwnershipValidator struct {
    httpClient *http.Client
}

func (v *PackageOwnershipValidator) ValidateNPMOwnership(pkg *model.Package, githubRepo string) error {
    // 1. GET https://registry.npmjs.org/{package-name}
    // 2. Check package.repository.url matches expected GitHub repo
    // 3. Verify package exists and is public
    return nil // stub: the steps above are left unimplemented in this sketch
}

Registry-Specific APIs:

  • NPM: https://registry.npmjs.org/{package-name}
  • PyPI: https://pypi.org/pypi/{package-name}/json
  • NuGet: https://api.nuget.org/v3-flatcontainer/{package-id}/index.json
  • Docker Hub: https://hub.docker.com/v2/repositories/{namespace}/{repo}/

Phase 2: Content Validation (Strong Validation)

Target: Prevent malicious impersonation by verifying server.json exists in actual packages.

Approach: Download + extract packages, look for embedded server.json with matching name.

Go Libraries by Registry Type:

NPM (.tgz files) - EASIEST

// Standard library sufficient
import (
    "archive/tar"
    "compress/gzip"  
    "encoding/json"
)

func validateNPMContent(packageName, version, expectedName string) error {
    // Download .tgz, extract tar.gz, find package/server.json
}

NuGet (.nupkg = ZIP) - EASY

import "archive/zip"

func validateNuGetContent(packageId, version, expectedName string) error {
    // Download .nupkg, extract ZIP, find server.json
}

PyPI (.whl/.tar.gz) - MODERATE

// Handle both wheel (ZIP) and source (tar.gz) formats
func validatePyPIContent(packageName, version, expectedName string) error {
    // Try .whl first (ZIP format), fallback to .tar.gz
}

Docker/OCI - COMPLEX

import "github.com/google/go-containerregistry/pkg/v1/remote"

func validateOCIContent(imageName, tag, expectedName string) error {
    // Download image layers, reconstruct filesystem, search for server.json
    // Most challenging due to multi-GB images and layer complexity
}

MCPB - MODERATE (enhance existing)

func validateMCPBContent(url, expectedName string) error {
    // Download bundle, extract (ZIP or tar.gz), find server.json
}

Implementation Priority & Complexity

Tier 1: Start Here (Low Complexity)

  1. NPM Repository URL validation - Check package.json repository field
  2. NuGet Repository URL validation - Check .nuspec repository field
  3. Basic content validation for NPM - .tgz extraction is straightforward

Tier 2: Medium Complexity

  1. PyPI validation - Handle both wheel and source distribution formats
  2. NuGet content validation - ZIP extraction + .nuspec parsing
  3. Enhanced MCPB validation - Add content validation to existing host checks

Tier 3: High Complexity (Consider Later)

  1. Docker/OCI validation - Complex multi-layer image handling
    • Recommend starting with specific registries (Docker Hub, GHCR)
    • Consider size limits (e.g., skip images >1GB)

Technical Implementation

Proposed Code Structure

// internal/validators/package_validation.go
type PackageValidator struct {
    config PackageValidationConfig
    client *http.Client
}

type PackageValidationConfig struct {
    MaxDownloadSize    int64         // e.g., 50MB limit
    DownloadTimeout    time.Duration // e.g., 30 seconds
    ExtractTimeout     time.Duration // e.g., 10 seconds  
    EnableContentCheck bool          // Phase 1: false, Phase 2: true
}

func (v *PackageValidator) ValidatePackageOwnership(pkg *model.Package, githubNamespace string) error {
    switch strings.ToLower(pkg.RegistryType) {
    case model.RegistryTypeNPM:
        return v.validateNPM(pkg, githubNamespace)
    case model.RegistryTypePyPI:  
        return v.validatePyPI(pkg, githubNamespace)
    case model.RegistryTypeNuGet:
        return v.validateNuGet(pkg, githubNamespace)
    case model.RegistryTypeOCI:
        return v.validateOCI(pkg, githubNamespace)  
    case model.RegistryTypeMCPB:
        return v.validateMCPB(pkg, githubNamespace)
    }
    return nil // Skip validation for unsupported types
}

Integration Point

Add to internal/api/handlers/v0/publish.go after existing validation:

// After line 55 (after validators.ValidatePublishRequest)
packageValidator := validators.NewPackageValidator(cfg.PackageValidation)
for _, pkg := range input.Body.Packages {
    if err := packageValidator.ValidatePackageOwnership(&pkg, claims.GitHubNamespace); err != nil {
        return nil, huma.Error400BadRequest("Package ownership validation failed", err)
    }
}

Security & Performance Considerations

Security Controls

  • Download size limits (prevent DoS with huge packages)
  • Timeout controls (prevent hanging on slow downloads)
  • Sandboxed extraction (extract in temporary directories)
  • Content type validation (verify expected file formats)

Performance Optimizations

  • Validation caching (cache results for package+version combinations)
  • Async validation (background jobs for content validation)
  • Concurrent validation (validate multiple packages in parallel)
  • Early termination (stop at first server.json found)

Error Handling Strategy

  • Graceful degradation: If validation service is down, allow publish but log warning
  • Retry logic: Retry failed downloads with exponential backoff
  • Clear error messages: Help users understand validation failures

Addressing the Core Problems

Problem 1: "Someone submitting a server.json that has a reference to a package they don't have access to"

  • Solution: Phase 1 repository URL validation catches most cases immediately
  • Benefit: Minimal publisher burden, high accuracy for legitimate packages

Problem 2: "Someone spamming the registry by submitting multiple instances of the same package"

  • Solution: Phase 2 content validation prevents package impersonation
  • Benefit: Strong security, but requires publishers to embed server.json

Problem 3: 5-10% accidental misconfigurations (vibe-coded package names)

  • Solution: Both phases address this through package existence + ownership validation
  • Benefit: Immediate feedback during publish prevents confusion

Recommendation

Start with Phase 1 (Repository URL Validation) for quick impact:

  1. Implement NPM and NuGet API validation first (easiest)
  2. Add PyPI validation
  3. Enhance MCPB validation
  4. Consider Docker/OCI validation later due to complexity

This approach provides immediate value with minimal publisher burden, then allows evolution toward stronger content-based validation as the ecosystem matures.

The research shows that implementing package validation across all registry types is definitely feasible, with varying levels of complexity. The phased approach I've outlined should address the core problems while being practical to implement and maintain.


claude[bot] avatar Sep 01 '25 21:09 claude[bot]

I forgot to copy over the thread I started on Discord about this -

Rado (Stacklok): 🧵 Linking the server.json to the referenced package entries

[13:45]Rado (Stacklok): Hey, folks, I've been thinking about this so I decided to start a thread here and get some initial discussion started on an initial proposal. There are multiple ways to address this but I found the following to be simple and transparent enough from a user point of view.

It is solely based on: The fact that Sigstore is a well established open source solution for signing and verifying software artifacts, such as container images, packages like npm, pypi and much more.

The idea that I have in mind is the following (happy to brainstorm).

Currently: Upon publishing we verify the namespace ownership.

For example if I, rdimitrov, log in (or use the GH actions token) and try to publish an MCP entry referencing a repository under github.com/rdimitrov/... (or an org that I administrate?) we can allow/block this on the registry level, right?

Correct me if I'm wrong here as I haven't checked the auth code in details.

Assuming this is correct, the new changes are:

  • Suggest to the authors to have their packages signed by Sigstore (i.e. docker images, npm, pypi and other registries that support Sigstore signatures). Note that this is already done by MCP server authors that care about verifying such provenance; the GitHub MCP server is one example.
  • Upon publishing the server.json, the registry will do what we already do (matching author to namespace), plus it will go through the packages in server.json, search for their Sigstore signature, and if it finds one and it's valid, extract the relevant metadata from it (issuer, identity, repository reference, etc.?) and use that to compare with the repository/identity we verified in the beginning.
  • Depending on what we decide, we can allow/block publishing, or if we allow things like that, have an additional property under x-io.modelcontextprotocol.registry indicating that package foo is indeed cryptographically linked to the publishing author.

[14:03]Rado (Stacklok): Other things that can be considered (probably either obsolete/optional/redundant):

  • If we want, we can store a reference under the registry extensions to the Sigstore Rekor entry log so consumers can see which signature was used to ensure this validation (might be nice to have)
  • Have the server.json be also signed by Sigstore and compare that the identities of the two (json + package) are the same? (too much I think)

[01:19]Joel Verhagen (NuGet, Microsoft): For step 2, "search for their Sigstore signature", how is that done? Is that with a GitHub Actions API? Asked another way: who holds the mapping of package name X to Sigstore signature Y? Apologies if this is a basic question. I am not a Sigstore expert.

[01:25]Toby (GitHub): For reference, this is the issue where we've been discussing: https://github.com/modelcontextprotocol/registry/issues/96

[01:26]Toby (GitHub): I do think this is probably a good idea and solves for validating the ownership of the artifact in the package registries

[01:26]Toby (GitHub): it would be nice to have all of this wrapped up in a simple publication process though

[12:50]Rado (Stacklok): Joel Verhagen (NuGet, Microsoft) - it depends on the ecosystem, but in general there's a well defined way of finding the signature based on the corresponding package identifier. For example, for container images the signature is pushed along the container image and its identifier is the digest of the container image it refers to suffixed by .sig.

[12:54]Rado (Stacklok): Toby (GitHub) - Thanks for linking the issue! I'll make sure to provide a summary of this thread there 👍

By "publication process" do you mean it in the context of a SEP or like a design doc of the process of publishing to a registry server? Or I guess both 😄

[15:19]Daniele M. (dmartino): I imagine this would be customizable at least at registry level, right? E.g., each registry could have a different signature-verification provider, if any, and policies to allow unsigned servers or not. You are not envisioning changes in the embedded server definition, instead, right? (apart from properties under the x-io.modelcontextprotocol.registry section)

[15:45]Rado (Stacklok): Daniele M. (dmartino) - The PoC idea that I had in mind:

  • Do not enforce this on the registry level, because that would affect adoption and that's not okay. That may change in the future of course, but for the time being this is my understanding of the goals of the official registry (correct me if I'm wrong here).
  • With that in mind, extend the registry extensions to have information about whether the referenced package was successfully verified or not. That could be a simple bool or a more detailed struct with things like a reference to the signature, etc.
  • Then it's up to the registry consumers (subregistries/clients) to decide how to act on that.

Regarding the verification providers, I had only Sigstore in mind, but we can chat more if you have others in mind 👍

[18:00]Joel Verhagen (NuGet, Microsoft): NuGet already has a signing flow which is embedded signatures using x509 code signing certificates. I think using Sigstore is technically possible for NuGet since it is detached, but NuGet.org has no way of hosting or associating a Sigstore signature, so the MCP Registry would have no way of looking up the sigstore signature. I wonder what the "verification provider" would look like for different signature types. What is that "common interface" that abstracts away Sigstore vs. custom signing implementation X.

I am a little concerned that Sigstore is too opinionated so I definitely +1 a "verification provider" abstraction.

Since we are asking NuGet MCP authors to embed a .mcp/server.json in the package, perhaps the verification provider for NuGet is just looking for that embedded server.json and matching on package type = nuget and a matching package name.

[00:34]Toby (GitHub): Rado (Stacklok) I mean the act of an author publishing a server 😄

[12:34]Rado (Stacklok): Toby (GitHub) - got it 😄 yeah, my initial thinking is this should be transparent from the point of view of the author making the publication. Authors will publish the server.json as they do now, then the registry will try to find and verify if this package is coming from the same repository this author has rights to publish. If all goes good, the registry stores information about this verification in its metadata(the registry extensions). Then consumers will decide if and how to leverage that. Of course as with everything the devil is in the details so that can be confirmed once we start looking at the implementation 👍

Joel Verhagen (NuGet, Microsoft) - definitely agree on making this an interface 👍 Also I haven't checked how npm or pypi are making the signatures discoverable, but perhaps there's something there which NuGet can find inspiration from.
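As a rough illustration of the verification step sketched in the thread above (assuming the package is a container image signed via GitHub Actions keyless signing and that the cosign CLI is available to the registry; this is not an agreed design), the check could boil down to:

package main

import (
    "fmt"
    "os/exec"
)

// verifySigstoreIdentity checks that the image carries a valid keyless Sigstore
// signature whose certificate identity lives under the publisher's GitHub namespace.
func verifySigstoreIdentity(imageRef, githubOwner string) error {
    // Keyless signatures from GitHub Actions carry identities like
    // https://github.com/<owner>/<repo>/.github/workflows/<file>@<ref>.
    identityPattern := fmt.Sprintf("^https://github.com/%s/", githubOwner)
    cmd := exec.Command("cosign", "verify",
        "--certificate-identity-regexp", identityPattern,
        "--certificate-oidc-issuer", "https://token.actions.githubusercontent.com",
        imageRef)
    if out, err := cmd.CombinedOutput(); err != nil {
        return fmt.Errorf("signature verification failed: %v\n%s", err, out)
    }
    // A zero exit code means cosign found a valid signature matching the expected identity.
    return nil
}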

rdimitrov avatar Sep 02 '25 06:09 rdimitrov

Realised that reading files inside the packages can be quite hard to do reliably across all registries (especially Docker). Discussed with Tadas a bit this evening.

Desirable qualities from a solution:

  • validate the package at least exists
  • validate that the package is (or at least intends to be) a real MCP server
  • prevent duplicate submissions for the same package, e.g. seo spam type risks

Some ideas that came out of the meeting:

  • This problem seems most acute for vibe coded npm and pypi packages in Tadas's experience, so maybe we could just focus on these?
  • For docker maybe we could use https://github.com/opencontainers/image-spec/blob/main/annotations.md instead?
  • For NPM we could ask people to put something in the package.json (rough sketch below)
  • If the above doesn't work for some reason, blocking publishing a package twice in the same namespace could be a good fallback
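A rough sketch of what checking such a package.json marker could look like, using a hypothetical mcpName field that must equal the server.json name (field and function names are assumptions, not an agreed convention):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// checkNPMNameMarker fetches the latest published package.json from the npm
// registry and compares a hypothetical mcpName field against the server name.
func checkNPMNameMarker(packageName, expectedServerName string) error {
    resp, err := http.Get("https://registry.npmjs.org/" + packageName + "/latest")
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    var pkg struct {
        MCPName string `json:"mcpName"` // hypothetical marker field
    }
    if err := json.NewDecoder(resp.Body).Decode(&pkg); err != nil {
        return err
    }
    if pkg.MCPName != expectedServerName {
        return fmt.Errorf("package %q declares mcpName=%q, expected %q", packageName, pkg.MCPName, expectedServerName)
    }
    return nil
}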

domdomegg avatar Sep 02 '25 19:09 domdomegg