warehouse
Secret reporting endpoint
What's the problem this feature will solve?
As part of our ongoing collaboration to find exposed secrets in PyPI packages, we are working on a scanning pipeline that automatically scans newly released packages. In order to report our findings, we will need an endpoint we can call, with an agreed-upon schema.
Describe the solution you'd like
Schema
Ideally, the endpoint’s payload would be on a per-artifact basis, allowing us to include metadata about the artifact alongside the list of secrets that were found. Here is a possible schema for the payload.
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "Artifact scanning report",
"description": "The detail of all the findings for a given artifact",
"type": "object",
"required": [
"release",
"scan_info",
"scan_results"
],
"properties": {
"release": {
"type": "object",
"required": [
"title",
"package_name",
"version"
],
"properties": {
"title": {
"type": "string",
"examples": [
"ggshield 1.0.2"
]
},
"package_name": {
"type": "string",
"examples": [
"ggshield"
]
},
"version": {
"type": "string",
"examples": [
"1.0.2"
]
}
}
},
"scan_info": {
"type": "object",
"required": [
"scanner_version",
"scanned_at"
],
"properties": {
"scanner_version": {
"type": "string",
"examples": [
"2.99.0"
]
},
"scanned_at": {
"type": "string",
"format": "date-time",
"examples": [
"2023-11-16T17:10:25Z"
]
}
}
},
"scan_results": {
"type": "array",
"items": {
"type": "object",
"required": [
"artifact",
"secrets"
],
"properties": {
"artifact": {
"type": "object",
"required": [
"name",
"sha256_digest"
],
"properties": {
"name": {
"type": "string",
"examples": [
"ggshield-1.0.2.zip"
]
},
"sha256_digest": {
"type": "string",
"examples": [
"13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de"
]
}
}
},
"secrets": {
"type": "array",
"items": {
"type": "object",
"required": [
"detector_name",
"detector_display_name",
"company_name",
"filepath",
"matches",
"validity_status"
],
"properties": {
"detector_name": {
"type": "string",
"examples": [
"google_aiza"
]
},
"detector_display_name": {
"type": "string",
"examples": [
"Google API Key"
]
},
"company_name": {
"type": "string",
"examples": [
"Google"
]
},
"documentation_url": {
"type": "string",
"format": "uri",
"examples": [
"https://docs.gg.com/google_aiza"
]
},
"filepath": {
"type": "string",
"examples": [
"/ggshield/connect/google.py"
]
},
"matches": {
"type": "array",
"items": {
"type": "object",
"required": [
"match_name",
"index_start",
"index_end"
],
"properties": {
"match_name": {
"type": "string",
"examples": [
"apikey"
]
},
"index_start": {
"type": "integer",
"examples": [
12
]
},
"index_end": {
"type": "integer",
"examples": [
32
]
}
}
}
},
"validity_status": {
"type": "string",
"enum": [
"NO_CHECKER",
"FAILED_TO_CHECK",
"VALID",
"INVALID"
],
"examples": [
"VALID"
]
}
}
}
}
}
}
}
},
"examples": [
{
"release": {
"title": "ggshield 1.0.2",
"package_name": "ggshield",
"version": "1.0.2"
},
"scan_info": {
"scanner_version": "2.99.0",
"scanned_at": "2023-11-16T17:10:25Z"
},
"scan_results": [
{
"artifact": {
"name": "ggshield-1.0.2.zip",
"sha256_digest": "13550350a8681c84c861aac2e5b440161c2b33a3e4f302ac680ca5b686de48de"
},
"secrets": [
{
"detector_name": "google_aiza",
"detector_display_name": "Google API Key",
"company_name": "Google",
"documentation_url": "https://docs.gg.com/google_aiza",
"filepath": "/ggshield/connect/google.py",
"matches": [
{
"match_name": "apikey",
"index_start": 12,
"index_end": 32
}
],
"validity_status": "VALID"
}
]
}
]
}
]
}
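As a rough illustration of how a caller might sanity-check a report before sending it, here is a minimal sketch that only verifies the required keys and the `validity_status` enum from the schema above, using the standard library. The helper names (`missing_required`, `_missing`) are hypothetical, and a real integration would more likely run a full JSON Schema validator instead.

```python
# Required keys and enum values are copied from the proposed schema above.
# The helpers themselves are hypothetical and check structure only.
REQUIRED = {
    "report": ("release", "scan_info", "scan_results"),
    "release": ("title", "package_name", "version"),
    "scan_info": ("scanner_version", "scanned_at"),
    "scan_result": ("artifact", "secrets"),
    "artifact": ("name", "sha256_digest"),
    "secret": (
        "detector_name",
        "detector_display_name",
        "company_name",
        "filepath",
        "matches",
        "validity_status",
    ),
    "match": ("match_name", "index_start", "index_end"),
}
VALIDITY_STATUSES = {"NO_CHECKER", "FAILED_TO_CHECK", "VALID", "INVALID"}


def _missing(obj, keys, path):
    """Return the paths of required keys that are absent from obj."""
    return [f"{path}.{key}" for key in keys if key not in obj]


def missing_required(report):
    """Return a list of problems; an empty list means the structure looks OK."""
    problems = _missing(report, REQUIRED["report"], "$")
    if "release" in report:
        problems += _missing(report["release"], REQUIRED["release"], "$.release")
    if "scan_info" in report:
        problems += _missing(report["scan_info"], REQUIRED["scan_info"], "$.scan_info")
    for i, result in enumerate(report.get("scan_results", [])):
        base = f"$.scan_results[{i}]"
        problems += _missing(result, REQUIRED["scan_result"], base)
        if "artifact" in result:
            problems += _missing(result["artifact"], REQUIRED["artifact"], f"{base}.artifact")
        for j, secret in enumerate(result.get("secrets", [])):
            sbase = f"{base}.secrets[{j}]"
            problems += _missing(secret, REQUIRED["secret"], sbase)
            # A present but unknown status is reported separately from a missing one.
            if "validity_status" in secret and secret["validity_status"] not in VALIDITY_STATUSES:
                problems.append(f"{sbase}.validity_status")
            for k, match in enumerate(secret.get("matches", [])):
                problems += _missing(match, REQUIRED["match"], f"{sbase}.matches[{k}]")
    return problems
```

Returning a list of paths rather than raising on the first problem makes it easier to log everything that is wrong with a report in one pass.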
Response
We do not expect the endpoint to return any data. We just need to be able to distinguish between a successful call and one that fails: standard HTTP status codes should be more than enough.
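On the calling side, relying on status codes alone could look something like the sketch below. The function name `classify_response` and the three outcome labels are hypothetical, and splitting failures into "retry" and "rejected" is an assumption that goes beyond the success/failure distinction requested above.

```python
def classify_response(status_code: int) -> str:
    """Map an HTTP status code to a coarse outcome for the reporting client."""
    if 200 <= status_code < 300:
        return "success"
    # Assumption: rate limiting (429) and server errors are worth retrying later.
    if status_code == 429 or 500 <= status_code < 600:
        return "retry"
    # Anything else (e.g. 400, 401, 403) is treated as a rejected report.
    return "rejected"
```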
API versioning
We have no strong requirement on this point, and will be fine with whichever solution you choose for the versioning of the schema.
Call volume and rate limiting
Since we are planning to call the endpoint once per artifact in which we find secrets, the worst case would be that we find secrets in every single artifact. In that case, our volume of calls would be directly proportional to the number of releases. We consequently don’t expect our call volume to be high enough to be restricted by rate limiting.
Authentication
This endpoint should not be publicly available. A possible approach would be to use both authentication via a secret (ideally just an API key) and an IP allowlist, to guarantee that only known entities have access to the endpoint.
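To make the combined check concrete, here is a minimal sketch of an API key comparison plus an IP allowlist, using only the standard library. The function name `is_authorized` and its parameters are hypothetical, and how Warehouse would actually store keys or obtain the client address is not covered here.

```python
import hmac
import ipaddress


def is_authorized(api_key, client_ip, expected_key, allowed_networks):
    """Return True only if the key matches AND the caller's IP is allowlisted."""
    # Constant-time comparison avoids leaking key contents through timing.
    key_ok = hmac.compare_digest(api_key.encode(), expected_key.encode())
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        # An unparseable address is never considered allowlisted.
        return False
    ip_ok = any(ip in ipaddress.ip_network(net) for net in allowed_networks)
    return key_ok and ip_ok
```

Both checks are evaluated before combining them, so a caller cannot tell from the outcome which of the two failed.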
Remediation
In the case of prolonged downtime of the endpoint, we won’t be able to upload our findings. They will be persisted on our end and can be re-uploaded at a later point. We do not plan to automate this: it will be done “manually”, in an ad hoc fashion.
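The persist-then-re-upload approach could be sketched as follows, assuming reports are kept as JSON files in a local directory. The names `persist_report` and `reupload_all`, the file layout, and the `send` callback are all hypothetical; `send` stands in for whatever actually calls the endpoint and returns `True` on success.

```python
import json
from pathlib import Path
from typing import Callable


def persist_report(spool_dir: Path, name: str, report: dict) -> Path:
    """Write a report that could not be uploaded to the local spool directory."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    path = spool_dir / f"{name}.json"
    path.write_text(json.dumps(report), encoding="utf-8")
    return path


def reupload_all(spool_dir: Path, send: Callable[[dict], bool]) -> int:
    """Try to send every spooled report; delete the ones that succeed."""
    uploaded = 0
    for path in sorted(spool_dir.glob("*.json")):
        report = json.loads(path.read_text(encoding="utf-8"))
        if send(report):
            path.unlink()
            uploaded += 1
        # Reports that fail again stay on disk for the next manual run.
    return uploaded
```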
We would also probably need to have an automated way of revoking / renewing our own API key, to be able to remediate any leak on our end immediately.
I have modified the proposed schema. The gist of the change is that the schema now includes the result of a scan of an entire release, not just of a given artifact.
Thanks for the patience on this. I've taken a look now and have some thoughts:
@miketheman is working on developing the infrastructure for the Malicious Package Reporting API, and much of that work will be required for this API endpoint as well (authentication, observation model, etc.), so that work will need to be completed before we can implement the API endpoint itself.
For our own uses we'll need to decide a "minimum" set of required fields that will actually get used by our backend and then all other fields can be sent as additional information that we might use later on. The nice thing about the Observation model that Mike's designed is that we can gather all the information and then choose to use more later down the road, so don't let our small number of required fields discourage sending more information in the payload.
Identifying some straightforward required fields we'll likely need:
- Project name (determine maintainers to send email to)
- Release version (for Inspector URL, email)
- File name (for Inspector URL)
- Line number (to create an Inspector URL)
- Secret type/provider (for email prose, i.e. "Google Cloud API Key")
- Secret status (for alerting user what action has been taken or if they need to take additional action)
- Secret documentation URL (for providing more information about the secret)
- Secret scanning "report issue" URL (Provide a mechanism for scanners to be alerted to false-positives)
Since we're applying these observations to individual files, not necessarily to releases, we might want to have the API endpoint be file-centric as well? Something along the lines of:
{
"scanner_info": {
"display_name": "GitGuardian",
"report_issue_url": "..."
},
"scan_results": [
{
"filename": "urllib3-2.0.3.tar.gz",
"digests": {
"sha256": "..."
},
"secrets": [
{
"type": "google_aiza",
"display_name": "Google API Key",
"filepath": "",
"line": 1,
"documentation_url": "...",
"validity_status": "VALID|NOT_VALID|UNKNOWN"
}
]
}
]
}
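To show how a file-centric payload like this might be consumed, here is a minimal sketch that flattens it into one record per secret. The key names are taken from the example above; the function name `flatten_findings` and the record layout are hypothetical and are not the actual Observation model.

```python
def flatten_findings(payload: dict) -> list:
    """Turn a file-centric payload into one flat record per reported secret."""
    scanner = payload.get("scanner_info", {})
    findings = []
    for result in payload.get("scan_results", []):
        for secret in result.get("secrets", []):
            findings.append(
                {
                    "scanner": scanner.get("display_name"),
                    "report_issue_url": scanner.get("report_issue_url"),
                    "filename": result.get("filename"),
                    "sha256": result.get("digests", {}).get("sha256"),
                    # Keep every per-secret field, including any extras
                    # beyond the minimum required set.
                    **secret,
                }
            )
    return findings
```

Copying all per-secret fields through unchanged fits the idea of storing extra information now and deciding later which parts to use.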