purl-spec icon indicating copy to clipboard operation
purl-spec copied to clipboard

Keep Hugging Face PURL version component case as-is

Open maitre-matt opened this issue 2 years ago • 3 comments
trafficstars

Update test data so version component is not lower-cased

maitre-matt avatar Jan 05 '23 00:01 maitre-matt

I believe this will cause implementations that respect the following rule in the PURL-TYPES spec to fail:

The version is the model revision Git commit hash. It is case insensitive and must be lowercased in the package URL.

edit: Or hm I might've spoke too soon, the above probably only applies to the canonical_url? The version component itself is parsed as-is?

williamboman avatar Jan 05 '23 00:01 williamboman

I am not sure: I am not clear whether the version JSON field is about the canonical URL or the original URL. The test could go either way.

maitre-matt avatar Jan 05 '23 00:01 maitre-matt

FWIW this breaks the test case in an implementation I wrote. I'm not implying that my implementation is correct (it's merely my interpretation of this particular part of the spec). Thought I'd share!

williamboman avatar Jan 05 '23 00:01 williamboman

@maitre-matt I don't understand the reason you want the alpha characters in the huggingface version value to be changed from lowercase to uppercase -- the type definition says

The version is the model revision Git commit hash. It is case insensitive and must be lowercased in the package URL.

This appears to be language you added via PR 201.

johnmhoran avatar Jun 04 '25 00:06 johnmhoran

Forgot to add the link to the type definition: https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#huggingface.

johnmhoran avatar Jun 04 '25 00:06 johnmhoran

@johnmhoran: in PR #201 there was a report that the lower-case string in the version JSON field was breaking Go and Python implementations at the time: https://github.com/package-url/purl-spec/pull/201/files#r1061825523

This PR keeps the version segment in canonical_purl lower-case, matching the part of the spec quoted above.

Whether the version in canonical_purl and the string version should match, I am not sure. Above it was mentioned this PR was breaking other implementations. Given this PR has been opened for 2+ years, it's probably a safe bet it can be discarded.

maitre-matt avatar Jun 04 '25 03:06 maitre-matt

@maitre-matt Thank you for your reply -- now I see your point.

It's not clear to me -- from the spec and examples and test objects -- how all the individual values are meant to be calculated once one has the test input PURL. For me this is a bit hazy in various situations including when "is_invalid" is set to true. See, e.g., https://github.com/package-url/purl-spec/issues/419.

And I think this section needs clarification with details and some examples: https://github.com/package-url/purl-spec/blob/main/PURL-SPECIFICATION.rst#tests

johnmhoran avatar Jun 04 '25 04:06 johnmhoran

Hey two things:

  1. HF hashes are always lower case, so the spec looks fine to me. For instance: https://huggingface.co/CIRCL/vulnerability-severity-classification-roberta-base/commit/85a234292acc526d81ef91f74d0dc5ccfd6ac016

  2. After the merge of PR #514, PURL types are now defined in JSON:

  • #514

With the new approach... we have crisper definition of tests and decoded vs canonical expectations.... Here IMHO this should stay lowercase.

Thanks for your understanding and patience

See also:

  • https://github.com/package-url/purl-spec/pull/534

@williamboman what's your implementation? I would love to help to implement the new test approach and spec refinements!

pombredanne avatar Jul 26 '25 15:07 pombredanne