packaging icon indicating copy to clipboard operation
packaging copied to clipboard

Add an `escape_name()` function

Open brettcannon opened this issue 3 years ago • 7 comments

See https://github.com/pypa/packaging/issues/527 for the motivation. Basically utils.canonicalize_name(name).replace("-", "_") to match PEP 503 and 427 for project name normalization for sdists and wheels.

brettcannon avatar May 05 '22 19:05 brettcannon

I think simple replace is insufficient, see Escaping section in PEP427:

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:

This means that multiple characters should escape to a single _. utils.parse_wheel_name will even raise an exception if there's a double underscore anywhere in the name..

MrMino avatar May 06 '22 21:05 MrMino

I think simple replace is insufficient, see Escaping section in PEP427:

Each component of the filename is escaped by replacing runs of non-alphanumeric characters with an underscore _:

The runs of symbols is taken care of by canonicalize_name().

brettcannon avatar May 09 '22 23:05 brettcannon

There's one tiny bit of ambiguity here, PEP 427 escaping doesn't require canonicalizing the name, while PEP 625 does. However, in practice the practical difference between those two statements is whether or not the output is lowercase or not. However, PEP 427 doesn't treat names as case sensitive, so there's no requirement to preserve case so I think it's perfectly fine to do that.

Actually, the other practical difference is that PEP 427 normalize leaves . in a section whereas this implementation does not allow that. I still don't think that matters exactly, but it does mean that the PEP 427 normalization function can work on any segment (name, version, compat tags, whatever), whereas the above implementation can only work on names.

I think that's OK, since the name of the function is escape_name(). Though it does raise the question if it makes sense to do something like:

import re

def escape_filename_segment(segment):  # better name?
    return re.sub("[^\w\d.]+", "_", segment, re.UNICODE)

Since that would work for all cases.

If not, does it make sense to add escape_version (which may or may not normalize version numbers? I'm not sure what makes sense)?

Dunno, I think all of the options are fine, just spelling out the edge cases.

dstufft avatar May 10 '22 03:05 dstufft

it does mean that the PEP 427 normalization function can work on any segment (name, version, compat tags, whatever), whereas the above implementation can only work on names.

I think that's OK, since the name of the function is escape_name().

That was my thinking: only care about a portion of the file when e.g. constructing an sdist file name.

it does raise the question if it makes sense to do something like:

import re

def escape_filename_segment(segment):  # better name?
    return re.sub("[^\w\d.]+", "_", segment, re.UNICODE)

Since that would work for all cases.

Interesting idea. I just don't know how often people are escaping the entire file at the end versus as they go.

If not, does it make sense to add escape_version (which may or may not normalize version numbers? I'm not sure what makes sense)?

I don't think so? But I don't have a good reason to think otherwise.

brettcannon avatar May 11 '22 22:05 brettcannon

Thinking about this more, maybe it would be better to expand the canonicalize_name() docs to say it can be used to escape a name with canonicalize_name(name).replace('-', '_')? I'm not sure if maintaining a separate function for such a niche thing is worth it as long as it's documented clearly how to produce the results? At least with canonicalize_name() the regex can be messed up and the type annotation on the return type is useful. But the return type for an escaped name isn't really there since it isn't something you would normally pass around as much as produce and then immediately consume.

brettcannon avatar Jun 29 '22 20:06 brettcannon

Or if we’re to add an API, I’d propose having something like ensure_valid_filename_component that also performs validation (i.e. makes sure the input is a valid packaging identifier to begin with).

uranusjr avatar Jun 30 '22 05:06 uranusjr