
Lossless encoding of XML-RPC data

Open wayphinder opened this issue 2 years ago • 4 comments

What's the problem this feature will solve?

The _clean_for_xml function strips characters that are illegal in XML from the data it returns. https://github.com/pypi/warehouse/blob/496338e94d6d62811671e7754507d3d8bc3942c0/warehouse/legacy/api/xmlrpc/views.py#L83-L93

This makes it harder to correlate this information with other sources. For example, the action field contains filenames, which might not match the actual filenames because some characters have been removed.

Describe the solution you'd like

Base64-encode the relevant fields, or otherwise encode them in a way that does not remove data.
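To make the difference concrete, here is a minimal sketch of a lossy cleanup next to a Base64 round trip. The `strip_illegal_xml_chars` helper is a hypothetical stand-in that only drops some control characters; it is not the actual `_clean_for_xml` implementation, which may behave differently. The sample action string is made up for illustration.

```python
import base64
import re

# Hypothetical stand-in for the sanitizer: drops control characters that
# XML 1.0 does not allow. The real _clean_for_xml may differ in detail.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text: str) -> str:
    """Lossy: removed characters cannot be recovered afterwards."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def encode_lossless(text: str) -> str:
    """Lossless: Base64 output is plain ASCII and always XML-safe."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_lossless(encoded: str) -> str:
    """Inverse of encode_lossless."""
    return base64.b64decode(encoded).decode("utf-8")


# Made-up action string containing a backspace character (\x08).
action = "add source file pkg\x08-1.0.tar.gz"

cleaned = strip_illegal_xml_chars(action)
round_tripped = decode_lossless(encode_lossless(action))

print(cleaned == action)        # the cleaned value no longer matches
print(round_tripped == action)  # the Base64 round trip does
```

The trade-off is that a Base64 field is no longer human-readable in the raw response, so clients have to decode it before use.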

Additional context

wayphinder avatar Aug 16 '23 17:08 wayphinder

Some previous context: https://github.com/pypi/warehouse/issues/5653

woodruffw avatar Aug 16 '23 17:08 woodruffw

Other context: I'm talking about this with @wayphinder in person. It sounds like the main place where this causes problems for him is in the changelog_since_serial endpoint, where e.action gets munged:

https://github.com/pypi/warehouse/blob/496338e94d6d62811671e7754507d3d8bc3942c0/warehouse/legacy/api/xmlrpc/views.py#L474

My first thought here was to add another member to the end of the list that gets returned here, essentially trading a bit of extra response size for probably not breaking compatibility (since the list will only strictly increase in size, and pre-existing fields won't change). But that might also cause issues that I'm not aware of.
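A rough sketch of that idea, under some assumptions: the row layout `(name, version, timestamp, action, serial)` is assumed here rather than taken from this thread, the Base64 encoding of the extra member is just one possible choice, and `make_entry`, `positional_consumer`, and `decode_raw_action` are hypothetical names.

```python
import base64


def make_entry(name, version, timestamp, action, serial, clean):
    """Build one changelog row with the raw action appended as a sixth member.

    `clean` is whatever lossy sanitizer is already applied to the action.
    """
    raw_b64 = base64.b64encode(action.encode("utf-8")).decode("ascii")
    # Existing positions are unchanged; the new member only goes at the end.
    return [name, version, timestamp, clean(action), serial, raw_b64]


def positional_consumer(entry):
    """An existing client that reads fields by index ignores the extra member."""
    return entry[0], entry[4]


def decode_raw_action(entry):
    """A newer client can recover the original action from the last member."""
    return base64.b64decode(entry[5]).decode("utf-8")


entry = make_entry(
    "pkg", "1.0", 1692205680, "add\x08 file", 42,
    clean=lambda s: s.replace("\x08", ""),  # toy sanitizer for the example
)
```

One way this could still break a client: code that unpacks the whole row into exactly five names (`a, b, c, d, e = entry`) raises `ValueError` once a sixth member is present, whereas index-based access keeps working.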

woodruffw avatar Aug 16 '23 17:08 woodruffw

The primary known/supported use-case for this endpoint is ~~PEP-381 and its most prominent implementation~~ bandersnatch.

bandersnatch currently consumes changelog_since_serial in a way that would not choke on the proposed fix (adding another member to the end of the list): https://github.com/pypa/bandersnatch/blob/b3517c5acf696008da0ecd9544a4823a676191d1/src/bandersnatch/master.py#L207-L216

But in general I'm very hesitant to wake the XMLRPC dragon, as we currently support it only for mirroring and do not intend to take on support for new uses.

ewdurbin avatar Aug 16 '23 18:08 ewdurbin

While changing the XML-RPC API would be great, for my use case a one-time dump of the current data in a lossless format would also work. A lot of the same data should be available in the BigQuery data set, but my understanding is that some historic data is missing, which is why I would like the changelog data.
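As an illustration of what such a dump could look like, here is a small sketch that writes changelog rows as JSON Lines. The format, the field names, and the `dump_changelog` / `load_changelog` helpers are all assumptions for the example, not an existing export. JSON is used because its string escaping preserves control characters instead of dropping them.

```python
import io
import json


def dump_changelog(rows, out):
    """Write one JSON object per line; control characters are escaped, not removed."""
    for name, version, timestamp, action, serial in rows:
        record = {
            "name": name,
            "version": version,
            "timestamp": timestamp,
            "action": action,
            "serial": serial,
        }
        out.write(json.dumps(record) + "\n")


def load_changelog(inp):
    """Read the dump back into a list of dicts."""
    return [json.loads(line) for line in inp if line.strip()]


# Made-up row whose action contains a backspace character (\x08).
rows = [("pkg", "1.0", 1692205680, "add source file pkg\x08-1.0.tar.gz", 42)]

buffer = io.StringIO()
dump_changelog(rows, buffer)
buffer.seek(0)
restored = load_changelog(buffer)
```

Reading the file back yields the original action strings unchanged, which is the property the XML-RPC response currently lacks.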

wayphinder avatar Aug 17 '23 11:08 wayphinder