New spec: Request body canonicalization

Open tw4l opened this issue 1 year ago • 10 comments

POST canonicalization (or POST append) is implemented in pywb, warcio.js, and cdxj-indexer, but doesn't have a written specification.

This was first implemented in pywb in https://github.com/webrecorder/pywb/pull/636. Various implementations are described here: https://github.com/webrecorder/replayweb.page/issues/69#issuecomment-935194637

A proper specification would likely help in resolving issues such as https://github.com/webrecorder/pywb/issues/768 as well as generally making it clear to users and developers the rationale and expected behavior for POST canonicalization.

tw4l avatar Apr 20 '23 17:04 tw4l

I had an attempt at documenting it while writing a Java implementation.

http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/

There are some interesting quirks to it. JSON null is encoded as the string "None". It can also produce slightly different output when run on old versions of Python (< 3.7).

ato avatar Jun 08 '23 07:06 ato

Thank you @ato, this is very helpful! Documenting the quirks especially - I'll have to look into the Python < 3.7 issues!

tw4l avatar Jun 12 '23 14:06 tw4l

We will be modifying pywb and warcio.js to be consistent - necessary issues have been opened. Notably, we'll be making pywb use native JSON values rather than Pythonic values (e.g. True, None) in the canonicalized URL.
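To illustrate the difference between Pythonic and native JSON values, here is a minimal sketch. It is not pywb's actual code: the function name `json_body_to_query` and its `native_json` flag are hypothetical, and it only handles a flat JSON object, whereas the real implementations also deal with nesting and other body types.

```python
import json
from urllib.parse import urlencode


def json_body_to_query(body: str, native_json: bool) -> str:
    """Flatten a flat JSON object into a query string (illustrative only)."""
    data = json.loads(body)
    pairs = []
    for key, value in data.items():
        if isinstance(value, str):
            text = value              # strings are used as-is
        elif native_json:
            text = json.dumps(value)  # JSON spelling: null, true, 1
        else:
            text = str(value)         # Python spelling: None, True, 1
        pairs.append((key, text))
    return urlencode(pairs)


body = '{"id": 1, "active": true, "parent": null}'
print(json_body_to_query(body, native_json=False))  # id=1&active=True&parent=None
print(json_body_to_query(body, native_json=True))   # id=1&active=true&parent=null
```

The two outputs differ only in how `true` and `null` are spelled, which is why URLs already canonicalized the old way need either fuzzy matching or a conversion step.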

We'll need to make sure that this isn't a breaking change and that already-canonicalized URLs created by pywb will either continue to work with pywb and wabac.js's fuzzy matching (preferred) or that we have a conversion process available.

tw4l avatar Jul 26 '23 20:07 tw4l

@tw4l I have the modified outbackcdx from @ato running with index version 5 and indexed a WARC using cdx-indexer -j -p. The urlkeys in outbackcdx look good for POST requests (they include __wb_method as well as the parameters, as described in ato's draft at http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/).
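For readers unfamiliar with what such a key is built from, here is a rough sketch of the append step, assuming a form-encoded body that is appended verbatim after `__wb_method`. The helper `append_method_query` is hypothetical; the exact encoding, parameter ordering, and handling of other body types are defined by the implementations and the draft guidelines, not by this sketch, and the result would still need to be converted to a urlkey by the indexer.

```python
def append_method_query(url: str, method: str, body: str) -> str:
    """Append __wb_method and a form-encoded body to a URL (hypothetical helper)."""
    # Use "&" if the URL already carries a query string, otherwise start one.
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}__wb_method={method}&{body}"


print(append_method_query("https://example.com/search", "POST", "q=warc&page=2"))
# https://example.com/search?__wb_method=POST&q=warc&page=2

print(append_method_query("https://example.com/search?lang=en", "POST", "q=warc"))
# https://example.com/search?lang=en&__wb_method=POST&q=warc
```

Because the method and body parameters end up in the URL itself, two POST requests to the same endpoint with different bodies get distinct index entries.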

However, I can't get the replay of POST requests working. It seems the __wb_method and value parameters are not getting through to outbackcdx (latest pywb 2.7.4) in the request from pywb. Before looking too far - did I miss something in the config? (I'm using index_paths: cdx+http://outbackcdx:8080/nb-webarchive)

kaij avatar Aug 17 '23 10:08 kaij

Pywb needs this patch to pass them through: https://github.com/nla/pywb/commit/2bb97fc7081a3260d6fdf7f2d248e0dd51dd6129

ato avatar Aug 17 '23 10:08 ato

It works very nicely now, thanks @ato!

Do you see any problems with using index version 5 in production? (besides having to recreate the index if the spec changes at a later time)

kaij avatar Aug 17 '23 11:08 kaij

We run index version 5 in production and haven't had any issues so far. The only reason it's not the default yet is that the upgrade process needs a bit more polishing.

ato avatar Aug 17 '23 11:08 ato

> We run index version 5 in production and haven't had any issues so far. The only reason it's not the default yet is that the upgrade process needs a bit more polishing.

Sounds good! Mentioning upgrade... so there is a way to upgrade the index? (so far I only used a newly created index) It would of course simplify things a lot.

kaij avatar Aug 17 '23 11:08 kaij

I've written up some notes about upgrading here: https://github.com/nla/outbackcdx/issues/117 If you have any further OutbackCDX questions please post them there as we've drifted off the topic of the POST canonicalization spec. :-)

ato avatar Aug 17 '23 12:08 ato

@ato PR is in and would love your eyes on it if you have the bandwidth: https://github.com/webrecorder/specs/pull/149

tw4l avatar Apr 04 '24 19:04 tw4l