
Indexing Errors with YouTube JSON in POST Request Payload

Open mona-ul opened this issue 2 years ago • 0 comments

Describe the bug

A WARC file cannot be indexed with pywb (wb-manager reindex, cdx-indexer) or with cdxj-indexer. All three indexing methods fail with an error; the pywb tools report “Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte”.

WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc

The WARC record causing the problem seems to be a POST request whose payload contains query data in JSON. The following WARC records were identified as causing the error:

  • request: urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90

    • WARC-Target-URI: https://www.youtube.com/youtubei/v1/log_event?alt=json&key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8
    • Payload: request_payload_utf8.txt
  • response: urn:uuid:78218a84-3c12-11ee-804f-0242c0a89008

    • WARC-Target-URI: https://www.youtube.com/youtubei/v1/log_event?alt=json&key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8
    • Payload: response_payload_utf8.txt

cdxj-indexer Error Message

cdxj-indexer -p [warc file] > [index file] 
Error parsing: {"context":{"client":{"hl":"en","gl":"US","clientName":1,"clientVersion":"2.20230815.00.00","configInfo": [...]

The error refers to the payload of the Request Record urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90. (Full payload attached above)

wb-manager reindex Error Message

wb-manager reindex [collection]
Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

cdx-indexer Error Message

cdx-indexer -p [WARC file]
[...]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/.local/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 468, in main
    write_multi_cdx_index(cmd.output, cmd.inputs,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 342, in __call__
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 215, in join_request_records
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 188, in create_record_iter
    post_query = MethodQueryCanonicalizer(method,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/warcserver/inputrequest.py", line 281, in __init__
    sys.stderr.write("Ignoring query, error parsing as json: " + query.decode("utf-8") + "\n")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
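
The byte in the message may be a hint: 0x1f 0x8b are the first two bytes of a gzip stream, so a 0x8b at position 1 is consistent with a gzip-compressed request body reaching the `query.decode("utf-8")` call shown in the traceback. This is an assumption and has not been verified against the record itself. warcio's `content_stream()` decodes any Content-Encoding, which would also fit with the warcio script further down not hitting the error. The following is a minimal sketch that reproduces the same exception with a hypothetical stand-in payload and shows a tolerant decode for log output; `looks_gzipped` and `printable` are illustrative helpers, not pywb functions.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(payload: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return payload[:2] == GZIP_MAGIC


def printable(payload: bytes) -> str:
    """Decode bytes for a log message without raising on non-UTF-8 input."""
    return payload.decode("utf-8", errors="replace")


# Hypothetical stand-in for the request body, not the actual payload
body = gzip.compress(json.dumps({"context": {"client": {"hl": "en"}}}).encode("utf-8"))

# A strict decode fails on the second byte of the gzip header
try:
    body.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

print(looks_gzipped(body))
print(strict_error)
print(printable(body)[:20])
print(json.loads(gzip.decompress(body)))
```

If the assumption holds, the crash comes from the error-reporting line rather than from the JSON parsing itself, so a tolerant decode there would let indexing continue past the record.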

Steps to reproduce the bug

  • Download the WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc
  • Index the file with the tools and commands described above

Environment

  • OS: Ubuntu 22.04
  • Version pywb 2.7.4
  • Version cdxj-indexer 1.4.5
  • Version warcio 1.7.4

Additional context

Identification of Error Records

Each indexing method writes a partial index.cdxj before failing. The entries of the WARC file and the index file were compared in chronological order, and the first WARC entry missing from the index was identified as the record causing the problem. To verify this, the record was removed using warcio. After that, indexing worked.
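
The comparison can be sketched roughly as follows. This is an illustration rather than the exact procedure used for this report: `first_unindexed` and its helpers are hypothetical names, the WARC entries are assumed to be `(WARC-Date, WARC-Target-URI)` pairs collected beforehand (for example with warcio) for the record types that get their own index line, and each CDXJ line is assumed to have the form `<key> <timestamp> <json>` with a `url` field in the JSON part.

```python
import json
import re


def warc_date_to_cdx_ts(warc_date):
    """Convert a WARC-Date such as 2023-08-21T13:38:16Z to a 14-digit timestamp."""
    return re.sub(r"[^0-9]", "", warc_date)[:14]


def parse_cdxj_line(line):
    """Return (timestamp, url) for one CDXJ line of the form '<key> <timestamp> <json>'."""
    _key, timestamp, fields = line.rstrip("\n").split(" ", 2)
    return timestamp, json.loads(fields)["url"]


def first_unindexed(warc_entries, cdxj_lines):
    """Return the first (warc_date, url) pair with no matching index line, or None."""
    indexed = {parse_cdxj_line(line) for line in cdxj_lines if line.strip()}
    for warc_date, url in warc_entries:
        if (warc_date_to_cdx_ts(warc_date), url) not in indexed:
            return warc_date, url
    return None


# Made-up example data, not taken from the WARC file in this report
cdxj_lines = [
    'com,example)/ 20230821133816 {"url": "https://example.com/", "offset": "0"}',
]
warc_entries = [
    ("2023-08-21T13:38:16Z", "https://example.com/"),
    ("2023-08-21T13:38:17Z", "https://example.com/log"),
]
print(first_unindexed(warc_entries, cdxj_lines))
```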

WARC-Processing with warcio

The WARC file and the identified error records were processed with warcio, and no UTF-8 error occurred.

import sys

from warcio.archiveiterator import ArchiveIterator

# Path to the WARC file, passed as the first command-line argument
warc1_path = sys.argv[1]

with open(warc1_path, 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        # Print the target URI and record ID of every record
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        print(i, record.rec_headers.get_header('WARC-Record-ID'))
        if record.rec_type == 'request':
            # content_stream() returns the payload with any Content-Encoding decoded
            content = record.content_stream().read()
            print(content.decode('utf-8'))

mona-ul · Oct 26 '23 15:10