Allow running replay without an explicit index

Open ibnesayeed opened this issue 7 years ago • 3 comments

Now that we have a way to upload WARCs from the admin interface (#436), an index should no longer be mandatory at replay startup. The new replay CLI should behave like this:

  • ipwb replay should start replay with a randomly generated empty index that only contains metadata
  • ipwb index some.warc | ipwb replay should utilize the resulting index received from the pipe for replay
  • ipwb replay sample.cdxj should use sample.cdxj for replay
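The three cases above can be read as a small decision rule. Here is a minimal sketch of that rule, assuming a hypothetical helper named choose_index_source (not part of ipwb) and assuming that an explicit index argument takes precedence over piped input:

```python
import sys


def choose_index_source(index_arg, stdin=None):
    """Return which of the three proposed replay modes applies.

    Hypothetical helper for illustration only; it is not an ipwb function.
    """
    stdin = sys.stdin if stdin is None else stdin
    if index_arg:
        # `ipwb replay sample.cdxj`: use the given index file.
        return "file"
    if not stdin.isatty():
        # `ipwb index some.warc | ipwb replay`: read the index from the pipe.
        return "pipe"
    # `ipwb replay`: start with a generated index holding only metadata.
    return "empty"
```

Checking stdin.isatty() is one common way to detect a pipe. It also reports "not a TTY" when stdin is redirected from a file or closed, so a real implementation would likely want to confirm that data actually arrived.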

ibnesayeed avatar Aug 15 '18 02:08 ibnesayeed

Related: #504

machawk1 avatar Aug 15 '18 03:08 machawk1

I started coding this then realized I was resurfacing old CDXJ /tmp/ generation code.

In __main__.py, in the else branch that states an index is required, we can call generateCDXJMetadata() to get a string, but replay.py's start() expects a path to the index. Rather than creating a temp file, perhaps we can modify start() to optionally take a string. However, the current function seems to rely heavily on the index existing at some path rather than being passed in.

Which route do you want to go, @ibnesayeed?

machawk1 avatar Aug 17 '18 03:08 machawk1

The fix should be simple, and some code branches in checkArgs_replay might go away. Here is how I envision the new pseudocode:

def checkArgs_replay(args):
    if not args.index: # irrespective of pipe or lack thereof
        args.index = generate_random_index_file_path()
    if pipe and pipe_data:
        write_pipe_data_to_index()
    # fix proxy as usual
    replay.start(cdxjFilePath=args.index, proxy=proxy)
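For reference, the index-path part of that pseudocode could look roughly like the runnable sketch below. The name prepare_index_path and its metadata parameter are hypothetical, and the metadata string is assumed to come from something like generateCDXJMetadata(). Unlike the pseudocode, this version only writes piped data into a generated file, which is an assumption made to avoid overwriting an index the user named explicitly.

```python
import os
import sys
import tempfile


def prepare_index_path(index_arg, stdin=None, metadata=""):
    """Return a path to a CDXJ index file, creating one if none was given.

    Hypothetical helper for illustration; not part of ipwb.
    """
    stdin = sys.stdin if stdin is None else stdin
    if index_arg:
        # An explicit index path is used as-is (assumption: never overwritten).
        return index_arg
    # mkstemp creates a uniquely named file and returns an open descriptor.
    fd, path = tempfile.mkstemp(prefix="ipwb-", suffix=".cdxj")
    # Read piped index data if stdin is not an interactive terminal.
    piped = "" if stdin.isatty() else stdin.read()
    with os.fdopen(fd, "w") as index_file:
        # Fall back to the metadata-only index when nothing was piped in.
        index_file.write(piped or metadata)
    return path
```

The returned path could then be passed on as replay.start(cdxjFilePath=path, proxy=proxy), as in the pseudocode. Cleanup of the generated file is left out of this sketch.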

ibnesayeed avatar Aug 17 '18 03:08 ibnesayeed