Web UI for searching
Hi! I am loving hoardy-web so far :)
I have got hoardy-web serve up and running successfully and serving archived websites, but one feature that would be wonderful to have is a search UI. Are there any plans to implement one, or would a PR with such a feature, tastefully implemented of course, be likely to be accepted?
It also looks like the finding and filtering functionality in map_wrr_paths is unindexed, which would definitely affect the speed of such a search interface; AFAICT these read all files on disk for each query. I am thinking of throwing some indices into a sqlite db in the root of each data store, but not sure if you've already got plans in this area?
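For concreteness, the sort of thing I'm imagining is a minimal per-store index like the following; the schema and all names here are placeholders I made up, not anything hoardy-web actually has:

```python
# Hypothetical per-store index: one sqlite db at the store root mapping
# reqres metadata to WRR file paths, so filtering doesn't re-read every file.
import sqlite3

def open_index(store_root):
    db = sqlite3.connect(f"{store_root}/index.sqlite3")
    db.execute(
        """CREATE TABLE IF NOT EXISTS reqres (
               path   TEXT PRIMARY KEY,  -- WRR file path relative to the store root
               url    TEXT NOT NULL,
               status INTEGER,
               mtime  REAL
           )"""
    )
    db.execute("CREATE INDEX IF NOT EXISTS reqres_url ON reqres (url)")
    db.commit()
    return db
```

With something like that, a URL filter becomes a single indexed lookup instead of a walk over every file on disk.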
Hi! Thanks for the kind words!
Yes, currently serve has no full-text search and all reqres filtering with find and such is done without indexes.
And yes, indexes are planned, eventually.
To elaborate a little, I actually have another, yet unpublished, bunch of scripts which I plan to rework into a single-command app I will probably simply call hoardy.
Those scripts mainly do three things: file de-duplication (à la fdupes, but with an index); set operations on directories (e.g. "get me paths to all unique files in this directory (ignoring duplicates)", "list files common to these three directories", "list files present in one of these two directories but missing from the third", etc.); and syncing of those sets across disks and hosts ("copy all files present in this directory and matching this filter to another host if they are not present there and also not present on yet another host") efficiently (not linearly, like rsync, but with Merkle trees over indexes).
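Just to illustrate the semantics of those set operations, here is a toy, index-free sketch (the real scripts are index-backed and use Merkle trees, so they don't re-hash everything each time):

```python
# Toy versions of the directory set operations described above; hashing
# everything on every call is exactly what the index avoids.
import hashlib
import pathlib

def digests(root):
    # map: content hash -> set of paths under root with that content
    out = {}
    for p in pathlib.Path(root).rglob("*"):
        if p.is_file():
            h = hashlib.sha256(p.read_bytes()).hexdigest()
            out.setdefault(h, set()).add(p)
    return out

def unique_files(root):
    # "paths to all unique files in this directory (ignoring duplicates)"
    return sorted(min(paths) for paths in digests(root).values())

def common_files(*roots):
    # content hashes present in every one of the given directories
    return set.intersection(*(set(digests(r)) for r in roots))
```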
My current plan is to rework and publish that first, then split out my indexing code from there into a separate library (or put it into kisstdlib, maybe), which I would then reuse in hoardy-web to add indexing here.
Also, realistically, a good search page needs to be completely asynchronous and hoardy-web serve is completely synchronous at the moment, so I also need to clean up and publish my KISS asyncio modules to kisstdlib (I hate the standard asyncio, sorry, not sorry) and either find a compatible HTTP protocol parser or write my own first...
So, it will probably take a while to do this properly.
But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple sqlite index, with an understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.
(Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with black and then a bunch of whole-repo edits fixing many pylint warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak.)
... I have a huge change re-formatting everything with black and then a bunch of whole-repo edits fixing many pylint warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak. ...
This bit is done now.
Thanks for the quick and detailed reply (and sorry for the slow response)
Also, realistically, a good search page needs to be completely asynchronous
I'm not sure what you mean by this, could you explain?
I haven't fully fleshed out the whole search HTTP query architecture, but one possible way is to have requests return a fixed number of results (e.g. 100) along with a continuation token -- if sqlite can cough up 100 results at a time (this depends on the indexing structure used as well, of course, but 10ms--100ms should be doable), then long-running queries would be broken up into many individual requests, which would prevent one long query from stalling the entire server for everyone
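Concretely, I'm picturing something like keyset pagination, so each request does a bounded amount of work (table and column names here are made up, building on a hypothetical index table):

```python
# Hypothetical continuation-token pagination over a sqlite index: the token
# is just the last rowid seen, so each page is one bounded, indexed query.
def search_page(db, query, after_rowid=0, limit=100):
    rows = db.execute(
        "SELECT rowid, path, url FROM reqres"
        " WHERE url LIKE ? AND rowid > ?"
        " ORDER BY rowid LIMIT ?",
        (f"%{query}%", after_rowid, limit),
    ).fetchall()
    # a full page means there may be more; hand the client a token to continue
    token = rows[-1][0] if len(rows) == limit else None
    return rows, token
```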
I am likely also missing background context on how the hoardy-web web server is deployed; e.g. I would assume that it's intended for a small number of occasional users, and that a bit of tail latency from concurrent search requests is no big deal, but I could be totally wrong there
But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple sqlite index, with an understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.
Sounds reasonable to me :) I'll see what I can come up with
Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with black ...
This bit is done now.
nice :)
Also, realistically, a good search page needs to be completely asynchronous
I'm not sure what you mean by this, could you explain?
I haven't fully fleshed out the whole search HTTP query architecture, but ...
I mean, the problem with any broken-up-with-HTTP-continuations synchronous design is that HTTP requests themselves will still be processed synchronously, so while the search is generating its 100 results, the archiving will stop working.
You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.
A good async implementation would use WebSockets to return search results, solving both issues.
while the search is generating its 100 results, the archiving will stop working
Ah yes, this is true, but I am less concerned for now, since:
- archiving requests are not latency-sensitive; as long as throughput isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
- there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
- it's not yet clear what query latencies would be like in practice, so maybe they'd be low enough to not have to handle specially, and
- if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult
So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics
You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.
Possible! That thread could be limited to precomputing the next N results, with an eventual timeout -- but the first request must either be done synchronously or have special handling, which would be nice to avoid if possible
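Something like this is what I have in mind for the polled-thread variant (a rough sketch, all names made up):

```python
# Rough sketch: a worker thread precomputes up to `prefetch` results into a
# queue; if the client stops polling for `timeout` seconds, it gives up.
import queue
import threading
import time

class SearchJob:
    def __init__(self, results_iter, prefetch=100, timeout=60.0):
        self.queue = queue.Queue(maxsize=prefetch)
        self.timeout = timeout
        self.last_poll = time.monotonic()
        self._thread = threading.Thread(target=self._run, args=(results_iter,), daemon=True)
        self._thread.start()

    def _run(self, results_iter):
        for r in results_iter:
            while True:
                if time.monotonic() - self.last_poll > self.timeout:
                    return  # client went away; abandon the search
                try:
                    self.queue.put(r, timeout=1.0)
                    break
                except queue.Full:
                    pass  # queue is full; re-check the timeout and retry

    def poll(self, n=100):
        # called from the HTTP handler; drains up to n precomputed results
        self.last_poll = time.monotonic()
        out = []
        while len(out) < n:
            try:
                out.append(self.queue.get_nowait())
            except queue.Empty:
                break
        return out
```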
A good async implementation would use WebSockets to return search results, solving both issues.
I'm not familiar with how websockets would work with a Flask server -- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling in the actual HTTP server bit, which might complicate your plan to change to async. But if you are strongly in favour of a websocket implementation, it should also be not too much work to change a standard Flask-route-based implementation to use websockets, once that standard implementation exists :)
while the search is generating its 100 results, the archiving will stop working
- archiving requests are not latency-sensitive; as long as throughput isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
Depends on the search speed, I suppose. Having thousands of reqres waiting in extension memory would be annoying.
- there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
The server is stateful since archiving -> dump parsing -> indexing is stateful.
- if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult
Yes, which is why I put it off for later. :)
So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics
Meanwhile, I'm actively working on cleaning up and publishing my file indexer.
I'm not familiar with how websockets would work with a Flask server
hoardy-web uses Bottle, not Flask; Flask is too complex for me.
(Bottle is too, a bit.
I would prefer a bare-HTTP "framework" with request dispatch instead of wrappers over WSGI/CGI/FCGI.
But it is the simplest thing I know of, ATM, so hoardy-web uses it.
Yes, I'm very opinionated.)
-- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling with the actual HTTP server bit, which might complicate your plan to change to async.
WebSockets: you make an HTTP request, it ends with "101 Switching Protocols", and the rest of the connection is now a WebSockets connection.
The WebSockets protocol is, basically, message-based TCP, i.e. guaranteed order and delivery, though not as a plain byte stream but as separate typed messages.
But, since it's bidirectional, both sides can notice when the other disconnects or just stops working (there's a PING message type).
So, as to your statement: not really, search would simply spawn a separate thread (OS or async, does not matter) and quietly work away, talking to its own WebSocket. And immediately stop if that socket dies.
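E.g. with the third-party websockets package (not something hoardy-web uses, just to show the shape of it; run_search is a stub standing in for the real index lookup):

```python
# Sketch of the flow: the library handles the "101 Switching Protocols"
# upgrade and the PING keep-alives; a dead socket kills the search.
import asyncio
import websockets

async def run_search(query):
    # stub standing in for the real index lookup
    for i in range(3):
        yield f"result {i} for {query!r}"

async def handle_search(ws):
    query = await ws.recv()              # client sends its query first
    async for hit in run_search(query):
        await ws.send(hit)               # raises ConnectionClosed when the
                                         # client disconnects, ending the search

async def main():
    async with websockets.serve(handle_search, "127.0.0.1", 8765):
        await asyncio.Future()           # serve forever

asyncio.run(main())
```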
The server is stateful since archiving -> dump parsing -> indexing is stateful.
This is only stateful because the serve path's index is maintained in memory (in the SortedIndex), right?
Meanwhile, I'm actively working on cleaning up and publishing my file indexer.
Of course you may be planning to go in a completely different direction~ but just for reference, I have some prototype code implementing that interface via sqlite on disk in https://github.com/aidanholm/hoardy-web/commit/33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints
With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 0.5% overhead, serving startup is now "instant", and IIUC this would make the server stateless
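(The "instant" startup is just because nothing needs re-parsing on boot; hypothetically, a refresh pass only has to stat files and reindex the ones that changed. This is not what the linked commit literally does, just the idea:)

```python
# Hypothetical startup refresh: compare on-disk mtimes against the index and
# only re-parse WRR files that changed, instead of reading the whole store.
import os

def refresh_index(db, store_root):
    known = dict(db.execute("SELECT path, mtime FROM reqres"))
    for dirpath, _dirs, files in os.walk(store_root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime
            if known.get(path) != mtime:
                reindex_file(db, path, mtime)  # hypothetical: parse + upsert row
```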
hoardy-web uses Bottle, not Flask; Flask is too complex for me.
I can only agree :) I've used flask a fair bit at $WORK and found it deceptively simple (haven't tried Bottle yet) so this stance makes complete sense to me
The server is stateful since archiving -> dump parsing -> indexing is stateful.
This is only stateful because the serve path's index is maintained in memory (in the SortedIndex), right?
Yes, but even if it were not, replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I want those buttons to work even if the tab in question is not yet fully fetched: they would wait for everything to fetch and get archived, and then immediately switch to the replay. Which needs the replay to be synchronous with archival.
I have some prototype code implementing that interface via sqlite on disk in https://github.com/aidanholm/hoardy-web/commit/33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints
Yes, this is basically what I expect it would look like.
(Also, SortedIndex clearly needs a generic interface.)
With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 0.5% overhead, serving startup is now "instant", and IIUC this would make the server stateless
Your implementation is cute, but this won't work for WRR bundles and the like, as those need to be sub-file.
Also, full-text indexing will need indirections to deduplicate indexing of same-content data.
The complete version won't be as cute, unfortunately.
replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?
Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least some state)?
this won't work for WRR bundles and the like, as those need to be sub-file.
IIUC a wrr bundle is basically multiple wrr files directly byte-concatenated? An offset column could be added (and a size column, if it cannot be inferred by the wrrb loader).
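Something like this hypothetical schema is what I mean, where an index row addresses a byte range instead of a whole file:

```python
# Hypothetical sub-file index: (path, offset, size) addresses one record
# inside a .wrr file or .wrrb bundle, so replay can slice it out directly.
import sqlite3

db = sqlite3.connect("index.sqlite3")
db.execute(
    """CREATE TABLE IF NOT EXISTS reqres (
           path   TEXT NOT NULL,     -- .wrr file or .wrrb bundle
           offset INTEGER NOT NULL,  -- byte offset of the record in the file
           size   INTEGER NOT NULL,  -- record length
           url    TEXT NOT NULL,
           PRIMARY KEY (path, offset)
       )"""
)
db.commit()

def read_record(path, offset, size):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)  # the raw WRR record, ready for decoding
```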
Also, full-text indexing will need indirections to deduplicate indexing of same-content data.
I am currently playing around a bit with sqlite's full-text search -- it is possible to index text without storing a copy of the indexed content, and indexing response bodies only for textual response content types results in reasonably small indexes even without doing any content deduplication; I got index sizes of about 4% of the data store
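The trick is sqlite's contentless FTS5 tables; the inverted index is built but the text itself is never stored (table and column names here are made up):

```python
# Contentless FTS5: content='' keeps only the inverted index, not the text,
# which is where the small index sizes come from; rowid links each FTS entry
# back to the corresponding metadata row.
import sqlite3

db = sqlite3.connect("index.sqlite3")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS body_fts USING fts5(body, content='')")
# index a response body under the matching reqres rowid; the text is
# tokenized and discarded, only the index entries are kept
db.execute("INSERT INTO body_fts (rowid, body) VALUES (?, ?)", (1, "some response text"))
db.commit()
# contentless tables can only return rowids (there is no stored text to return)
hits = db.execute("SELECT rowid FROM body_fts WHERE body_fts MATCH ?", ("text",)).fetchall()
```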
replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?
No, that's a bit too strict. (And I'm not sure how one could make hoardy-web serve ever guarantee that.)
But, as a client, if you dump a new visit for a URL, the server says 200 OK, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.
Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least some state)?
Yes, but that's kind of the point of having them in the same process.
Which is why it needs to be properly async.
But, as a client, if you dump a new visit for a URL, the server says 200 OK, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.
Ah I see, makes sense
Yes, but that's kind of the point of having them in the same process.
I am currently running hoardy_web_sas.py separately to hoardy_web serve; not sure how supported this configuration is in general, but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?
I am currently running hoardy_web_sas.py separately to hoardy_web serve; not sure how supported this configuration is in general,
It would work fine if you disable capture before going to a replay URL, otherwise replays would get archived too.
but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?
Correct.