Web UI for searching
Hi! I am loving hoardy-web so far :)
I have got hoardy-web serve up and running successfully and serving archived websites, but one feature that would be wonderful to have is a search UI. Are there any plans to implement one, or would a PR with such a feature, tastefully implemented of course, be likely to be accepted?
It also looks like the finding and filtering functionality in map_wrr_paths is unindexed, which would definitely affect the speed of such a search interface; AFAICT these read all files on disk for each query. I am thinking of throwing some indices into a sqlite db in the root of each data store, but not sure if you've already got plans in this area?
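For concreteness, the sort of thing I'm imagining is a minimal per-store index like the following; the schema and all names here are placeholders I made up, not anything hoardy-web actually has:

```python
# Hypothetical per-store index: one sqlite db at the store root mapping
# reqres metadata to WRR file paths, so filtering doesn't re-read every file.
import sqlite3

def open_index(store_root):
    db = sqlite3.connect(f"{store_root}/index.sqlite3")
    db.execute(
        """CREATE TABLE IF NOT EXISTS reqres (
               path   TEXT PRIMARY KEY,  -- WRR file path relative to the store root
               url    TEXT NOT NULL,
               status INTEGER,
               mtime  REAL
           )"""
    )
    db.execute("CREATE INDEX IF NOT EXISTS reqres_url ON reqres (url)")
    db.commit()
    return db
```

With something like that, a URL filter becomes a single indexed lookup instead of a walk over every file on disk.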
Hi! Thanks for the kind words!
Yes, currently serve has no full-text search and all reqres filtering with find and such is done without indexes.
And yes, indexes are planned, eventually.
To elaborate a little, I actually have another, yet unpublished, bunch of scripts which I plan to rework into a single-command app I will probably simply call hoardy.
Those scripts mainly do three things: file de-duplication (à la fdupes, but with an index); set operations on directories (e.g. "get me paths to all unique files in this directory (ignoring duplicates)", "list files common to these three directories", "list files present in one of these two directories but missing from the third", etc.); and syncing of those sets across disks and hosts ("copy all files present in this directory and matching this filter to another host if they are not present there and also not present on yet another host") efficiently (not linearly, like rsync, but with Merkle trees over indexes).
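Just to illustrate the semantics of those set operations, here is a toy, index-free sketch (the real scripts are index-backed and use Merkle trees, so they don't re-hash everything each time):

```python
# Toy versions of the directory set operations described above; hashing
# everything on every call is exactly what the index avoids.
import hashlib
import pathlib

def digests(root):
    # map: content hash -> set of paths under root with that content
    out = {}
    for p in pathlib.Path(root).rglob("*"):
        if p.is_file():
            h = hashlib.sha256(p.read_bytes()).hexdigest()
            out.setdefault(h, set()).add(p)
    return out

def unique_files(root):
    # "paths to all unique files in this directory (ignoring duplicates)"
    return sorted(min(paths) for paths in digests(root).values())

def common_files(*roots):
    # content hashes present in every one of the given directories
    return set.intersection(*(set(digests(r)) for r in roots))
```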
My current plan is to rework and publish that first, then split out my indexing code from there into a separate library (or put it into kisstdlib, maybe), which I would then reuse in hoardy-web to add indexing here.
Also, realistically, a good search page needs to be completely asynchronous and hoardy-web serve is completely synchronous at the moment, so I also need to clean up and publish my KISS asyncio modules to kisstdlib (I hate the standard asyncio, sorry, not sorry) and either find a compatible HTTP protocol parser or write my own first...
So, it will probably take a while to do this properly.
But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple sqlite index, with an understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.
(Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with black and then a bunch of whole-repo edits fixing many pylint warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak.)
... I have a huge change re-formatting everything with black and then a bunch of whole-repo edits fixing many pylint warnings. I'm currently debugging these changes, because they unexpectedly broke tests, and I'm trying to figure out why as we speak. ...
This bit is done now.
Thanks for the quick and detailed reply (and sorry for the slow response)
Also, realistically, a good search page needs to be completely asynchronous
I'm not sure what you mean by this, could you explain?
I haven't fully fleshed out the whole search HTTP query architecture, but one possible way is to have requests return a fixed number of results (e.g. 100) along with a continuation token -- if sqlite can cough up 100 results at a time (this depends on the indexing structure used as well, of course, but 10ms--100ms should be doable), then long-running queries would be broken up into many individual requests, which would prevent one long query from stalling the entire server for everyone
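Concretely, I'm picturing something like keyset pagination, so each request does a bounded amount of work (table and column names here are made up, building on a hypothetical index table):

```python
# Hypothetical continuation-token pagination over a sqlite index: the token
# is just the last rowid seen, so each page is one bounded, indexed query.
def search_page(db, query, after_rowid=0, limit=100):
    rows = db.execute(
        "SELECT rowid, path, url FROM reqres"
        " WHERE url LIKE ? AND rowid > ?"
        " ORDER BY rowid LIMIT ?",
        (f"%{query}%", after_rowid, limit),
    ).fetchall()
    # a full page means there may be more; hand the client a token to continue
    token = rows[-1][0] if len(rows) == limit else None
    return rows, token
```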
I am likely also missing background context on how the hoardy-web web server is deployed; e.g. I would assume that it's intended for a small number of occasional users, and that a bit of tail latency from concurrent search requests is no big deal, but I could be totally wrong there
But, I suppose, if this is super-important to you, I would not be completely opposed to accepting a hacky implementation with a simple sqlite index, with an understanding that this will be re-implemented in the future, and the DB format will not be compatible with the future version and everything will need to be re-indexed.
Sounds reasonable to me :) I'll see what I can come up with
Though, if you plan to do this, then please wait a couple of days before starting, because I have a huge change re-formatting everything with black ...
This bit is done now.
nice :)
Also, realistically, a good search page needs to be completely asynchronous
I'm not sure what you mean by this, could you explain?
I haven't fully fleshed out the whole search HTTP query architecture, but ...
I mean, the problem with any broken-up-with-HTTP-continuations synchronous design is that HTTP requests themselves will still be processed synchronously, so while the search is generating its 100 results, the archiving will stop working.
You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.
A good async implementation would use WebSockets to return search results, solving both issues.
while the search is generating its 100 results, the archiving will stop working
Ah yes, this is true, but I am less concerned for now, since:
- archiving requests are not latency-sensitive; as long as throughput isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
- there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
- it's not yet clear what query latencies would be like in practice, so maybe they'd be low enough to not have to handle specially, and
- if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult
So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics
You can put the search into a separate OS thread, and query its state periodically instead, I suppose, but then it's hard to know when that thread should stop if the user closes the relevant page.
Possible! That thread could be limited to precomputing the next N results, with an eventual timeout -- but the first request must either be done synchronously or have special handling, which would be nice to avoid if possible
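Something like this is what I have in mind for the polled-thread variant (a rough sketch, all names made up):

```python
# Rough sketch: a worker thread precomputes up to `prefetch` results into a
# queue; if the client stops polling for `timeout` seconds, it gives up.
import queue
import threading
import time

class SearchJob:
    def __init__(self, results_iter, prefetch=100, timeout=60.0):
        self.queue = queue.Queue(maxsize=prefetch)
        self.timeout = timeout
        self.last_poll = time.monotonic()
        self._thread = threading.Thread(target=self._run, args=(results_iter,), daemon=True)
        self._thread.start()

    def _run(self, results_iter):
        for r in results_iter:
            while True:
                if time.monotonic() - self.last_poll > self.timeout:
                    return  # client went away; abandon the search
                try:
                    self.queue.put(r, timeout=1.0)
                    break
                except queue.Full:
                    pass  # queue is full; re-check the timeout and retry

    def poll(self, n=100):
        # called from the HTTP handler; drains up to n precomputed results
        self.last_poll = time.monotonic()
        out = []
        while len(out) < n:
            try:
                out.append(self.queue.get_nowait())
            except queue.Empty:
                break
        return out
```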
A good async implementation would use WebSockets to return search results, solving both issues.
I'm not familiar with how websockets would work with a Flask server -- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling in the actual HTTP server bit, which might complicate your plan to change to async. But if you are strongly in favour of a websocket implementation, it should also be not too much work to change a standard Flask-route-based implementation to use websockets, once that standard implementation exists :)
while the search is generating its 100 results, the archiving will stop working
- archiving requests are not latency-sensitive; as long as throughput isn't unduly affected, a tail latency increase is not ideal but also likely unnoticeable in practice
Depends on the search speed, I suppose. Having thousands of reqres waiting in extension memory would be annoying.
- there are workarounds like running multiple workers with uwsgi / gunicorn / etc (if, as I believe, the server is "stateless"), or running separate servers, one for archival requests and one for search requests
The server is stateful since archiving -> dump parsing -> indexing is stateful.
- if you're planning a sync -> async change, that's something that can likely be done without an extra search endpoint making it more difficult
Yes, which is why I put it off for later. :)
So I will focus for now on getting something that works, and reconsider this once I have a clearer idea of actual performance dynamics
Meanwhile, I'm actively working on cleaning up and publishing my file indexer.
I'm not familiar with how websockets would work with a Flask server
hoardy-web uses Bottle, not Flask; Flask is too complex for me.
(Bottle is too, a bit.
I would prefer a bare-HTTP "framework" with request dispatch instead of wrappers over WSGI/CGI/FCGI.
But it is the simplest thing I know of, ATM, so hoardy-web uses it.
Yes, I'm very opinionated.)
-- as I understand it, websockets are incompatible with the WSGI protocol, so they'd presumably need some extra handling with the actual HTTP server bit, which might complicate your plan to change to async.
WebSockets: you make an HTTP request, it ends with "101 Switching Protocols", and the rest of the connection is now a WebSockets connection.
The WebSockets protocol is, basically, message-based TCP, i.e. guaranteed order and delivery, though not as a plain byte stream but as separate typed messages.
But, since it's bidirectional, both sides can notice when the other disconnects or just stops working (there's a PING message type).
So, as to your statement: not really, search would simply spawn a separate thread (OS or async, does not matter) and quietly work away, talking to its own WebSocket. And immediately stop if that socket dies.
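E.g. with the third-party websockets package (not something hoardy-web uses, just to show the shape of it; run_search is a stub standing in for the real index lookup):

```python
# Sketch of the flow: the library handles the "101 Switching Protocols"
# upgrade and the PING keep-alives; a dead socket kills the search.
import asyncio
import websockets

async def run_search(query):
    # stub standing in for the real index lookup
    for i in range(3):
        yield f"result {i} for {query!r}"

async def handle_search(ws):
    query = await ws.recv()              # client sends its query first
    async for hit in run_search(query):
        await ws.send(hit)               # raises ConnectionClosed when the
                                         # client disconnects, ending the search

async def main():
    async with websockets.serve(handle_search, "127.0.0.1", 8765):
        await asyncio.Future()           # serve forever

asyncio.run(main())
```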
The server is stateful since archiving -> dump parsing -> indexing is stateful.
This is only stateful because the serve path's index is maintained in memory (in the SortedIndex), right?
Meanwhile, I'm actively working on cleaning up and publishing my file indexer.
Of course you may be planning to go in a completely different direction~ but just for reference, I have some prototype code implementing that interface via sqlite on disk in https://github.com/aidanholm/hoardy-web/commit/33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints
With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 0.5% overhead, serving startup is now "instant", and IIUC this would make the server stateless
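(The "instant" startup is just because nothing needs re-parsing on boot; hypothetically, a refresh pass only has to stat files and reindex the ones that changed. This is not what the linked commit literally does, just the idea:)

```python
# Hypothetical startup refresh: compare on-disk mtimes against the index and
# only re-parse WRR files that changed, instead of reading the whole store.
import os

def refresh_index(db, store_root):
    known = dict(db.execute("SELECT path, mtime FROM reqres"))
    for dirpath, _dirs, files in os.walk(store_root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime
            if known.get(path) != mtime:
                reindex_file(db, path, mtime)  # hypothetical: parse + upsert row
```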
hoardy-web uses Bottle, not Flask; Flask is too complex for me.
I can only agree :) I've used flask a fair bit at $WORK and found it deceptively simple (haven't tried Bottle yet) so this stance makes complete sense to me
The server is stateful since archiving -> dump parsing -> indexing is stateful.
This is only stateful because the serve path's index is maintained in memory (in the SortedIndex), right?
Yes, but even if it were not, replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I want those buttons to work even if the tab in question is not yet fully fetched: they would wait for everything to fetch and get archived, and then immediately switch to the replay. Which needs the replay to be synchronous with archival.
I have some prototype code implementing that interface via sqlite on disk in https://github.com/aidanholm/hoardy-web/commit/33b8b9d9d08a; this is still hacked together and assumes wrr only, only one archival dir, etc, but seems to be working nicely given those constraints
Yes, this is basically what I expect it would look like.
(Also, SortedIndex clearly needs a generic interface.)
With a very low-effort schema, I got around a 1.4 MB database for 275 MB of (hopefully representative) wrr files, so around 0.5% overhead, serving startup is now "instant", and IIUC this would make the server stateless
Your implementation is cute, but this won't work for WRR bundles and the like, as those need to be sub-file.
Also, full-text indexing will need indirections to deduplicate indexing of same-content data.
The complete version won't be as cute, unfortunately.
replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?
Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least some state)?
this won't work for WRR bundles and the like, as those need to be sub-file.
IIUC a wrr bundle is basically multiple wrr files directly byte-concatenated? An offset column could be added (and a size column, if it cannot be inferred by the wrrb loader).
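Something like this hypothetical schema is what I mean, where an index row addresses a byte range instead of a whole file:

```python
# Hypothetical sub-file index: (path, offset, size) addresses one record
# inside a .wrr file or .wrrb bundle, so replay can slice it out directly.
import sqlite3

db = sqlite3.connect("index.sqlite3")
db.execute(
    """CREATE TABLE IF NOT EXISTS reqres (
           path   TEXT NOT NULL,     -- .wrr file or .wrrb bundle
           offset INTEGER NOT NULL,  -- byte offset of the record in the file
           size   INTEGER NOT NULL,  -- record length
           url    TEXT NOT NULL,
           PRIMARY KEY (path, offset)
       )"""
)
db.commit()

def read_record(path, offset, size):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)  # the raw WRR record, ready for decoding
```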
Also, full-text indexing will need indirections to deduplicate indexing of same-content data.
I am currently playing around a bit with sqlite's full-text search -- it is possible to index text without storing a copy of the indexed content, and indexing response bodies only for textual response content types results in reasonably small indexes even without doing any content deduplication; I got index sizes of about 4% of the data store
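The trick is sqlite's contentless FTS5 tables; the inverted index is built but the text itself is never stored (table and column names here are made up):

```python
# Contentless FTS5: content='' keeps only the inverted index, not the text,
# which is where the small index sizes come from; rowid links each FTS entry
# back to the corresponding metadata row.
import sqlite3

db = sqlite3.connect("index.sqlite3")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS body_fts USING fts5(body, content='')")
# index a response body under the matching reqres rowid; the text is
# tokenized and discarded, only the index entries are kept
db.execute("INSERT INTO body_fts (rowid, body) VALUES (?, ?)", (1, "some response text"))
db.commit()
# contentless tables can only return rowids (there is no stored text to return)
hits = db.execute("SELECT rowid FROM body_fts WHERE body_fts MATCH ?", ("text",)).fetchall()
```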
replays have to be synchronous with dumping anyway, otherwise my plans for replay buttons in the extension will break.
I'm not sure if I understand correctly; do you mean that if there is an ongoing dumping request for a URL, then any replay request for that same URL should block until the dump completes?
No, that's a bit too strict. (And I'm not sure how one could make hoardy-web serve ever guarantee that.)
But, as a client, if you dump a new visit for a URL, the server says 200 OK, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.
Wouldn't this also require the replay server and the archival/dumping server to share the index (or share at least some state)?
Yes, but that's kind of the point of having them in the same process.
Which is why it needs to be properly async.
But, as a client, if you dump a new visit for a URL, the server says 200 OK, and hence you immediately ask for a replay of this same URL, the server should replay the latest version, not some version from before.
Ah I see, makes sense
Yes, but that's kind of the point of having them in the same process.
I am currently running hoardy_web_sas.py separately to hoardy_web serve; not sure how supported this configuration is in general, but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?
I am currently running hoardy_web_sas.py separately to hoardy_web serve; not sure how supported this configuration is in general,
It would work fine if you disable capture before going to a replay URL, otherwise replays would get archived too.
but I'd guess the extension's replay button feature would only work / be available when connected to a server with both archival and serving enabled?
Correct.