Provide full-text search of documents described in CDXJ
At one point, https://github.com/ipfs-search/ipfs-search allowed searching the content of dereferenced payloads described by a set of IPFS hashes. Because an IPFS hash describes content that never changes, each payload needs to be indexed only once to remain searchable.
One initial target audience for ipwb is users with smaller collections of archival content, identified through our conventional CDXJ indexing and replay procedure.
Provide a mechanism to allow a user to search the contents of the payloads identified in the currently loaded CDXJ. @ibnesayeed Please provide insights on this.
I think I mentioned ipfs-search offline after which you created this issue, but I forgot all the points I put forward at that time. In general, I think running an optional instance of it and adding a hook in the indexer script will be great. However, putting this all together for a more flexible and coherent setup would require some rework in the indexer script (which is not very scalable as it currently is).
@ibnesayeed It's outside of the scope of this ticket but I think it would be worthwhile to identify what is needed to make the indexer script more scalable in a separate issue.
If you recall or come up with anything regarding approaches to accomplishing the goals of this ticket, please document it here. It would be good to get this right the first time instead of coming up with a working implementation only to have to rewrite it in favor of a better approach.
Our current indexer is tailored to work well with smaller collections while minimizing friction for users, which comes at the cost of not being very scalable. I think there are some common pieces that can be extracted, and then there can be more than one wrapper script to handle different situations. However, we can discuss that in a separate ticket as you suggested.
Getting back to this topic, I think we can broadly see two approaches to indexing archival content for full-text searching:
- Run a monolithic system with `ipfs-search` along with other processes such as the IPFS daemon on the same host, and index content during the initial CDXJ indexing process while iterating over WARC records. This approach would be slow and not scalable, but a lot of the complexity can be hidden from the user, and searching will be available as soon as the WARCs are processed.
- Run the `ipfs-search` system as an add-on service while the rest of the system still functions as normal without it. In this approach, we can use CDXJ files to decide which records need to be indexed, then fetch those records from IPFS rather than consulting WARC files. This way, full-text search indexing is performed asynchronously, incrementally making more and more of the content searchable. The replay system needs to be configured to check for the optional full-text search service if present.
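To make the second approach a bit more concrete, here is a rough sketch of the "use CDXJ files to decide which records need to be indexed" step. It is only an illustration under assumptions: I am assuming each CDXJ record ends with a JSON block whose `locator` looks like `urn:ipfs/<header hash>/<payload hash>`, and that metadata lines start with `!`. The names `payload_hash` and `pending_payloads` are hypothetical and not part of ipwb or ipfs-search. Fetching from IPFS and submitting to the search service are deliberately left out.

```python
import json
from typing import Iterable, Iterator, Optional

# Assumption: locators in the CDXJ look like "urn:ipfs/<header>/<payload>".
LOCATOR_PREFIX = "urn:ipfs/"


def payload_hash(cdxj_line: str) -> Optional[str]:
    """Return the payload hash from one CDXJ record line, or None.

    Hypothetical helper: blank lines, metadata lines (starting with "!"),
    and lines without a parsable locator are skipped.
    """
    line = cdxj_line.strip()
    if not line or line.startswith("!"):
        return None
    brace = line.find("{")
    if brace == -1:
        return None
    try:
        fields = json.loads(line[brace:])
    except json.JSONDecodeError:
        return None
    locator = fields.get("locator", "")
    if not locator.startswith(LOCATOR_PREFIX):
        return None
    parts = locator[len(LOCATOR_PREFIX):].split("/")
    # Expect exactly a header hash followed by a payload hash.
    if len(parts) != 2 or not parts[1]:
        return None
    return parts[1]


def pending_payloads(cdxj_lines: Iterable[str],
                     already_indexed: Iterable[str]) -> Iterator[str]:
    """Yield each payload hash that still needs full-text indexing.

    Content behind an IPFS hash does not change, so a hash that was
    indexed once (or already yielded in this run) is never yielded again.
    """
    seen = set(already_indexed)
    for line in cdxj_lines:
        digest = payload_hash(line)
        if digest and digest not in seen:
            seen.add(digest)
            yield digest
```

An asynchronous worker could loop over `pending_payloads(open(cdxj_path), indexed_hashes)`, fetch each payload from IPFS, and hand it to the search service, all without touching the WARCs or blocking replay.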
I prefer the second approach, particularly adding the search ability as an add-on. Incremental async indexing for search is also an interesting idea, and it seems like it would be a more scalable solution than consulting the WARCs.