Crawling & Indexing content from external websites

Open ArtificialOwl opened this issue 3 years ago • 0 comments

Notes:

This will be managed by a specific app.
Websites and pages will be identified through their sitemap.
Crawling of sub-pages will follow a local configuration that sets the allowed number of hops, depending on whether the linked page is hosted locally (same domain), on a different subdomain, or on a completely different domain.
External content has no id within Nextcloud's database, and crawling a single address can return multiple documents.
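
To make the hop rule above concrete, here is a minimal sketch in Python (not the app's actual code). Everything named here is an assumption for illustration: the `HopLimits` fields, their default values, and the helper names are hypothetical, and the "same registrable domain" check naively compares the last two host labels instead of using the Public Suffix List.

```python
from dataclasses import dataclass
from urllib.parse import urljoin, urlsplit


@dataclass(frozen=True)
class HopLimits:
    """Hypothetical per-scope hop limits; field names and defaults are illustrative."""
    same_host: int = 3   # links staying on the same host
    subdomain: int = 1   # links to another subdomain of the same site
    external: int = 0    # links to a completely different domain


def registrable_domain(host: str) -> str:
    # Naive assumption: the last two labels form the registrable domain.
    # This is wrong for suffixes such as "co.uk"; a real crawler would
    # consult the Public Suffix List.
    return ".".join(host.lower().split(".")[-2:])


def link_scope(origin_url: str, link_url: str) -> str:
    """Classify a link as 'same_host', 'subdomain' or 'external'."""
    # Resolve relative links against the page they were found on.
    absolute = urljoin(origin_url, link_url)
    origin = (urlsplit(origin_url).hostname or "").lower()
    target = (urlsplit(absolute).hostname or "").lower()
    if origin == target:
        return "same_host"
    if registrable_domain(origin) == registrable_domain(target):
        return "subdomain"
    return "external"


def may_follow(origin_url: str, link_url: str, hops_so_far: int,
               limits: HopLimits) -> bool:
    """Return True if following the link stays within the configured hop limit.

    hops_so_far is the number of hops already taken to reach origin_url.
    """
    scope = link_scope(origin_url, link_url)
    return hops_so_far < getattr(limits, scope)
```

With these illustrative defaults, a link to another page on the same host is followed up to three hops deep, a link to a sibling subdomain only from the start page, and a link to a different domain never.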

Tasks:

[ ] Allowing an app to reach the Search Platform in use (fulltextsearch_elasticsearch) directly, without passing through the FullTextSearch index table (core)
[ ] Crawling websites and sub-pages (app)
[ ] Extracting content and metadata for each page (app)
[ ] Indexing content and metadata (app)
[ ] Searching and Advanced Searching within content and metadata (core+app)
[ ] Results should link to the right page (app, might need some work in core to force opening in a different tab)
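
As a rough illustration of the extraction task, the sketch below pulls a title, `<meta name=...>` values, link targets, and visible text out of an HTML string using only Python's standard `html.parser`. It is an assumption-laden sketch rather than a proposed implementation: the `PageExtractor` class, the `extract` helper, and the shape of the returned dict are all hypothetical, and fetching, character-set detection, and the actual indexing call are left out.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects title, meta tags, links and visible text (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str) -> dict:
    """Return the fields a crawler might hand to the indexer (hypothetical shape)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }
```

The collected `links` are the raw `href` values, so they would still need to be resolved against the page URL before being checked against the hop configuration.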

Jan 19 '22 12:01 ArtificialOwl