Legacy-Research-Engine
Refactoring importing process
Hey folks,
So I made a clickdummy for the new importing process. Happy to hear your input on it; the wireframes can be changed quite quickly. It starts with the screen you would see after using "Import History & Bookmarks" in the sidebar. You can access it here: https://invis.io/KZ8XQZ1BR#/216813851_SETUP_-_Analyse_URLS
Bold elements can be retrieved via the Chrome API.
@aquibm wanted to do the front-end part with React.
@ShamariFeaster I think this is highly relevant for your refactoring
Regarding development, there are 7 modules that need to be built:
- Accessing history URLs via the chrome.history API and storing them as "empty shells" (without any crawled content) in the PouchDB, containing:
  - Unique ID
  - url
  - title
  - text: none (default)
  - lastVisitTime
  - History_Item_ID (from Chrome ID)
  - array of visits
    - id
    - visitTime
  - is_bookmark: false (default)
  - bookmark_ID: none (default)
  - bookmark_date_added: none (default)
  - download_status: failed/successful/not_started
- Accessing bookmark URLs via the chrome.bookmarks API and storing them as "empty shells" (without any crawled content) in the PouchDB.
  - Check first whether the URL already exists. If yes, update:
    - is_bookmark: true
    - bookmark_ID: none (default)
    - bookmark_date_added: none (default)
  - If no, add an empty shell containing:
    - Unique ID
    - url
    - title
    - text: none (default)
    - lastVisitTime (empty by default)
    - History_Item_ID (from Chrome ID)
    - array of visits (empty for now)
      - id
      - visitTime
    - is_bookmark: true
    - bookmark_ID:
    - bookmark_date_added:
- Module that checks the file type and directs each URL to the fitting download module.
  - HTML document
- Module that downloads HTML via XMLHttpRequest
  - Update the data stored for the URL:
    - text
- Module that downloads PDFs
  - Update the data stored for the URL:
    - text
- Module that ensures the download continues if errors come up while downloading (including error logging).
  - Some issues currently stop the download entirely. This should not happen. For example:
    - Security warnings (phishing websites)
    - Response times that are too long (implement a timeout)
- Module that keeps track of all URLs that were already downloaded successfully and logs the import process, so that an import can easily be interrupted (browser restart, pausing, a severe error) and so that we can check whether a URL was already imported before restarting the import process.
  - Store a list of all URLs that were successful
    - When importing, check whether the URL was already downloaded
  - Store a list of all URLs that failed, including error messages
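
To make the last two modules a bit more concrete, here is a rough sketch of how the "keep going on errors" and "remember what was already downloaded" parts could fit together. Everything in it is an assumption for illustration: the names `ImportLog`, `pendingUrls`, `withTimeout` and `runImport` are made up, the in-memory `Map` only stands in for whatever we persist in PouchDB, and the `download` callback is a placeholder for the HTML/PDF modules.

```typescript
type DownloadStatus = "not_started" | "successful" | "failed";

interface ImportLogEntry {
  status: DownloadStatus;
  error?: string; // error message for failed downloads
}

// Hypothetical in-memory log keyed by URL; the real one would be persisted.
type ImportLog = Map<string, ImportLogEntry>;

// URLs that still need work: everything not yet marked as successful.
function pendingUrls(urls: string[], log: ImportLog): string[] {
  return urls.filter((url) => log.get(url)?.status !== "successful");
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Processes URLs one by one. A failure or timeout is logged and the loop
// moves on to the next URL instead of stopping the whole import.
async function runImport(
  urls: string[],
  download: (url: string) => Promise<unknown>, // placeholder for the download modules
  log: ImportLog,
  timeoutMs: number = 10000,
): Promise<void> {
  for (const url of pendingUrls(urls, log)) {
    try {
      await withTimeout(download(url), timeoutMs);
      log.set(url, { status: "successful" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      log.set(url, { status: "failed", error: message });
    }
  }
}
```

Since `runImport` only looks at `pendingUrls`, calling it again after a browser restart would skip everything already marked successful and retry the failed ones, provided the log was stored somewhere durable in between.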
Open questions:
- Are the history ID and the bookmark ID the same? If yes, we could check for the bookmark ID directly in the request to the history API and store the `is_bookmark` (etc.) fields with it.
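
In case the two IDs turn out not to line up, one fallback would be to match history entries and bookmarks by URL. Below is a minimal sketch of that idea using the shell fields listed above. The `HistoryItemLike` and `BookmarkNodeLike` types are cut-down stand-ins for what chrome.history and chrome.bookmarks return (only the fields used here), and `PageShell`, `emptyShell`, `shellFromHistory` and `applyBookmark` are hypothetical names. Using the URL as `_id` is also just an assumption to keep the example short.

```typescript
// Stand-ins for chrome.history.HistoryItem and chrome.bookmarks.BookmarkTreeNode.
interface HistoryItemLike { id: string; url?: string; title?: string; lastVisitTime?: number; }
interface BookmarkNodeLike { id: string; url?: string; title: string; dateAdded?: number; }

interface Visit { id: string; visitTime: number; }

// The "empty shell" document, using the field names from the list above.
interface PageShell {
  _id: string; // assumption: the URL doubles as the unique ID
  url: string;
  title: string;
  text: string | null;
  lastVisitTime: number | null;
  History_Item_ID: string | null;
  visits: Visit[];
  is_bookmark: boolean;
  bookmark_ID: string | null;
  bookmark_date_added: number | null;
  download_status: "not_started" | "successful" | "failed";
}

function emptyShell(url: string, title: string): PageShell {
  return {
    _id: url, url, title,
    text: null, lastVisitTime: null, History_Item_ID: null, visits: [],
    is_bookmark: false, bookmark_ID: null, bookmark_date_added: null,
    download_status: "not_started",
  };
}

// Builds a shell from a history entry; entries without a URL are skipped.
function shellFromHistory(item: HistoryItemLike): PageShell | null {
  if (!item.url) return null;
  const shell = emptyShell(item.url, item.title ?? "");
  shell.lastVisitTime = item.lastVisitTime ?? null;
  shell.History_Item_ID = item.id;
  return shell;
}

// Marks an existing shell as bookmarked, or adds a new shell if the URL is
// not known yet. Matching is done on the URL, not on any ID.
function applyBookmark(shells: Map<string, PageShell>, node: BookmarkNodeLike): void {
  if (!node.url) return; // bookmark folders have no URL
  const shell = shells.get(node.url) ?? emptyShell(node.url, node.title);
  shell.is_bookmark = true;
  shell.bookmark_ID = node.id;
  shell.bookmark_date_added = node.dateAdded ?? null;
  shells.set(node.url, shell);
}
```

With this approach the history import and the bookmark import could run in either order, since both end up keyed on the same URL.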