Legacy-Research-Engine Refactoring importing process

Open blackforestboi opened this issue 8 years ago • 0 comments

Hey folks,

So I made a click dummy for the new importing process. Happy to hear your input on it; the wireframes can be changed quite quickly. It starts with the screen you would see after using "Import History & Bookmarks" in the sidebar. You can access it here: https://invis.io/KZ8XQZ1BR#/216813851_SETUP_-_Analyse_URLS

Bold elements can be retrieved via the Chrome API.

@aquibm wanted to do the front-end part with React.

@ShamariFeaster I think this is highly relevant for your refactoring

Regarding development, there are 7 modules that need to be built:

Accessing history URLs via the chrome.history API and storing them as "empty shells" (without any crawled content) in PouchDB, containing:
- Unique ID
- url
- title
- text: none (default)
- lastVisitTime
- History_Item_ID (from Chrome ID)
- array of visits
  - id
  - visitTime
- is_bookmark: false (default)
- bookmark_ID: none (default)
- bookmark_date_added: none (default)
- download_status: failed/successful/not_started
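A rough sketch of what such an "empty shell" could look like, written as a pure function so it can be tried outside the extension. The `HistoryItemLike` and `VisitItemLike` interfaces only mirror the fields that `chrome.history.search` and `chrome.history.getVisits` return; `PageShell`, `shellFromHistory`, and the URL-derived `_id` are assumptions for illustration, not an agreed schema.

```typescript
// Local stand-ins for the chrome.history fields used here (assumption:
// inside the extension these would come from the Chrome typings instead).
interface HistoryItemLike {
  id: string;
  url?: string;
  title?: string;
  lastVisitTime?: number;
}
interface VisitItemLike {
  visitId: string;
  visitTime?: number;
}

type DownloadStatus = "not_started" | "successful" | "failed";

// Hypothetical document shape following the field list above.
interface PageShell {
  _id: string; // "Unique ID"; deriving it from the URL is an assumption
  url: string;
  title: string;
  text: string | null;
  lastVisitTime: number | null;
  History_Item_ID: string | null;
  visits: { id: string; visitTime: number | null }[];
  is_bookmark: boolean;
  bookmark_ID: string | null;
  bookmark_date_added: number | null;
  download_status: DownloadStatus;
}

// Builds a shell without any crawled content; returns null for items without a URL.
function shellFromHistory(item: HistoryItemLike, visits: VisitItemLike[]): PageShell | null {
  if (!item.url) return null;
  return {
    _id: "page/" + encodeURIComponent(item.url),
    url: item.url,
    title: item.title ?? "",
    text: null,
    lastVisitTime: item.lastVisitTime ?? null,
    History_Item_ID: item.id,
    visits: visits.map((v) => ({ id: v.visitId, visitTime: v.visitTime ?? null })),
    is_bookmark: false,
    bookmark_ID: null,
    bookmark_date_added: null,
    download_status: "not_started",
  };
}
```

The resulting objects are plain JSON, so they could be handed to PouchDB's `put` or `bulkDocs` as they are.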
Accessing bookmark URLs via the chrome.bookmarks API and storing them as "empty shells" (without any crawled content) in PouchDB.
- First check whether the URL already exists. If yes, update:
  - is_bookmark: true
  - bookmark_ID: (from the Chrome bookmark)
  - bookmark_date_added: (from the Chrome bookmark)
- If no, add empty shell containing:
  - Unique ID
  - url
  - title
  - text: none (default)
  - lastVisitTime (empty by default)
  - History_Item_ID (from Chrome ID)
  - array of visits (empty for now)
    - id
    - visitTime
  - is_bookmark: true
  - bookmark_ID:
  - bookmark_date_added:
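The check-then-update logic above could be sketched roughly like this. A `Map` keyed by URL stands in for the PouchDB lookup, `BookmarkLike` mirrors the fields of a `chrome.bookmarks` tree node, and `Shell` and `mergeBookmark` are hypothetical names. Setting `History_Item_ID` to `null` for bookmark-only entries is an assumption, since no history item exists in that case.

```typescript
// Mirrors the chrome.bookmarks node fields used here; folders have no url.
interface BookmarkLike {
  id: string;
  title: string;
  url?: string;
  dateAdded?: number;
}

// Hypothetical shell shape (Unique ID left out to keep the sketch short).
interface Shell {
  url: string;
  title: string;
  text: string | null;
  lastVisitTime: number | null;
  History_Item_ID: string | null;
  visits: { id: string; visitTime: number | null }[];
  is_bookmark: boolean;
  bookmark_ID: string | null;
  bookmark_date_added: number | null;
}

// Marks an existing shell as bookmarked, or adds a new empty shell.
function mergeBookmark(shells: Map<string, Shell>, bm: BookmarkLike): void {
  if (!bm.url) return; // skip folders
  const existing = shells.get(bm.url);
  if (existing) {
    existing.is_bookmark = true;
    existing.bookmark_ID = bm.id;
    existing.bookmark_date_added = bm.dateAdded ?? null;
    return;
  }
  shells.set(bm.url, {
    url: bm.url,
    title: bm.title,
    text: null,
    lastVisitTime: null,
    History_Item_ID: null, // assumption: no history item for this URL
    visits: [],
    is_bookmark: true,
    bookmark_ID: bm.id,
    bookmark_date_added: bm.dateAdded ?? null,
  });
}
```

Matching on the URL keeps this independent of whether history IDs and bookmark IDs turn out to be related.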
Module that checks file type and directs URL to fitting download modules.
- HTML Document
- PDF
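One possible way to do this routing, as a sketch only: prefer the `Content-Type` header if one is known, and fall back to the file extension otherwise. `pickDownloader` and the `"unsupported"` bucket are hypothetical, and treating extension-less URLs as HTML is an assumption.

```typescript
type DownloaderKind = "html" | "pdf" | "unsupported";

// Decides which download module should handle a URL.
function pickDownloader(url: string, contentType?: string): DownloaderKind {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mime = (contentType ?? "").split(";")[0].trim().toLowerCase();
  if (mime === "application/pdf") return "pdf";
  if (mime === "text/html" || mime === "application/xhtml+xml") return "html";
  if (mime !== "") return "unsupported";

  // No Content-Type known: look at the path without query string or fragment.
  const path = url.split(/[?#]/)[0].toLowerCase();
  if (/\.pdf$/.test(path)) return "pdf";
  return "html"; // assumption: anything else is treated as a web page
}
```

For example, `pickDownloader("https://example.com/paper.pdf?dl=1")` goes to the PDF module, while a response with `Content-Type: image/png` is skipped as unsupported.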
Module that downloads HTML via XMLHttpRequest
- Update the data stored for the URL
  - text
Module that downloads PDFs
- Update the data stored for the URL
  - text
Module that ensures the download continues if errors come up while downloading (including error logging).
- At the moment, some issues stop the download entirely. This should not happen. For example:
  - Security warnings (phishing websites)
  - Response times that are too long (implement a timeout)
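A minimal sketch of how the import loop could survive errors and slow responses. The `download` callback stands in for the HTML or PDF module, and `withTimeout`, `downloadAll`, and `Outcome` are hypothetical names. Note that the timeout here only stops waiting; it does not abort the underlying request.

```typescript
type Outcome =
  | { url: string; status: "successful" }
  | { url: string; status: "failed"; error: string };

// Rejects if the given promise does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Downloads URLs one by one; a failure is logged and the loop moves on.
async function downloadAll(
  urls: string[],
  download: (url: string) => Promise<string>,
  timeoutMs: number
): Promise<Outcome[]> {
  const outcomes: Outcome[] = [];
  for (const url of urls) {
    try {
      await withTimeout(download(url), timeoutMs);
      outcomes.push({ url, status: "successful" });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ url, status: "failed", error: message });
    }
  }
  return outcomes;
}
```

The returned outcomes map directly onto the `download_status` field and give the error message that should end up in the log.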
Module that keeps track of all URLs that were already downloaded successfully, so that a download can easily be interrupted (browser restart, pausing, heavy error) and resumed, progress can be logged during the import process, and we can check whether a URL was already imported before restarting the import process.
- Store a list of all URLs that were successful
  - When importing, check whether the URL was already downloaded
- Store a list of all URLs that failed, including error messages
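The bookkeeping could be as small as the following sketch. `ImportLog` and the helper functions are hypothetical; the log is plain JSON, so it could live in a PouchDB document. Retrying failed URLs on the next run is an assumption.

```typescript
interface ImportLog {
  successful: string[];
  failed: { url: string; error: string }[];
}

// Records a successful download and clears any earlier failure for that URL.
function recordSuccess(log: ImportLog, url: string): void {
  if (log.successful.indexOf(url) === -1) log.successful.push(url);
  log.failed = log.failed.filter((entry) => entry.url !== url);
}

// Records a failure with its error message, replacing an older entry.
function recordFailure(log: ImportLog, url: string, error: string): void {
  log.failed = log.failed.filter((entry) => entry.url !== url);
  log.failed.push({ url, error });
}

// URLs that still need downloading when the import is (re)started.
function pendingUrls(allUrls: string[], log: ImportLog): string[] {
  const done = new Set(log.successful);
  return allUrls.filter((url) => !done.has(url));
}
```

After a browser restart, calling `pendingUrls` with the full URL list and the stored log gives the set to continue with.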

Open questions:

Are the history ID and the bookmark ID the same? If yes, we could check for the bookmark ID directly in the request to the history API and store the is_bookmark (etc.) fields with it.

Jan 31 '17 21:01 blackforestboi