
check URLs have text

Open binocarlos opened this issue 1 year ago • 2 comments

Some URLs serve pages that are just JavaScript, and these break unstructured. We need a better error for this case. Example: https://www.reuters.com/legal/colorado-ballot-case-adds-fuel-trumps-nomination-drive-2023-12-20/

binocarlos, Jan 03 '24 12:01

Yeah, also: if we don't get any text, then don't start the finetuning process. Just ask the user to paste the text in instead: "sorry we couldn't extract any text from that URL, please copy and paste it instead"

lukemarsden, Jan 09 '24 14:01

I tried to train on a published Notion page and it extracted no data at all. Mixtral then hallucinated loads of QA pairs about photosynthesis, deep learning and other random shit. We should catch this before we even start training: if there's no text in the training set, just throw an error and ask the user to either add more documents or report to us the problem extracting text from the given documents/URLs.

lukemarsden, Feb 05 '24 10:02
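
For illustration, here is a minimal sketch of the check discussed above, assuming the extraction step yields a mapping from each source (URL or document name) to its extracted text. The names `NoTextExtractedError`, `split_by_text` and `check_training_set` are hypothetical and not taken from the Helix codebase; the error wording is adapted from the message suggested in the thread.

```python
from typing import Dict, List, Optional, Tuple


class NoTextExtractedError(ValueError):
    """Hypothetical error: no source in the training set produced any text."""


def split_by_text(
    extracted: Dict[str, Optional[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Separate sources that yielded text from those that yielded nothing.

    `extracted` maps a source (URL or document name) to its extracted text.
    Missing (None) or whitespace-only text counts as empty.
    """
    usable: Dict[str, str] = {}
    empty: List[str] = []
    for source, text in extracted.items():
        if text and text.strip():
            usable[source] = text
        else:
            empty.append(source)
    return usable, empty


def check_training_set(extracted: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return only the usable sources; raise before training if there are none."""
    usable, empty = split_by_text(extracted)
    if not usable:
        sources = ", ".join(empty) or "(no documents given)"
        raise NoTextExtractedError(
            "Sorry, we couldn't extract any text from: " + sources + ". "
            "Please copy and paste the text instead, add more documents, "
            "or report the extraction problem to us."
        )
    return usable
```

Calling `check_training_set` before QA-pair generation would stop the run as soon as every source comes back empty, so the model never gets a chance to invent content. Whether a partially empty set should warn or fail is a design choice this sketch leaves open; here, empty sources are simply dropped as long as at least one source has text.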