helix
check URLs have text
Some URLs are rendered entirely by JavaScript and break unstructured. We need a better error for these, e.g. https://www.reuters.com/legal/colorado-ballot-case-adds-fuel-trumps-nomination-drive-2023-12-20/
Yeah, also if we don't get any text then we shouldn't start the finetuning process. Just ask the user to paste the text in instead: "sorry we couldn't extract any text from that URL, please copy and paste it instead"
I tried to train on a published Notion page and it extracted no data at all. Mixtral then hallucinated loads of QA pairs about photosynthesis, deep learning and other random shit. We should catch this before we even start training: if there's no text in the training set, just throw an error and ask the user to either add more documents or report to us the problem extracting text from the given documents/URLs.
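Rough sketch of the check, in case it helps. This is not the existing helix code: every name here (`has_usable_text`, `split_sources`, `require_training_text`, `EmptyTrainingSetError`) is made up for illustration, and the 200-character threshold is a guess that would need tuning. It assumes extraction has already produced a mapping of source (URL or filename) to extracted text.

```python
from typing import Dict, List, Optional, Tuple

# Message to show for a single URL that yielded nothing (wording from this thread).
PASTE_INSTEAD = (
    "sorry we couldn't extract any text from that URL, "
    "please copy and paste it instead"
)

# Assumed threshold: anything with fewer non-whitespace characters counts as "no text".
MIN_CHARS = 200


class EmptyTrainingSetError(ValueError):
    """Hypothetical error raised when no source produced usable text."""


def has_usable_text(text: Optional[str], min_chars: int = MIN_CHARS) -> bool:
    """True if text has at least min_chars characters once whitespace is removed."""
    if not text:
        return False
    return len("".join(text.split())) >= min_chars


def split_sources(
    extracted: Dict[str, Optional[str]], min_chars: int = MIN_CHARS
) -> Tuple[Dict[str, str], List[str]]:
    """Split sources into those with usable text and those that failed."""
    usable: Dict[str, str] = {}
    failed: List[str] = []
    for source, text in extracted.items():
        if text is not None and has_usable_text(text, min_chars):
            usable[source] = text
        else:
            failed.append(source)
    return usable, failed


def require_training_text(
    extracted: Dict[str, Optional[str]], min_chars: int = MIN_CHARS
) -> Dict[str, str]:
    """Return the usable sources, or raise before any QA generation or training starts."""
    usable, failed = split_sources(extracted, min_chars)
    if not usable:
        names = ", ".join(failed) or "(no sources given)"
        raise EmptyTrainingSetError(
            "No text could be extracted from: " + names + ". "
            "Please add more documents or report the extraction issue."
        )
    return usable
```

The idea would be to call `require_training_text` right after extraction and before QA pair generation, so an empty Notion page or a JavaScript-only URL fails fast instead of reaching Mixtral. The `failed` list from `split_sources` could drive the per-URL `PASTE_INSTEAD` message when only some of the sources are empty.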