haystack-core-integrations
Add support for Reader API to convert HTMLs into Documents
Is your feature request related to a problem? Please describe.
There's no component to use Jina's Reader API with Haystack.
Describe the solution you'd like
A new JinaHTMLtoDocument (name TBD) component that uses Jina's Reader API to convert URLs into Haystack Documents. This component should accept a URL and output a Haystack Document.
Describe alternatives you've considered
- The component could output a Markdown file, which users would then pass to MarkdownConverter in a pipeline (less intuitive in Haystack, but it might have advantages)
- Depending on how the Reader API works, the component could accept a list of URLs and return a list of Haystack Documents
I was the one who proposed this component. Unfortunately, I tried the service and it is quite unstable at the moment.
Hey there @bilgeyucel @anakin87!
Is this still a thing? I just toyed around with the API and got good results. Would be happy to knock this out if you guys think it's valuable.
@jlonge4 I think the API improved over time.
I see they now have different endpoints for converting a page into markdown, searching the web and grounding (experimental). What's your idea?
@anakin87 I think it's pretty cool. Do you think the existing LinkContentFetcher/Web Search components have too much overlap in functionality with it?
I would say it is just another nice option.
Are you thinking of a single component or more than one?
@anakin87 I agree! Would passing modes at init time to a single component make sense?
Like reader = JinaReader(mode="read") or something to designate which endpoint to use.
I'm thinking of something like:
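from typing import Union

from haystack import Document, component
from haystack.utils import Secret


@component
class JinaReader:
    def __init__(
        self,
        # Required arguments must precede defaulted ones; Mode is the Enum described below.
        mode: Union[Mode, str],
        api_key: Secret = Secret.from_env_var("JINA_API_KEY"),
        # ... other parameters
    ):
        ...

    @component.output_types(document=Document)
    def run(self, input: str):
        # check input depending on mode
        ...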
Mode can be an Enum like this (with a convenient from_str method): https://github.com/deepset-ai/haystack-core-integrations/blob/ac0e4c2f8c8d0dce7a32e8e3a3fe74362b0686dd/integrations/nvidia/src/haystack_integrations/components/embedders/nvidia/truncate.py#L4
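For illustration, such an Enum might look roughly like the sketch below. The class name `JinaReaderMode` and its members are hypothetical: only "read" is mentioned in this thread, and "search" and "ground" are assumed from the endpoints discussed above. The `from_str` helper follows the general pattern of the linked file rather than copying it.

```python
from enum import Enum


class JinaReaderMode(Enum):
    """Hypothetical mode Enum; member names are assumptions for illustration."""

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "JinaReaderMode":
        # Map each value ("read", "search", ...) to its Enum member.
        enum_map = {member.value: member for member in cls}
        mode = enum_map.get(string)
        if mode is None:
            msg = f"Unknown mode '{string}'. Supported modes are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return mode
```

With this in place, `__init__` could accept either form and normalize it, e.g. `mode = JinaReaderMode.from_str(mode) if isinstance(mode, str) else mode`.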
@anakin87 @jlonge4, what are the exact features of Jina Reader API now? I'm asking because we use reader components for extractive QA tasks, and I don't think the JinaReader component will fit well into that category. Does it make sense to name it JinaReaderConverter, maybe?
- Convert URL into Markdown
- Search the web and convert results to Markdown
- Ground a statement with web knowledge (only paid, haven't tried)
https://jina.ai/reader/
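As a rough sketch of how the first two features could map onto requests: the snippet below builds the request URL for a given mode. The host names `r.jina.ai` and `s.jina.ai` are assumptions taken from Jina's public Reader page and should be checked against the current docs. The `ENDPOINTS` dict and `build_request_url` helper are hypothetical names, and grounding is left out because nobody in this thread has tried it.

```python
from urllib.parse import quote

# Assumed endpoint prefixes; verify against https://jina.ai/reader/ before relying on them.
ENDPOINTS = {
    "read": "https://r.jina.ai/",    # convert a URL into Markdown
    "search": "https://s.jina.ai/",  # search the web, results returned as Markdown
}


def build_request_url(mode: str, text: str) -> str:
    """Return the URL to request for a mode and an input (a page URL or a search query)."""
    if mode not in ENDPOINTS:
        msg = f"Unsupported mode '{mode}'. Supported modes are: {list(ENDPOINTS)}"
        raise ValueError(msg)
    # Percent-encode the input so spaces and reserved characters survive in the path.
    return ENDPOINTS[mode] + quote(text)
```

A component's `run` method could then send a GET request to this URL with the API key as a bearer token and wrap the response text in a Haystack `Document`. That part is not shown here because the exact response format is not described in this thread.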
@anakin87 looks great, I'll get it cooking asap! @bilgeyucel you have a great point, it definitely is more of a converter or fetcher vs a reader.
@jlonge4 I have added a tasklist to https://github.com/deepset-ai/haystack-core-integrations/issues/663#issue-2245585294.
Could you maybe help with opening a PR to mention the JinaReaderConnector in https://github.com/deepset-ai/haystack-integrations/blob/main/integrations/jina.md?
(I see the focus is on embedding models, so maybe a brief mention + link to examples is OK)
@anakin87 you've got it, no problem 😎
@anakin87 https://github.com/deepset-ai/haystack-integrations/pull/288
Closing this issue.
(Only social media announcement is missing.) Added an item for this component to Weekly Announcements - https://github.com/deepset-ai/devrel-board/issues/533