haystack-core-integrations

Add support for Reader API to convert HTMLs into Documents

Open bilgeyucel opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

There's no component to use Jina's Reader API with Haystack.

Describe the solution you'd like

A new JinaHTMLtoDocument (name TBD) component to use Jina's Reader API to convert URLs into Haystack Documents. This component should accept a URL and output a Haystack Document.

Describe alternatives you've considered

  • The component could output a Markdown file, and users could then use MarkdownConverter to bring that output into a pipeline (less intuitive for Haystack, but it might have advantages)
  • Depending on how the Reader API works, the component could accept a list of URLs and return a list of Haystack Documents
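A minimal sketch of the list-of-URLs alternative, for illustration only. It assumes the Reader API returns Markdown when the target URL is appended to https://r.jina.ai/. `Doc`, `fetch_markdown`, and `urls_to_docs` are hypothetical names, and `Doc` is only a stand-in for a Haystack Document so the sketch runs without Haystack installed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.request import urlopen

# Assumption: the Reader API is reached by prefixing the target URL with this endpoint.
READER_ENDPOINT = "https://r.jina.ai/"


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (content plus metadata)."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def fetch_markdown(url: str) -> str:
    """Request a Markdown rendering of `url` from the assumed Reader endpoint."""
    with urlopen(READER_ENDPOINT + url, timeout=30) as response:
        return response.read().decode("utf-8")


def urls_to_docs(
    urls: List[str],
    fetch: Callable[[str], str] = fetch_markdown,
) -> List[Doc]:
    """Convert each URL into one Doc and keep the source URL in its metadata."""
    return [Doc(content=fetch(url), meta={"url": url}) for url in urls]
```

The `fetch` parameter is injectable so the conversion logic can be exercised offline, for example `urls_to_docs(["https://example.com"], fetch=lambda u: "# stub")`.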


bilgeyucel avatar Apr 16 '24 09:04 bilgeyucel

I was the one who proposed this component. Unfortunately, I tried the service and it is quite unstable at the moment.

anakin87 avatar Apr 26 '24 07:04 anakin87

Hey there @bilgeyucel @anakin87!

Is this still a thing? I just toyed around with the API and got good results. Would be happy to knock this out if you guys think it's valuable.

jlonge4 avatar Oct 16 '24 23:10 jlonge4

@jlonge4 I think the API improved over time.

I see they now have different endpoints for converting a page into markdown, searching the web and grounding (experimental). What's your idea?

anakin87 avatar Oct 17 '24 06:10 anakin87

@anakin87 I think it's pretty cool. Do you think the existing LinkContentFetcher/Web Search components have too much overlap in functionality with it?

jlonge4 avatar Oct 17 '24 09:10 jlonge4

I would say it is just another nice option.

Are you thinking of a single component or more than one?

anakin87 avatar Oct 17 '24 09:10 anakin87

@anakin87 I agree! Would passing modes at init time to a single component make sense? Like reader = JinaReader(mode="read") or something to designate which endpoint to use.

jlonge4 avatar Oct 17 '24 09:10 jlonge4

I'm thinking of something like:

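from typing import Union

from haystack import Document, component
from haystack.utils import Secret


@component
class JinaReader:

    def __init__(
        self,
        # `mode` has no default, so it must come before `api_key`
        mode: Union[Mode, str],  # Mode: an Enum of the supported endpoints
        api_key: Secret = Secret.from_env_var("JINA_API_KEY"),
        # ... further parameters
    ):
        ...

    @component.output_types(document=Document)
    def run(self, input: str):
        # check input depending on mode
        ...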

Mode can be an Enum like this (with a convenient from_str method): https://github.com/deepset-ai/haystack-core-integrations/blob/ac0e4c2f8c8d0dce7a32e8e3a3fe74362b0686dd/integrations/nvidia/src/haystack_integrations/components/embedders/nvidia/truncate.py#L4
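A minimal sketch of what such an enum could look like. The member names `READ`, `SEARCH`, and `GROUND` and their string values are assumptions for illustration, not taken from the linked file.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical modes, one per Reader API endpoint."""

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "Mode":
        """Map a user-supplied string to a Mode, failing with a clear message."""
        enum_map = {member.value: member for member in cls}
        mode = enum_map.get(string)
        if mode is None:
            supported = ", ".join(enum_map)
            raise ValueError(f"Unknown mode '{string}'. Supported modes: {supported}")
        return mode
```

With this in place, `__init__` could accept either form and normalize it, for example `mode = Mode.from_str(mode) if isinstance(mode, str) else mode`.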

anakin87 avatar Oct 17 '24 13:10 anakin87

@anakin87 @jlonge4, what are the exact features of Jina Reader API now? I'm asking because we use reader components for extractive QA tasks, and I don't think the JinaReader component will fit well into that category. Does it make sense to name it JinaReaderConverter, maybe?

bilgeyucel avatar Oct 17 '24 13:10 bilgeyucel

  • Convert URL into Markdown
  • Search the web and convert results to Markdown
  • Ground a statement with web knowledge (only paid, haven't tried)

https://jina.ai/reader/
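As a rough sketch, the three features could map to one request URL per mode. The endpoint hosts below (r, s, and g under jina.ai) and the choice to percent-encode the whole input are assumptions to check against the Reader documentation. `ENDPOINTS` and `build_request_url` are hypothetical names.

```python
from urllib.parse import quote

# Assumed endpoints, one per feature listed above.
ENDPOINTS = {
    "read": "https://r.jina.ai/",    # convert a URL into Markdown
    "search": "https://s.jina.ai/",  # search the web, results as Markdown
    "ground": "https://g.jina.ai/",  # ground a statement with web knowledge
}


def build_request_url(mode: str, query: str) -> str:
    """Return the request URL for `query` (a URL, search query, or statement)."""
    if mode not in ENDPOINTS:
        raise ValueError(f"Unknown mode '{mode}'. Supported modes: {', '.join(ENDPOINTS)}")
    # Percent-encode everything so spaces and punctuation survive in the path.
    return ENDPOINTS[mode] + quote(query, safe="")
```

For example, `build_request_url("search", "What is Haystack?")` yields a URL under the assumed search endpoint with the query percent-encoded.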

anakin87 avatar Oct 17 '24 13:10 anakin87

@anakin87 looks great, I'll get it cooking asap! @bilgeyucel you have a great point, it definitely is more of a converter or fetcher vs a reader.

jlonge4 avatar Oct 17 '24 14:10 jlonge4

@jlonge4 I have added a tasklist to https://github.com/deepset-ai/haystack-core-integrations/issues/663#issue-2245585294.

Could you maybe help with opening a PR to mention the JinaReaderConnector in https://github.com/deepset-ai/haystack-integrations/blob/main/integrations/jina.md? (I see the focus is on embedding models, so maybe a brief mention + link to examples is OK)

anakin87 avatar Nov 21 '24 17:11 anakin87

@anakin87 you've got it, no problem 😎

jlonge4 avatar Nov 22 '24 04:11 jlonge4

@anakin87 https://github.com/deepset-ai/haystack-integrations/pull/288

jlonge4 avatar Nov 23 '24 17:11 jlonge4

Closing this issue.

(Only social media announcement is missing.) Added an item for this component to Weekly Announcements - https://github.com/deepset-ai/devrel-board/issues/533

anakin87 avatar Nov 28 '24 10:11 anakin87