
Is it possible to use the ArXiv API instead of perplexity?

Open LuOsorio opened this issue 9 months ago • 7 comments

In the "search_api" configuration, is it possible to use arXiv to retrieve scientific papers and use that information instead of a regular web search?

LuOsorio avatar Feb 24 '25 13:02 LuOsorio

I've added EXA as a search API, and arXiv and PubMed as separate tools. It's super easy to integrate.

Here's the arXiv tool documentation:
https://python.langchain.com/docs/integrations/tools/arxiv/

Make sure that the response is formatted in the structure expected by deduplicate_and_format_sources.

Below is an example from my implementation:

configuration.py

...
class SearchAPI(Enum):
    PERPLEXITY = "perplexity"
    TAVILY = "tavily"
    EXA = "exa"
...

graph.py

This logic appears in multiple places:

...
    # Search the web
    if search_api == "tavily":
        search_results = await tavily_search_async(query_list)
        source_str = deduplicate_and_format_sources(search_results, max_tokens_per_source=1000, include_raw_content=False)
    elif search_api == "perplexity":
        search_results = perplexity_search(query_list)
        source_str = deduplicate_and_format_sources(search_results, max_tokens_per_source=1000, include_raw_content=False)
    elif search_api == "exa":
        search_results = await exa_search(query_list)
        source_str = deduplicate_and_format_sources(search_results, max_tokens_per_source=1000, include_raw_content=False)
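Since the same branch is repeated in several places, one option is to route every backend through a single lookup. The following is only a sketch of that idea, not code from the repository: `run_search`, `BACKENDS`, and the two stand-in search functions are hypothetical names, and the real `tavily_search_async`, `perplexity_search`, and `exa_search` would take their place. It handles both sync and async backends, and raises for an unknown `search_api` instead of leaving `search_results` undefined.

```python
import asyncio
import inspect


async def run_search(search_api, query_list, backends):
    """Look up the backend registered for `search_api` and run it.

    Works for both plain functions and coroutine functions: if the call
    returns an awaitable, it is awaited before returning.
    """
    try:
        search_fn = backends[search_api]
    except KeyError:
        raise ValueError(f"Unsupported search API: {search_api}") from None

    results = search_fn(query_list)
    if inspect.isawaitable(results):
        results = await results
    return results


# Hypothetical stand-ins for the real search functions (assumptions for
# illustration only; they return empty result lists).
async def fake_async_search(query_list):
    return [{"query": q, "results": []} for q in query_list]


def fake_sync_search(query_list):
    return [{"query": q, "results": []} for q in query_list]


# Registry mapping the configured name to a search function. Adding a new
# backend such as arXiv would then be a single new entry here.
BACKENDS = {
    "tavily": fake_async_search,
    "perplexity": fake_sync_search,
}
```

With something like this, each call site would reduce to `search_results = await run_search(search_api, query_list, BACKENDS)` followed by the existing `deduplicate_and_format_sources(...)` call.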

In your arxiv_search method, ensure that the returned structure matches the format expected by deduplicate_and_format_sources:

"""
...
    Args:
        search_queries (List[SearchQuery]): List of search queries to process

    Returns:
        List[dict]: List of search responses, one per query. Each response should have the format:
            {
                'query': str,                    # The original search query
                'follow_up_questions': None,      
                'answer': None,
                'images': list,
                'results': [                     # List of search results
                    {
                        'title': str,            # Title of the search result
                        'url': str,              # URL of the result
                        'content': str,          # Summary/snippet of the content
                        'score': float,          # Relevance score
                        'raw_content': str|None  # Full content or None for secondary citations
                    },
                    ...
                ]
            }
...
"""

# Your search logic
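To make the expected shape concrete, here is a minimal sketch of what the search logic could look like. It is not the implementation from my PR: it calls the public arXiv API endpoint (`http://export.arxiv.org/api/query`, which returns an Atom feed) using only the standard library, and `parse_arxiv_feed` is a hypothetical helper name. It also assumes plain string queries rather than `SearchQuery` objects, uses a rank-based placeholder for `score` because arXiv returns no relevance score, and puts the abstract in both `content` and `raw_content`.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed
ARXIV_API = "http://export.arxiv.org/api/query"


def parse_arxiv_feed(query, feed_xml):
    """Convert one arXiv Atom feed into the response dict described above."""
    root = ET.fromstring(feed_xml)
    entries = root.findall(f"{ATOM}entry")
    results = []
    for rank, entry in enumerate(entries):
        # Titles and abstracts contain hard line breaks; collapse whitespace.
        title = " ".join((entry.findtext(f"{ATOM}title") or "").split())
        summary = " ".join((entry.findtext(f"{ATOM}summary") or "").split())
        url = (entry.findtext(f"{ATOM}id") or "").strip()
        results.append({
            "title": title,
            "url": url,
            "content": summary,
            # Placeholder: arXiv gives no score, so earlier results rank higher.
            "score": 1.0 - rank / max(len(entries), 1),
            # Assumption: reuse the abstract; the full paper text is not fetched.
            "raw_content": summary,
        })
    return {
        "query": query,
        "follow_up_questions": None,
        "answer": None,
        "images": [],
        "results": results,
    }


def arxiv_search(queries, max_results=5):
    """Run each query string against the arXiv API; one response per query."""
    responses = []
    for query in queries:
        params = urllib.parse.urlencode({
            "search_query": f"all:{query}",
            "start": 0,
            "max_results": max_results,
        })
        with urllib.request.urlopen(f"{ARXIV_API}?{params}", timeout=30) as resp:
            responses.append(parse_arxiv_feed(query, resp.read()))
    return responses
```

Keeping the parsing separate from the network call makes it easy to check the output format against a saved feed before wiring it into graph.py.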

Let me know if you need any help.

bartolli avatar Feb 24 '25 19:02 bartolli

Yes, do you mind creating a PR? These are nice additions.

rlancemartin avatar Feb 25 '25 05:02 rlancemartin

That's awesome! I'll try to replicate that. Thank you so much! @bartolli

LuOsorio avatar Feb 25 '25 12:02 LuOsorio

@rlancemartin Done. I'll add arXiv and PubMed in a separate PR.

bartolli avatar Feb 25 '25 19:02 bartolli

Thanks for the Exa PR! I had minor comments. Let's also add arXiv and PubMed. Please include them in the README.

rlancemartin avatar Feb 26 '25 22:02 rlancemartin

I'm running final tests for PubMed and arXiv, and will commit the changes and update the README tonight.

bartolli avatar Feb 27 '25 01:02 bartolli

@rlancemartin Added arXiv and PubMed APIs as requested! Both follow the same pattern as the other search implementations. Ready for review 👍

bartolli avatar Feb 27 '25 04:02 bartolli