AutoGPT
Scientific Paper Search and Comprehension
Duplicates
- [X] I have searched the existing issues
Summary 💡
The agent has a command to search for scientific papers. It finds papers matching the query whose full text is publicly available somewhere, and returns a list of them with their abstracts. This could be done with a vector database of embedded abstracts.
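A minimal sketch of the "vector database of embedded abstracts" idea, assuming each abstract has already been embedded by some model. The 3-dimensional vectors, the `INDEX` list, and the function names are all hypothetical and only illustrate ranking by cosine similarity. A real setup would use a proper vector store and real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_abstracts(query_vec, index, top_k=3):
    """Return the top_k papers whose abstract embedding is closest to query_vec."""
    scored = [(cosine_similarity(query_vec, paper["embedding"]), paper) for paper in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"title": paper["title"], "abstract": paper["abstract"], "score": score}
        for score, paper in scored[:top_k]
    ]

# Toy index with made-up embeddings (hypothetical data, for illustration only).
INDEX = [
    {"title": "Paper A", "abstract": "...", "embedding": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "abstract": "...", "embedding": [0.0, 1.0, 0.0]},
    {"title": "Paper C", "abstract": "...", "embedding": [0.7, 0.7, 0.0]},
]

results = search_abstracts([1.0, 0.1, 0.0], INDEX, top_k=2)
```

The query vector has to come from the same embedding model as the abstracts, otherwise the similarity scores are meaningless.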
A second command downloads the paper and converts it from PDF to a text format. Unfortunately, papers containing math equations have them as LaTeX-generated images, which require some fancy OCR to convert back to LaTeX.
Scientific papers are also often larger than the context window, so they would need to be condensed section by section into something that fits into a prompt and preserves sufficient information for GPT.
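Roughly what I mean by section-by-section condensation, as a sketch rather than a finished design. It assumes the paper has already been split into `(heading, text)` pairs, uses character counts as a crude stand-in for token counts, and takes a caller-supplied `condense` function that stands in for the actual model call. All names here are hypothetical.

```python
def pack_paragraphs(text, max_chars):
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def condense_paper(sections, condense, max_chars=2000):
    """Condense each section chunk by chunk, keeping the section headings."""
    condensed = []
    for heading, text in sections:
        parts = [condense(chunk) for chunk in pack_paragraphs(text, max_chars)]
        condensed.append(heading + "\n" + " ".join(parts))
    return "\n\n".join(condensed)

# Usage with a dummy condenser (upper-casing) in place of a model call.
demo = condense_paper(
    [("Intro", "aaaa\n\nbbbb\n\ncccc")],
    condense=lambda chunk: chunk.upper(),
    max_chars=10,
)
```

Keeping the headings outside the condensed text means the agent can still ask for a specific section later.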
Examples 🌈
GPT-3.5 doesn't really "get" text condensation and basically just summarizes, excluding crucial math equations while refusing to compromise readability. Davinci does much better but needs a good prompt and is almost as expensive as GPT-4.
Here's what I used:
Convert this excerpt from a scientific paper - which I have converted to LaTeX - to a condensed telegraphic style that avoids use of definite or indefinite articles, punctuation, and other words unnecessary for comprehension of the text. Compress it to the fewest number of words possible while retaining enough information for you (GPT) to reconstruct the text later.
There's probably a better way to do this, but I don't know what that is.
Motivation 🔦
As noted in #2528, Auto-GPT currently cannot handle PDFs. The software contracting company I work for targets a variety of unique, difficult problems, and every project has a "References" folder filled with scientific papers, all in PDF format. Unrestricted access to scientific literature will be mandatory for Auto-GPT to reach optimal performance on more complex tasks.
Open Access Paper PDF Retrieval:
- Semantic Scholar API: https://api.semanticscholar.org/api-docs/graph
- Internet Archive Scholar API: https://scholar.archive.org/api/redoc
Text/Metadata Extraction from PDF:
- PyMuPDF: https://pymupdf.readthedocs.io/en/latest/
- GROBID: https://grobid.readthedocs.io/en/latest/
Vector embedding of paper content:
- SPECTER: https://github.com/allenai/specter
Auto-generated short summary of the paper:
- SciTLDR: https://github.com/allenai/scitldr
Duplicate Significant-Gravitas/Auto-GPT-Plugins#34
@ntindle In addition to reading PDFs, this ticket wants to integrate a search engine for scientific papers, and also focuses on comprehension.
Migrate to Auto-GPT-Plugins
@Androbin That's an excellent list, thank you! I also noticed that Semantic Scholar already provides the SPECTER embeddings and SciTLDR summaries for its papers, which reduces the amount of work we have to do. I'll have to get in contact with them and open an updated issue in the plugins repo.
@zachary-kaelan Please note that
- the search endpoint does not return the `embedding` and `tldr` fields, only the details endpoint does
- even the details endpoint does not always return an `abstract` and/or `tldr`
- https://github.com/allenai/s2-folks/issues/13
- the `embedding` returned by the API is not compatible with the open source SPECTER model, so you can compare two papers but not a paper to a query
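To illustrate the two-step flow this implies (search first, then one details request per paper), here is a sketch that only builds the request URLs and picks a summary from a details response. The base URL, paths, and field names reflect my understanding of the Semantic Scholar Graph API and should be checked against the API docs linked above. The helper names are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed base URL of the Semantic Scholar Graph API; verify against the docs.
BASE = "https://api.semanticscholar.org/graph/v1"

def search_url(query, limit=10):
    """URL for the search endpoint; embedding and tldr are not requested here."""
    params = {"query": query, "limit": limit, "fields": "title,abstract,openAccessPdf"}
    return f"{BASE}/paper/search?{urlencode(params)}"

def details_url(paper_id):
    """URL for the details endpoint of one paper, where tldr and embedding are available."""
    params = {"fields": "title,abstract,tldr,embedding"}
    return f"{BASE}/paper/{paper_id}?{urlencode(params)}"

def summary_text(details):
    """Prefer the tldr text, fall back to the abstract, else None (either may be missing)."""
    tldr = details.get("tldr") or {}
    return tldr.get("text") or details.get("abstract")

first = search_url("graph neural networks", limit=5)
second = details_url("abc123")  # "abc123" is a placeholder paper id
```

Since the returned `embedding` can only be compared against other API embeddings, it is useful for "papers similar to this one" but not for matching a free-text query.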