BB3 Internet search: From search URLs to final retrieved documents
I have chatted with the BB3 model using an Internet search server from JulesGM/ParlAI_SearchEngine. However, the results are not good, mostly because the retrieved documents are very noisy. Here is an example:
Search Queries: ['dota 2']
Search URLs:
https://www.pcgamingwiki.com/wiki/Dota_2
https://en.wikipedia.org/wiki/Dota_2
https://www.oneesports.gg/dota2/valve-announces-new-dota-2-hero-muerta/
https://dotesports.com/dota-2/news/muerta-revealed-as-next-dota-2-hero-at-ti11
https://dotesports.com/dota-2/news/quincy-crew-curse-soniqs-are-out-of-dota-2-less-than-three-months-after-signing-team
Examples of search_knowledge_doc_content (retrieved documents):
Doc_1
* Explore\n* Lists\n* Games\n* Categories\n* Random page\n* Recent changes\n* Troubleshooting guide\n* Editing\n* Editing guide\n* Sample article\n* Projects\n* Taxonomy\n* Wiki policy\n* Maintenance\n* Changelog\n* Community\n* Assignments\n* Discord\n* Files\n* Files policy\n* Forums\n* PCGW Account\n* Other communities\n* About\n* About\n* Conduct\n* FAQ\n* Staff\n* Donate\n* Tools\n* What links here\n* Related changes\n* Special pages\n* Printable version\n* Permanent link\n* Page information\n* Page values\n* Talk\n* Contributions\n*
Doc_2
'# Dota 2\nFrom Wikipedia, the free encyclopedia\nJump to navigation Jump to search\n2013 video game\n2013 video game\nDota 2\nDeveloper(s)Valve\nPublisher(s)Valve\nDesigner(s)IceFrog\nWriter(s)\n* Marc Laidlaw\n* Ted Kosmatka\n* Kris Katz\nComposer(s)\n* Jason Hayes\n* Tim Larkin\nSeriesDota\nEngineSource 2[a]\nPlatform(s)\n* Windows\n* Linux\n* OS X\nRelease\n* Windows\n* July 9, 2013\n* Linux, OS X\n* July 18, 2013\nGenre(s)MOBA\nMode(s)Multiplayer\nDota 2 is a 2013 multiplayer online battle arena (MOBA) video game developed\nand'
Doc_3
"* * * * *\nAbout Press T&C Contact Us\n* Mobile Legends\n* LEAGUE OF LEGENDS\n* Valorant\n* Dota 2\n* Pick'em\n* Genshin Impact\n* Anime\n* More\n* Cosplay\n* Culture\n* Call of Duty\n* Wild Rift\n* Free Fire\n* PUBG\n* Tekken\n* Street Fighter\n* Fortnite\n* Gaming\n* Events\n* About us\n* Work with us\n* Partner with us\n* Press\n* PRIVACY\n* Contact Us\nShop\n* en\n* English\n* Bahasa Indonesia\n* Filipino\n* Tiếng Việt\n* ไทย\nLogin\nLoading...\n* Mobile Legends\n* LEAGUE OF LEGENDS\n* Valorant\n* Dota 2\n* Pick'em\n* Genshin"
As you can see, there is a lot of noise in the retrieved documents, so I wonder what the detailed implementation is for parsing the results returned by the search server (i.e., how one goes from the search URLs to the final retrieved documents). I believe a number of problems arise:
(1) How do I extract text from the HTML content of the page?
(2) The text extracted in (1) might be very long and contain a lot of noisy/irrelevant information. How do I select only the relevant part? Is BB3 using any trained model to do this kind of selection?
(3) I saw that the paper uses a "knowledge response model" to generate a sequence referred to as the knowledge response, given the full input context and a set of retrieved documents. Are these documents the full text of the pages retrieved in (1), or are they truncated?
(4) BB3 used Mojeek as its search server, while SeeKeR used the Microsoft Bing API. I wonder which one gives better results.
I have looked into the technical papers, but these problems are not discussed there. For reference, a minimal sketch of how I am calling the search server is below.
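The request/response shape in this sketch is my understanding of the protocol expected by the ParlAI search-server retriever (a POST with `q`/`n` fields returning a JSON `response` list), so please correct me if that part is wrong:

```python
import requests

# Address where the JulesGM/ParlAI_SearchEngine server is running (adjust as needed).
SEARCH_SERVER = "http://localhost:8080"


def search(query: str, num_docs: int = 5) -> list:
    """Send one search query; return a list of {'url', 'title', 'content'} dicts."""
    resp = requests.post(SEARCH_SERVER, data={"q": query, "n": num_docs})
    resp.raise_for_status()
    return resp.json()["response"]


for doc in search("dota 2"):
    print(doc["url"])
    # The 'content' field is what ends up as the noisy retrieved document text.
    print(doc["content"][:200])
```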
(1) From your examples it doesn't look like extracting content from HTML is required; however, I'm sure there are tools available online that will extract text appropriately.
(2 + 3) Indeed, selecting the appropriate knowledge sentence is an open research problem. BB3 was trained to select the most appropriate knowledge sentence from a set of relevant returned search results, and this is indeed the knowledge response module. We truncate each document to ~500 characters before providing it to the agent; these documents are newline-delimited. Also note that the full context + documents must fit into a truncation length of 1024 tokens for BB3 3B (2048 for 30B/175B).
(4) SeeKeR mapped Bing URLs to Common Crawl webpages, whereas BB3 uses snippets from Mojeek; determining which is better is an open research question as well.
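For illustration, a rough sketch of that preprocessing step (this is not the actual BB3 code: the helper names are made up, and the whitespace-based truncation at the end is only a stand-in for the model's real BPE tokenizer; the ~500-character and 1024-token figures come from the answer above):

```python
from bs4 import BeautifulSoup

DOC_CHAR_LIMIT = 500        # each retrieved document is cut to ~500 characters
CONTEXT_TOKEN_LIMIT = 1024  # BB3 3B truncation length (2048 for 30B/175B)


def html_to_text(html: str) -> str:
    """Strip tags if the search server hands back raw HTML rather than plain text."""
    return BeautifulSoup(html, "html.parser").get_text(separator="\n")


def pack_documents(context: str, documents: list) -> str:
    """Truncate each document and append them, newline-delimited, to the context."""
    truncated = [doc[:DOC_CHAR_LIMIT] for doc in documents]
    combined = context + "\n" + "\n".join(truncated)
    # Stand-in for tokenizer-level truncation: the real agent truncates to
    # CONTEXT_TOKEN_LIMIT BPE tokens, not whitespace-separated words.
    return " ".join(combined.split()[:CONTEXT_TOKEN_LIMIT])
```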