Add MediaWiki and Wikipedia Connectors
Resolves #1141.
I am happy to iterate on this with inputs from the devs.
Summary
This PR adds a general MediaWiki connector that works with most MediaWiki sites, including Wikipedia, Fandom sites, and many others. It also adds a WikipediaConnector subclass, a light wrapper around MediaWikiConnector that applies the special handling Wikipedia needs.
The connector is based on pywikibot.
It will optionally recurse over categories to obtain additional pages.
It supports both polling and loading.
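As a rough illustration of the two behaviours described above (recursing over categories, and loading versus polling), here is a minimal sketch in plain Python. It does not use pywikibot and is not the PR's code: the in-memory WIKI dictionary and the names Page, Category, collect_pages, load_all, and poll are hypothetical, invented only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Page:
    title: str
    last_modified: float  # seconds since the epoch


@dataclass
class Category:
    pages: list = field(default_factory=list)
    subcategories: list = field(default_factory=list)


# Hypothetical in-memory wiki standing in for a real MediaWiki site.
WIKI = {
    "Science": Category(pages=[Page("Physics", 100.0)], subcategories=["Biology"]),
    "Biology": Category(
        pages=[Page("Cell", 200.0), Page("Physics", 100.0)],
        subcategories=["Science"],  # deliberate cycle back to the parent
    ),
}


def collect_pages(category, recurse_depth, _seen=None):
    """Return the pages in `category`, descending at most `recurse_depth` levels."""
    seen = set() if _seen is None else _seen
    if category in seen or category not in WIKI:
        return []  # already visited (cycle) or unknown category
    seen.add(category)
    found = list(WIKI[category].pages)
    if recurse_depth > 0:
        for sub in WIKI[category].subcategories:
            found.extend(collect_pages(sub, recurse_depth - 1, seen))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


def load_all(category, recurse_depth):
    """'Load' behaviour: return every page, regardless of age."""
    return collect_pages(category, recurse_depth)


def poll(category, recurse_depth, start, end):
    """'Poll' behaviour: return only pages modified within [start, end]."""
    return [
        page
        for page in collect_pages(category, recurse_depth)
        if start <= page.last_modified <= end
    ]


print([p.title for p in load_all("Science", 1)])        # ['Physics', 'Cell']
print([p.title for p in poll("Science", 1, 150, 250)])  # ['Cell']
```

The `seen` set is what keeps the recursion from looping when categories reference each other, and the depth limit bounds how far the walk descends.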
Possible Future improvements
There is a solution for handling general MediaWiki sites which generates a Family class automatically by querying a given site using several heuristics (built into pywikibot). This will not handle any special cases, however. Wikipedia, for example, has some extra language sites that wouldn't otherwise be found by the generic technique. This special Family class is built into pywikibot, and is used here. There are many more special Family classes to deal with various sites built into pywikibot. None of these other special cases are included, because it's not clear to me which ones would be useful.
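The split described above (a built-in special Family for Wikipedia, a generic auto-generated one for everything else) can be pictured as a small lookup. This is only a sketch under assumptions: SPECIAL_FAMILIES, family_for_host, and the "auto" marker are hypothetical names for illustration and are not pywikibot's API.

```python
from urllib.parse import urlparse

# Hypothetical table: host suffix -> name of a hand-written special family.
# Only Wikipedia is listed, mirroring the single special case used in this PR.
SPECIAL_FAMILIES = {
    "wikipedia.org": "wikipedia",
}


def family_for_host(url):
    """Pick a special family when one is registered, else fall back to "auto"."""
    host = urlparse(url).hostname or ""
    for suffix, family in SPECIAL_FAMILIES.items():
        # Match the bare domain or any subdomain such as en.wikipedia.org.
        if host == suffix or host.endswith("." + suffix):
            return family
    return "auto"  # stands in for the automatically generated family


print(family_for_host("https://en.wikipedia.org/wiki/Main_Page"))      # wikipedia
print(family_for_host("https://starwars.fandom.com/wiki/Main_Page"))   # auto
```

Under this picture, supporting another special case later would amount to adding one more entry to the table.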
Additionally, there is no special handling for other types of pages, such as talk pages; just regular pages and categories.
@yuhongsun96 How can I make this easier to review?
Hi! Will try to get to it soon, apologies on the delay and thanks for your patience with us
Thanks also for the great work and contribution!
@yuhongsun96 Any update on this?
Taking a look now 🫡 , thanks!
Looks good, a couple requests:
- Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)
- Let's place the icons before request tracker in the bottom list
- Would be really nice if you also created a guide page for it in the docs: https://github.com/danswer-ai/danswer-docs
- Please rebase it, looks like only minor conflicts
Thanks for the amazing work!
@yuhongsun96
- Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)
The connectors already inherit from PollConnector: MediaWikiConnector directly, and WikipediaConnector via MediaWikiConnector.
class MediaWikiConnector(LoadConnector, PollConnector):
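To make the inheritance chain concrete, here is a self-contained sketch with stand-in base classes. The real LoadConnector and PollConnector interfaces live in the project; the method names, constructor arguments, and empty bodies below are assumptions made for illustration, not the PR's implementation.

```python
from abc import ABC, abstractmethod


class LoadConnector(ABC):
    """Stand-in for the project's load interface (assumed shape)."""

    @abstractmethod
    def load_from_state(self):
        ...


class PollConnector(ABC):
    """Stand-in for the project's poll interface (assumed shape)."""

    @abstractmethod
    def poll_source(self, start, end):
        ...


class MediaWikiConnector(LoadConnector, PollConnector):
    def __init__(self, hostname):
        self.hostname = hostname  # hypothetical constructor argument

    def load_from_state(self):
        return []  # placeholder: a real connector would yield all documents

    def poll_source(self, start, end):
        return []  # placeholder: a real connector would yield recent changes


class WikipediaConnector(MediaWikiConnector):
    def __init__(self, language_code="en"):
        # Light wrapper: only fills in the Wikipedia-specific host.
        super().__init__(hostname=f"{language_code}.wikipedia.org")


connector = WikipediaConnector("de")
print(isinstance(connector, PollConnector))  # True
print(isinstance(connector, LoadConnector))  # True
print(connector.hostname)                    # de.wikipedia.org
```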
I also swapped category: SourceCategory.ImportedKnowledge, for category: SourceCategory.AppConnection, in sources.py so that it is on the bottom section in the admin page.
Is that what you're referring to?
Also, would you like me to update the refreshFreq on page.tsx for both connectors to be a day? It's currently the default 10 minutes.
- Let's place the icons before request tracker in the bottom list
Does this look like what you're asking?
- Would be really nice if you also created a guide page for it in the docs: https://github.com/danswer-ai/danswer-docs
I will open a PR doing so shortly. It may be a few days given the upcoming holiday weekend.
Ya, that's perfect, the bottom section is for "poll" connectors, the top for "load", that's the way most users think about it! Granted the Web connector does update but a lot of people already have it mentally associated the other way so we never moved it :P
I can change the poll frequency myself, that's trivial, a day seems reasonable!
Thanks for the amazing work and looking forward to the docs!