Add MediaWiki and Wikipedia Connectors

Open qthequartermasterman opened this issue 11 months ago • 7 comments

Resolves #1141.

I am happy to iterate on this with inputs from the devs.

Summary

This PR adds a general MediaWikiConnector that can connect to most MediaWiki sites, including Wikipedia, Fandom sites, and many others. It also adds a WikipediaConnector subclass, a light wrapper around MediaWikiConnector that uses pywikibot's special handling for Wikipedia.

The connector is based on pywikibot.

It will optionally recurse over categories to obtain additional pages.
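A minimal sketch of what optional category recursion can look like, written against a hypothetical in-memory wiki rather than pywikibot. The `WIKI` table, the `collect_pages` function, and the `recurse_depth` parameter are illustrative assumptions, not names from this PR.

```python
from collections import deque

# Hypothetical in-memory stand-in for a wiki: category -> (pages, subcategories).
WIKI = {
    "Science": (["Physics", "Chemistry"], ["Biology"]),
    "Biology": (["Cell", "Physics"], ["Genetics"]),
    "Genetics": (["DNA"], ["Science"]),  # deliberately cycles back to the root
}


def collect_pages(categories, recurse_depth=0):
    """Return page titles from the given categories.

    Subcategories are followed up to recurse_depth levels (0 = no recursion).
    Visited categories are tracked so cycles cannot loop forever, and each
    page title is returned only once.
    """
    seen_categories = set()
    seen_pages = set()
    pages = []
    queue = deque((name, 0) for name in categories)
    while queue:
        name, depth = queue.popleft()
        if name in seen_categories:
            continue
        seen_categories.add(name)
        page_titles, subcategories = WIKI.get(name, ([], []))
        for title in page_titles:
            if title not in seen_pages:
                seen_pages.add(title)
                pages.append(title)
        if depth < recurse_depth:
            queue.extend((sub, depth + 1) for sub in subcategories)
    return pages
```

With recursion disabled, only the pages directly in the listed categories come back. Raising `recurse_depth` pulls in pages from nested subcategories.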

It supports both polling and loading.

Possible Future improvements

For general MediaWiki sites, the connector generates a Family class automatically by querying the given site using several heuristics (built into pywikibot). This does not handle special cases, however. Wikipedia, for example, has some extra language sites that the generic technique would not find. Wikipedia's special Family class is built into pywikibot and is used here. pywikibot ships many more special Family classes for various sites. None of those other special cases are included, because it's not clear to me which ones would be useful.

Additionally, there is no special handling for other types of pages, such as talk pages; just regular pages and categories.

qthequartermasterman avatar Mar 23 '24 21:03 qthequartermasterman

@qthequartermasterman is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Mar 23 '24 21:03 vercel[bot]

@yuhongsun96 How can I make this easier to review?

qthequartermasterman avatar Apr 06 '24 17:04 qthequartermasterman

Hi! Will try to get to it soon. Apologies for the delay, and thanks for your patience with us.

Thanks also for the great work and contribution!

yuhongsun96 avatar Apr 08 '24 22:04 yuhongsun96

@yuhongsun96 Any update on this?

qthequartermasterman avatar May 23 '24 19:05 qthequartermasterman

Taking a look now 🫡 , thanks!

yuhongsun96 avatar May 23 '24 20:05 yuhongsun96

Looks good, a couple requests:

  • Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)
  • Let's place the icons before request tracker in the bottom list
  • Would be really nice if you also created a guide page for it in the docs: https://github.com/danswer-ai/danswer-docs
  • Please rebase it, looks like only minor conflicts

Screenshot 2024-05-23 at 2 15 49 PM

Thanks for the amazing work!

yuhongsun96 avatar May 23 '24 21:05 yuhongsun96

@yuhongsun96

  • Let's make these the Polling type (so it should show in the bottom section and pull in updated documents every day or so)

Both connectors already inherit from PollConnector: MediaWikiConnector directly, and WikipediaConnector via MediaWikiConnector.

class MediaWikiConnector(LoadConnector, PollConnector):
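To illustrate the inheritance chain, here is a self-contained sketch. The two base classes are simplified stand-ins, not the project's real interfaces. The method names `load_from_state` and `poll_source`, the `(title, last_modified)` tuples, and the constructor argument are assumptions made for this example only.

```python
from abc import ABC, abstractmethod


class LoadConnector(ABC):
    """Stand-in base: fetch everything in one pass."""

    @abstractmethod
    def load_from_state(self):
        ...


class PollConnector(ABC):
    """Stand-in base: fetch only what changed inside a time window."""

    @abstractmethod
    def poll_source(self, start, end):
        ...


class MediaWikiConnector(LoadConnector, PollConnector):
    def __init__(self, pages):
        # pages: list of (title, last_modified_epoch_seconds) tuples,
        # a hypothetical replacement for real wiki queries.
        self.pages = list(pages)

    def load_from_state(self):
        # Full load: return every page regardless of modification time.
        return [title for title, _ in self.pages]

    def poll_source(self, start, end):
        # Poll: return only pages modified within [start, end].
        return [title for title, modified in self.pages if start <= modified <= end]


class WikipediaConnector(MediaWikiConnector):
    """Thin subclass; inherits both load and poll behaviour unchanged."""
```

In this sketch `issubclass(WikipediaConnector, PollConnector)` is true because the subclass picks up both bases through MediaWikiConnector.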

I also swapped category: SourceCategory.ImportedKnowledge, for category: SourceCategory.AppConnection, in sources.py so that it is on the bottom section in the admin page.

Is that what you're referring to?

Also, would you like me to update the refreshFreq on page.tsx for both connectors to be a day? It's currently the default 10 minutes.

  • Let's place the icons before request tracker in the bottom list

Does this look like what you're asking?

Screenshot 2024-05-23 at 9 39 32 PM

  • Would be really nice if you also created a guide page for it in the docs: https://github.com/danswer-ai/danswer-docs

I will open a PR doing so shortly. It may be a few days given the upcoming holiday weekend.

qthequartermasterman avatar May 24 '24 02:05 qthequartermasterman

Ya, that's perfect. The bottom section is for "poll" connectors and the top is for "load"; that's the way most users think about it! Granted, the Web connector does update, but a lot of people already have it mentally associated with the top section, so we never moved it :P

I can change the poll frequency myself, that's trivial. A day seems reasonable!

Thanks for the amazing work and looking forward to the docs!

yuhongsun96 avatar May 24 '24 15:05 yuhongsun96