Support import from sanskrit wikisource

Open epicfaace opened this issue 2 years ago • 3 comments

https://sa.wikisource.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0%E0%A4%AE%E0%A5%8D

epicfaace avatar Aug 12 '22 14:08 epicfaace

@epicfaace If you can outline the steps to add support for wikisource, I'll work on it.

kvchitrapu avatar Aug 14 '22 11:08 kvchitrapu

@epicfaace, is https://github.com/ambuda-org/ambuda/issues/128 conceptually similar to this one?

kvchitrapu avatar Aug 14 '22 11:08 kvchitrapu

@kvchitrapu Yes. It would be somewhat in the vein of this PR (adding SARIT) -- https://github.com/ambuda-org/ambuda/pull/89. The steps are:

  • create a file ambuda/seed/texts/wikisource.py -- start by copying over gretil.py
  • modify the file so that, instead of reading from GRETIL, it fetches a particular page from sa.wikisource.org (perhaps using their API -- see https://www.mediawiki.org/wiki/API:Main_page)
  • you would also need to parse out the right content and turn it into a TEI XML file -- I think some of this can be automated, but cleaning up the texts might still require a manual pass. See this doc for more information on the TEI XML format we need: https://docs.google.com/document/d/18fGk7KraUZHXVDxJR_28viBY_BHyvRdq70VzLgimXq0/edit#heading=h.wukyshidb1n8
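
A very rough sketch of what the fetch-and-convert part could look like, just to make the steps above concrete. Everything named here is an assumption: `build_query`, `fetch_wikitext`, and `wikitext_to_tei` are hypothetical helpers (not existing Ambuda code), the endpoint is the usual MediaWiki `/w/api.php` path, and the TEI layout is a bare-bones guess that would need to be aligned with the format doc linked above. Splitting on blank lines is only a placeholder for real wikitext cleanup (templates, headers, and so on).

```python
import json
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint: the conventional MediaWiki API path on sa.wikisource.org.
API_URL = "https://sa.wikisource.org/w/api.php"


def build_query(title: str) -> str:
    """Build a MediaWiki `action=parse` URL that returns a page's raw wikitext."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",  # with version 2, "wikitext" is a plain string
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(title: str) -> str:
    """Download the wikitext of one page (performs a network request)."""
    request = urllib.request.Request(
        build_query(title),
        headers={"User-Agent": "ambuda-wikisource-sketch/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["parse"]["wikitext"]


def wikitext_to_tei(title: str, wikitext: str) -> str:
    """Wrap text in a minimal TEI skeleton (assumed layout, not Ambuda's spec).

    Each blank-line-separated block becomes an <lg>, each line an <l>.
    """
    tei = ET.Element("TEI")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for stanza in re.split(r"\n\s*\n", wikitext.strip()):
        lines = [line.strip() for line in stanza.splitlines() if line.strip()]
        if not lines:
            continue
        group = ET.SubElement(body, "lg")
        for line in lines:
            ET.SubElement(group, "l").text = line

    return ET.tostring(tei, encoding="unicode")
```

Usage would be something like `wikitext_to_tei(title, fetch_wikitext(title))` for a given page title, with the result written out wherever the seed script expects its XML. Keeping the network call separate from the conversion means the parsing step can be tested on saved wikitext without hitting the API.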

epicfaace avatar Aug 14 '22 16:08 epicfaace