
Save/export corpus in a more interoperable format

Open bertvandepoel opened this issue 4 years ago • 6 comments

Current situation When a corpus, compiled from imported documents and/or from one of the data sources (The Guardian, Twitter, PubMed, etc.), is saved after filtering, duplicate removal, etc., the result is a single tab file with the data grouped in columns and the documents as rows. If a user wants to use this corpus in another linguistic tool, such as AntConc or #LancsBox, that tool will have difficulty with the extra columns (these could potentially be filtered out) and won't be aware that every line is a separate article/document.

Suggested solution It would be a great feature to include a widget to save corpora in one or several common corpus formats. For example: saving the main content as plain-text .txt files with the title of the article/document as the file name. After parsing through TreeTagger, a format such as CoNLL(-U), FoLiA or TEI would make more sense. The resulting per-document files, especially plain-text .txt and CoNLL(-U), are easy to use in existing applications. This would especially benefit users who rely on tools such as Orange because they lack the skills to compile, transform and analyse corpora on their own using Python or R. It would also be a great convenience for more experienced users, saving them time, and would open up new possibilities for teaching with Orange combined with other tools (e.g. during an introduction to corpus linguistics course).

Alternative solution In principle, all features a user may want to use from linguistic tools could be added to orange3-text so exporting to a common format is no longer necessary, but this seems beyond the scope of the project as well as an immense amount of work.

Disclaimer I'm the developer of snelSLiM, a corpus linguistics tool that would benefit from this feature. I would, however, probably be able to add support for the .tab file exported by Orange in my software quite easily, but my tool has very few (if any) users, while tools such as AntConc are ubiquitous.

bertvandepoel avatar Jun 24 '20 15:06 bertvandepoel

Thanks for this thorough description! So in short (if I understand this correctly), a plain text export would be the easiest way to do it? Orange never works with CoNLL(-U) format and it would only be available in Preprocess Text after certain steps.

ajdapretnar avatar Jul 20 '20 09:07 ajdapretnar

A plain-text, per-document export would indeed be the easiest. I think CoNLL(-U) would be too involved for little extra benefit.

bertvandepoel avatar Jul 20 '20 11:07 bertvandepoel

@ajdapretnar I just noticed you've added the "help wanted" tag. Is there some information available on how to start development for orange3? I'm perhaps willing to contribute a PR, but have no idea how to get started and what the rules and practices are within the project.

bertvandepoel avatar Nov 02 '20 01:11 bertvandepoel

Hi @bertvandepoel! I am so glad you are interested in contributing! Widget development documentation is available here: https://orange3.readthedocs.io/projects/orange-development/en/latest/. I would propose forking the project and looking at the Save Data widget (to see how data is normally saved) and the Corpus object (to see what the data structure looks like).

Now, the first option would be to support this in Save Data directly, where the user would, presumably, select the option to save the data as raw (.txt). It would only be available for Corpus objects and would save everything from self.text_variables.

The other option would be to have a separate widget for saving raw files.

I am in favour of the first option. I will talk to my colleagues and let you know. In case we go with the first option, the PR would be on the https://github.com/biolab/orange3 repository.

ajdapretnar avatar Nov 02 '20 08:11 ajdapretnar

@ajdapretnar Have you heard back from colleagues yet? Both options sound like good solutions to me, though I think a separate widget may be easier to grasp for certain users. From a technical point of view, however, the first option makes more sense.

I think that with the documentation and the code of existing widgets, I should be able to get started.

bertvandepoel avatar Nov 20 '20 12:11 bertvandepoel

@bertvandepoel Yes, we discussed this and we are in favour of saving in plain text directly in Save Data. This aligns well with Import Documents, which is the loader for that format (files in folders).

So in short:

  • If Save Data receives an instance of Corpus on the input, it provides the option to save as plain text.
  • If the Corpus has a class variable (categorical), the files are saved in separate folders according to the class label.
  • Each .txt file contains the information from self.text_variables. Join them with a space.
  • The PR would be implemented in the orange3 repository.
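
To make the plan above concrete, here is a minimal sketch of the export logic in plain Python. It deliberately does not use Orange's API: documents are ordinary dicts standing in for Corpus rows, and `export_plain_text`, `safe_name`, `text_fields`, `title_field` and `class_field` are hypothetical names chosen for illustration only. Using the title as the file name follows the suggestion from the original post; how Save Data would actually name the files is not settled in this thread.

```python
import re
from pathlib import Path


def safe_name(title, fallback):
    """Turn a title or label into a filesystem-friendly name (hypothetical helper)."""
    cleaned = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return cleaned or fallback


def export_plain_text(documents, text_fields, out_dir,
                      title_field=None, class_field=None):
    """Write one .txt file per document and return the written paths.

    documents   -- list of dicts, a stand-in for the rows of a corpus
    text_fields -- names of the fields holding text, joined with a space
    title_field -- optional field used as the file name
    class_field -- optional field; if given, files go into one folder per label
    """
    out_dir = Path(out_dir)
    used = set()
    written = []
    for index, doc in enumerate(documents):
        # One sub-folder per class label, mirroring the "files in folders" layout.
        target = out_dir
        if class_field is not None:
            target = out_dir / safe_name(str(doc[class_field]), "unlabelled")
        target.mkdir(parents=True, exist_ok=True)

        fallback = f"document_{index}"
        stem = fallback
        if title_field is not None:
            stem = safe_name(str(doc.get(title_field, "")), fallback)
        path = target / f"{stem}.txt"
        # Two documents with the same title must not overwrite each other.
        if path in used:
            path = target / f"{stem}_{index}.txt"
        used.add(path)

        text = " ".join(str(doc[field]) for field in text_fields)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

For example, calling `export_plain_text(docs, ["title", "body"], "export", title_field="title", class_field="section")` would create one folder per distinct `section` value under `export/`, each holding the matching documents as .txt files.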

ajdapretnar avatar Nov 20 '20 13:11 ajdapretnar