orange3-text
Concordance: Use re module to look for words
It would be really useful for linguistic research to be able to query using regular expressions and possibly by annotations such as POS or normalized text. This would make Concordance more similar to the usual corpus manager keyword-in-context tools like SketchEngine, Manatee, KonText or CQP.
It's a nice idea, but Orange depends on vastly different data structures than SketchEngine. Orange is not, in essence, intended for querying corpora, but for visualization and machine learning. The services you've mentioned have indexed corpora in the background. Orange doesn't. So this is mostly a question of what each tool is intended for.
Querying by regular expressions is already enabled in Corpus Viewer (the view is not concordance, though, just a running text with highlighted words).
Sure, I did not mean to suggest that Orange should become a corpus query manager. However, the Text plugin has great appeal for textual analysis, and KWIC or Concordance is, in my opinion, the basic tool that should come in handy for almost any text exploration and analysis. Without regexp, querying synthetic languages (unlike English) is really problematic. As you point out, Corpus Viewer already has this feature, which made me think that adding it to Concordance might not be that difficult. But of course, I understand that this might not be in your plans.
The thing is, Orange currently uses the NLTK structure, which relies on tokens for building concordances. This leads to all sorts of problems, such as #320. Tokens, as you can imagine, cannot work with regular expressions, because the search is not looking at the whole text, but at single words.
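To illustrate the limitation, here is a minimal sketch of a pattern that spans two words. The whitespace split is only a stand-in for a tokenizer (an assumption for illustration, not the NLTK code Orange actually uses), and the sample sentence is made up.

```python
import re

text = "Querying with regular expressions helps with inflected word forms."
pattern = re.compile(r"regular\s+expressions?")

# Stand-in tokenizer (assumption): plain whitespace split, not NLTK's tokenizer.
tokens = text.split()

# Token-level search: each token is checked on its own, so a pattern
# that crosses a token boundary can never match.
token_hits = [tok for tok in tokens if pattern.search(tok)]

# Whole-text search: the pattern sees the spaces between words.
text_hits = [m.group(0) for m in pattern.finditer(text)]

print(token_hits)
print(text_hits)
```

A pattern that stays inside one word would still match at token level; the trouble starts once the query has to cross a token boundary.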
I agree that this would be a great added value, but it really comes down to who can implement this. Our lab is too small and project-dependent to be able to tackle larger side-tasks. 😞 I promise to think about it and see what can be done.
OK, I understand. I have now noticed you can actually achieve this with the Textable plugin, so it's not that urgent :)
Thanks!
Note for developers: try using the re library for search (https://docs.python.org/3/library/re.html), find the index of each match and show the text at index +- a specified range. Might work and also solve #320.
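A rough sketch of what this note suggests, assuming the document is available as one plain string. The function name regex_concordance and the width parameter are hypothetical and not part of orange3-text; only re.finditer and match.span() come from the standard library.

```python
import re

def regex_concordance(text, pattern, width=30, flags=re.IGNORECASE):
    """Return (left, match, right) tuples for every regex match in text.

    Hypothetical helper: `width` is the number of context characters
    kept on each side of a match.
    """
    rows = []
    for m in re.finditer(pattern, text, flags):
        start, end = m.span()
        if start == end:
            continue  # ignore empty matches such as those produced by r"a*"
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        rows.append((left, m.group(0), right))
    return rows

sample = "The cat sat on the mat. Cats like mats."
rows = regex_concordance(sample, r"\bcats?\b", width=10)
for left, keyword, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>10} | {keyword} | {right}")
```

This version counts context in characters rather than in words, which keeps it independent of any tokenizer; a word-based window would need an extra step to snap the slice boundaries to whitespace.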