
Concordance: Use re module to look for words

Open otichy opened this issue 3 years ago • 5 comments

It would be really useful for linguistic research to be able to query using regular expressions and possibly by annotations such as POS or normalized text. This would make Concordance more similar to the usual corpus manager keyword-in-context tools like SketchEngine, Manatee, KonText or CQP.

otichy avatar Oct 01 '21 17:10 otichy

It's a nice idea, but Orange depends on vastly different data structures than SketchEngine. Orange is not, in essence, intended for querying corpora, but for visualization and machine learning. The services you've mentioned have indexed corpora in the background. Orange doesn't. So this is mostly a question of what each tool is intended for.

Querying by regular expressions is already enabled in Corpus Viewer (the view is not concordance, though, just a running text with highlighted words).

ajdapretnar avatar Oct 04 '21 07:10 ajdapretnar

Sure, I did not mean to suggest that Orange should become a corpus query manager. However, the Text plugin has great appeal for textual analysis, and KWIC or Concordance is, in my opinion, the basic tool that should come in handy for almost any text exploration and analysis. Without regexp, querying synthetic languages (unlike English) is really problematic. As you point out, Corpus Viewer already has this feature, which made me think that adding it to Concordance might not be that difficult. But of course, I understand that this might not be in your plans.

otichy avatar Oct 09 '21 20:10 otichy

The thing is, Orange currently uses the NLTK structure, which leverages tokens for building concordances. This leads to all sorts of problems, such as #320. Tokens, as you can imagine, cannot work with regular expressions, because the search is not looking at the whole text, but at single words.

I agree that this would be a great added value, but it really comes down to who can implement this. Our lab is too small and project-dependent to be able to tackle larger side-tasks. 😞 I promise to think about it and see what can be done.

ajdapretnar avatar Oct 11 '21 06:10 ajdapretnar

OK, I understand. I have now noticed you can actually achieve this with the Textable plugin, so it's not that urgent :)

Thanks!

otichy avatar Oct 11 '21 16:10 otichy

Note for developers: try using the re library for search (https://docs.python.org/3/library/re.html), find the index of each match, and show the text from index - range to index + range. This might work and also solve #320.
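A minimal sketch of that idea, using only the standard re module on a plain string. The function name regex_concordance and the width parameter are hypothetical and not part of orange3-text; the context window here is measured in characters, not tokens, which is an assumption of this sketch.

```python
import re


def regex_concordance(text, pattern, width=30, flags=re.IGNORECASE):
    """Return (left, match, right) tuples for every regex match in text.

    Hypothetical helper, not an orange3-text API. `width` is the number
    of characters of context kept on each side of a match.
    """
    rows = []
    for m in re.finditer(pattern, text, flags):
        start, end = m.span()
        if start == end:
            continue  # ignore empty matches, e.g. from a pattern like r"\b"
        left = text[max(0, start - width):start]  # clamp at start of text
        right = text[end:end + width]             # slicing clamps at the end
        rows.append((left, m.group(0), right))
    return rows


if __name__ == "__main__":
    sample = "The cat sat on the mat. Cats were sitting on mats."
    # One pattern covers several word forms; right-align the left context.
    for left, keyword, right in regex_concordance(sample, r"\bsat\b|\bsit\w*", width=10):
        print(f"{left:>10} | {keyword} | {right}")
```

Because the search runs over the whole string, a pattern such as r"on\s+the" can also match across what would be two separate tokens, which a token-by-token lookup cannot do.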

ajdapretnar avatar Nov 15 '21 09:11 ajdapretnar