Limit search of annotation content by language

Open azaroth42 opened this issue 9 years ago • 5 comments

Should be able to limit the language of the annotation content in the search. "and" for example is "duck" in Danish. (Donald Duck is Anders And)

Currently can't do this (unless languages had URIs that could be filtered on, which they don't).

azaroth42 avatar Jul 21 '15 19:07 azaroth42

Could implement this by server-specific extensions to the q param (e.g. and@en vs and@da or whatever). Given the lack of overlap in tokens between most languages (e.g. English and Welsh, @glenrobson), in practice it's unlikely to cause many difficulties?

On the other hand, if it were a separate optional parameter, one could search for annotations in Latin without searching for a particular word.
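To make the two options concrete, here is a minimal sketch of how a server might accept both forms. It is only an illustration under assumptions: the `lang` parameter name, the `term@xx` suffix convention, and the `parse_search_request` helper are all hypothetical and not part of any published Search API. The examples use `da`, the ISO 639-1 code for Danish.

```python
from urllib.parse import parse_qs, urlsplit


def parse_search_request(url):
    """Return (term, language) from a search URL; either may be None.

    Assumptions (not from any spec): a language may be given as a
    trailing "@xx" suffix on q, or as a separate "lang" parameter.
    """
    params = parse_qs(urlsplit(url).query)
    q = params.get("q", [None])[0]
    lang = params.get("lang", [None])[0]  # hypothetical separate parameter

    if q is not None and "@" in q:
        term, _, suffix = q.rpartition("@")
        # Only treat short alphabetic suffixes as language codes, so a
        # query such as "user@example" is left untouched.
        if term and suffix.isalpha() and 2 <= len(suffix) <= 3:
            q = term
            # An explicit lang parameter takes precedence over the suffix.
            lang = lang or suffix.lower()
    return q, lang


print(parse_search_request("https://example.org/search?q=and@da"))
print(parse_search_request("https://example.org/search?lang=la"))
```

The second call shows the advantage of a separate parameter: the language can be given with no search term at all, which the suffix form cannot express.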

Otherwise, the editors agree to defer until there is more experience.

azaroth42 avatar Jul 29 '15 16:07 azaroth42

We don't generally offer search by language and assume that if someone wants Welsh they will search for a Welsh word, as you suggest. The only exception is a WW1 project for which we had some of the Welsh Newspapers OCR machine translated from Welsh to English. We haven't put this live and probably won't be looking at it until next year.

We've struggled a bit with how to make this search functionality intuitive to the user. We don't plan to show the machine-generated English, as it's not awfully accurate, so a user would search using an English word like 'war' and be shown newspaper pages that contain 'rhyfel'. We'd probably want to express the fact that the English was machine translated rather than treat it equally to the source English material with an @en tag (not sure how we would do this).

glenrobson avatar Jul 29 '15 23:07 glenrobson

I'm designing our API to search annotations, and this feature would be crucial for us. Our use cases involve searching for terms in the Latin alphabet that could be English or a romanization of Pali, Sanskrit, or Tibetan. These different possibilities trigger completely different paths in our search engine (different Lucene analyzers).

So for us it's necessary to include a language for the searched term, not only to filter results, but to make sure the search works properly.
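As a rough illustration of why the language has to arrive with the query, the sketch below routes a term through a different analysis function depending on its language tag. It is not our actual setup: the two toy functions stand in for real Lucene analyzers, and the `ANALYZERS` table, the `analyze` helper, and the choice of tags are hypothetical.

```python
import unicodedata


def analyze_english(term):
    # Toy stand-in for an English analyzer: lowercase and split on spaces.
    return term.lower().split()


def analyze_romanized(term):
    # Toy stand-in for a romanization-aware analyzer: fold diacritics so
    # that "saṃsāra" and "samsara" produce the same token.
    decomposed = unicodedata.normalize("NFD", term.lower())
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return folded.split()


# Hypothetical mapping from language tag to analysis function.
ANALYZERS = {
    "en": analyze_english,
    "sa": analyze_romanized,
    "pi": analyze_romanized,
}


def analyze(term, lang=None):
    # Fall back to the English path when no language is supplied, which
    # is exactly the case where a romanized query would be mishandled.
    analyzer = ANALYZERS.get(lang, analyze_english)
    return analyzer(term)


print(analyze("Saṃsāra", "sa"))
print(analyze("Saṃsāra", "en"))
```

The same input yields different tokens on each path, so a language used only as a post-filter on results would come too late to pick the right analysis.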

eroux avatar Oct 25 '18 10:10 eroux

We have had DH projects that produced manifests with annotation sets in multiple languages, but agree with the discussion above that the ambiguous cases are likely to be infrequent. This would potentially be more useful if content search were implemented at the Collection level, but I'm not aware of real-world examples of this.

mikeapp avatar Dec 14 '21 18:12 mikeapp

Should this be included in Search 3.0? (thumbs up/down on the comment please)

azaroth42 avatar Jun 28 '22 16:06 azaroth42