marian

How to deal with copied words in source sentences

Open • lkluo opened this issue 5 years ago • 0 comments

I am sorry this issue is not directly related to the project.

In MT, some words and phrases, such as person names and company names, are not translated but copied from the source sentence. It occurs to me that there are two possible approaches:

  • Use a shared vocabulary for the source and target languages. However, on the one hand, the vocabulary could become very large; on the other hand, the MT system may not know which words/phrases need not be translated unless it has seen them in the training set.
  • Use preprocessing, for example detecting such words/phrases as named entities, rare words, etc., and replacing them with special tokens. I have tried spaCy NER, which is not accurate enough in practice.
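
To make the second approach concrete, here is a minimal sketch of the mask-and-restore idea. It assumes the terms to protect are already known: the hand-written list below stands in for whatever detector (NER, rare-word lookup) would supply them. The function names `mask_terms` and `unmask_terms` and the `<entN>` token format are hypothetical, not part of Marian or any other toolkit, and the MT system is assumed to pass the placeholder tokens through unchanged, which in practice probably requires seeing them during training.

```python
import re


def mask_terms(sentence, terms):
    """Replace each protected term with a numbered placeholder token.

    Returns the masked sentence and a token -> original-term mapping.
    Matching is plain substring matching (no word-boundary check).
    """
    mapping = {}
    # Longest terms first, so "New York Times" wins over "New York".
    ordered = sorted(terms, key=len, reverse=True)
    if not ordered:
        return sentence, mapping
    pattern = re.compile("|".join(re.escape(t) for t in ordered))

    def substitute(match):
        token = f"<ent{len(mapping)}>"  # hypothetical token format
        mapping[token] = match.group(0)
        return token

    return pattern.sub(substitute, sentence), mapping


def unmask_terms(translation, mapping):
    """Put the original terms back in place of their placeholder tokens."""
    for token, original in mapping.items():
        translation = translation.replace(token, original)
    return translation


masked, mapping = mask_terms(
    "Tim Cook visited Apple Park.", ["Tim Cook", "Apple Park"]
)
# The masked sentence would be sent to the MT system; here a made-up
# French output is used in its place to show the restore step.
restored = unmask_terms("<ent0> a visité <ent1>.", mapping)
```

The sketch only covers the bookkeeping around translation; its quality still depends entirely on how accurately the terms are detected, which is the weak point noted above.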

I tried Google Translate and other translation apps and found that, to some extent, their systems can identify the words/phrases to copy, though not perfectly. Could someone advise what, in general, the best solution to this problem is? Thanks.

lkluo • May 08 '19 10:05