
TransformersTranslator is exaggerating its responsibilities and changing Inputs

Open danielbichuetti opened this issue 3 years ago • 2 comments

Problem:

  • When you use the TransformersTranslator node and send it a list of Documents, the node correctly outputs the translated texts. But it also changes the input, so we end up with inputs == outputs after using the node.

Example code:

from haystack import Document
from haystack.nodes import TransformersTranslator

DOCS_PT = [
    Document(content="A reiteração que aqui é exigida é a de que, em diversas ocasiões, o Supremo Tribunal venha decidindo uma matéria com maioria e, em determinado momento, por provocação ou de ofício, resolva editar a súmula, buscando o qualificado quorum de dois terços."),
    Document(content="Veda-se, desse modo, a possibilidade da edição de uma súmula vinculante com fundamento em decisão judicial isolada, pois  necessário que ela reflita uma jurisprudência do Tribunal, ou seja, reiterados julgados no mesmo sentido, é dizer, com a mesma interpretação."),
    Document(content="Por mais relevante que seja a matéria, não se concebe a edição de um enunciado sumular vinculante no primeiro ou no segundo julgamento."),
    Document(content="Trata-se de uma questão democrática, que tem na divisão de Poderes o seu ponto central."),
    Document(content="Pudesse o STF emitir súmulas vinculantes a qualquer momento, estar-se-ia admitindo que a discussão pelo restante do sistema teria sido tornada despicienda."),
    Document(content="O Supremo Tribunal, à revelia da Constituição, estaria estabelecendo um adiantamento liminar de sentido.")
]

pt_to_en = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-mul-en", use_gpu=True)
DOCS_EN = pt_to_en.translate(documents=DOCS_PT, query=None)
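
One way to confirm a report like this is to snapshot the contents before calling the node and compare them afterwards. The sketch below is only an illustration: `Doc` and `fake_translate_in_place` are hypothetical stand-ins (not Haystack classes) that imitate the reported behavior, so the check runs without Haystack or a model download.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: just content plus metadata."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def fake_translate_in_place(documents: List[Doc]) -> List[Doc]:
    """Imitates the reported behavior: overwrites every input and returns the same objects."""
    for doc in documents:
        doc.content = doc.content.upper()  # stand-in for a real translation
    return documents


def inputs_were_mutated(
    translate: Callable[[List[Doc]], List[Doc]], documents: List[Doc]
) -> bool:
    """Return True if calling `translate` changed the content of any input document."""
    before = [doc.content for doc in documents]
    translate(documents)
    after = [doc.content for doc in documents]
    return before != after


docs_pt = [Doc("Trata-se de uma questão democrática.")]
mutated = inputs_were_mutated(fake_translate_in_place, docs_pt)
print(mutated)
```

With the real node, the same comparison could presumably be made by passing a small wrapper around `pt_to_en.translate(documents=..., query=None)` as the `translate` argument.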

Past discussion:

@ZanSara

The original aim, most likely, was to make the document store contain only documents in a uniform language, so that the Retriever can work properly. In fact, if the retriever is monolingual, documents written in other languages can break it. So the Translator makes sure that the original document, written in another language, is no longer present, to avoid this scenario. On the other hand, though, I don't think this is the right way to handle the situation :sweat_smile: I can see two ways we can improve this:

  • Move or duplicate the original text into the metadata (so at least it can be retrieved later). This could be a quick and easy fix for your situation. It could even be done without modifying Haystack, for now.
  • Check if one can filter documents by language before sending them to the Retriever. This could be implemented in the Retrievers themselves, or by a new node, depending on how it is done.
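
The metadata suggestion can be tried from user code without touching the library. A minimal sketch, assuming a document is just content plus a metadata dict: `Doc` is a hypothetical stand-in, and the `original_content` key is an illustrative name, not a Haystack convention. Deep-copying first also shields the caller's list from any in-place change.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: just content plus metadata."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def copies_with_original_in_meta(
    documents: List[Doc], key: str = "original_content"
) -> List[Doc]:
    """Deep-copy the documents and stash each original text under `key` in the copy's metadata."""
    clones = copy.deepcopy(documents)
    for clone in clones:
        clone.meta[key] = clone.content
    return clones


originals = [Doc("Trata-se de uma questão democrática.")]
to_translate = copies_with_original_in_meta(originals)

# Simulate a translator that overwrites the documents it receives.
to_translate[0].content = "It is a democratic question."

print(originals[0].content)
print(to_translate[0].meta["original_content"])
```

The copies, rather than the originals, would then be the list handed to the translator, so the source text survives both in the untouched originals and in the metadata of the translated copies.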

@danielbichuetti

Inputs and outputs are two different concepts in haystack, especially in how it is coded. No other node returns outputs and at the same time replaces its inputs. Indeed, the node's other behavior, translating a query, doesn't replace the query itself. Consider 51 million American court documents that have been indexed. The user then decides to translate these documents into German, to start working on a new dataset for a distilled German model. The user is working with some document store. After running the pipeline, they discover that all the original documents were lost to the Transformers translation, and translating them back recovers only a piece of the original. Maybe we should leave any issues regarding how the user stores documents to the document store itself.

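from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import TransformersTranslator

es_doc_store_pt = ElasticsearchDocumentStore(host="some.es.cloud.domain.ai", scheme="https", username="luigi", password="mario", index="court_decisions_pt", analyzer="pt")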

DOCS_PT = [
    Document(content="A reiteração que aqui é exigida é a de que, em diversas ocasiões, o Supremo Tribunal venha decidindo uma matéria com maioria e, em determinado momento, por provocação ou de ofício, resolva editar a súmula, buscando o qualificado quorum de dois terços."),
    Document(content="Veda-se, desse modo, a possibilidade da edição de uma súmula vinculante com fundamento em decisão judicial isolada, pois  necessário que ela reflita uma jurisprudência do Tribunal, ou seja, reiterados julgados no mesmo sentido, é dizer, com a mesma interpretação."),
    Document(content="Por mais relevante que seja a matéria, não se concebe a edição de um enunciado sumular vinculante no primeiro ou no segundo julgamento."),
    Document(content="Trata-se de uma questão democrática, que tem na divisão de Poderes o seu ponto central."),
    Document(content="Pudesse o STF emitir súmulas vinculantes a qualquer momento, estar-se-ia admitindo que a discussão pelo restante do sistema teria sido tornada despicienda."),
    Document(content="O Supremo Tribunal, à revelia da Constituição, estaria estabelecendo um adiantamento liminar de sentido.")
]

pt_to_en = TransformersTranslator(model_name_or_path="Helsinki-NLP/opus-mt-mul-en", use_gpu=True)
res_helsinki_pt_en = pt_to_en.translate(documents=DOCS_PT, query=None)


es_doc_store_en = ElasticsearchDocumentStore(host="some.es.cloud.domain.ai", scheme="https", username="luigi", password="mario", index="court_decisions_en", analyzer="en")

es_doc_store_en.write_documents(res_helsinki_pt_en)

This is just a sample, but it shows how easy it would be to leave this responsibility to the developer.

Here we get to some important considerations about the misuse of the translation mechanisms and why that behavior was probably implemented in the first place:

@ZanSara

However, to stay with your example, giving the original English documents to a German retriever would result in garbage output. That's why the Translator was implemented this way: to prevent people from making such a mistake. I don't like this; I also think this is the responsibility of another node.

Proposed solution draft:

  • The main word for this change is decoupling. It's generally not good software engineering to let a method return an output and at the same time overwrite its input with the same data, reaching backward along the chain, especially when the behavior is undocumented and no strong, exceptional case makes it unavoidable.

As @ZanSara points out, there are problems if the user translates documents and then misuses the results by writing them to the same Document Store without any way to tell them apart. This could lead to a mix of languages in the same Elasticsearch index, for example. But I would note that this can only happen if the user chooses to do so, not because haystack does something quirky in the chain of responsibility.

My proposal for it:

  • Make the Translator node work like all other nodes, and like itself when handling a query rather than documents: output the translated documents and keep the inputs as they were sent.
  • Document properly, on the Translator node page, the problems of mixing languages in the same document store without a filtering mechanism to differentiate them when using a Retriever.
  • Include one example of using RouteDocuments to send documents to different retrievers based on some metadata (language in this case).
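
The first and third points can be sketched together in plain Python. This is an illustration of the proposed behavior under stated assumptions, not Haystack's implementation or the RouteDocuments API: `Doc`, `translate_without_mutation`, `route_by_language`, and the `language` metadata key are all hypothetical names, and `str.upper` stands in for a real translation model.

```python
from collections import defaultdict
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document: just content plus metadata."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def translate_without_mutation(
    documents: List[Doc], translate_text: Callable[[str], str], target_language: str
) -> List[Doc]:
    """Return new documents with translated content; the inputs are left as they were sent."""
    return [
        replace(
            doc,
            content=translate_text(doc.content),
            meta={**doc.meta, "language": target_language},  # tag the copy, not the input
        )
        for doc in documents
    ]


def route_by_language(documents: List[Doc]) -> Dict[str, List[Doc]]:
    """Group documents by their `language` metadata so each group can go to its own retriever."""
    routes: Dict[str, List[Doc]] = defaultdict(list)
    for doc in documents:
        routes[doc.meta.get("language", "unknown")].append(doc)
    return dict(routes)


docs_pt = [Doc("Trata-se de uma questão democrática.", {"language": "pt"})]
docs_en = translate_without_mutation(docs_pt, str.upper, "en")  # str.upper fakes the model
routes = route_by_language(docs_pt + docs_en)
print(sorted(routes))
```

Because the translated copies carry their own language tag, both versions can coexist and still be separated before retrieval, which is the situation the documentation point is meant to warn about.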

danielbichuetti, Jul 27 '22 12:07

So, may I move forward and start this change? @sjrl

danielbichuetti, Aug 02 '22 13:08

Yes that sounds good to me!

sjrl, Aug 02 '22 14:08