
Remove articles from machine translation process

Open andrewtavis opened this issue 1 year ago • 10 comments

Terms

Languages

All languages

Description

One thing that's coming from the new machine translation process in #81 and #88 is that we're routinely getting articles included in the translations. One way of fixing this is querying the articles from Wikidata for each language and then, for each key, removing the article and the space after it if it appears at the start of the translation. There could also be a user-facing option to remove these from translation outputs, but I personally am not sure on this.

Happy to discuss and implement or help implement this!

andrewtavis avatar Mar 09 '24 16:03 andrewtavis

CC @wkyoshida and @shashank-iitbhu via our discussion in the sync :)

andrewtavis avatar Mar 09 '24 16:03 andrewtavis

We can add a helper function like remove_articles and then remove articles after each batch is translated. I tried this for English articles.

# Articles include a trailing space so they only match at the start of a phrase.
english_articles = ["a ", "an ", "the "]
translated_words = [remove_articles(translation, english_articles) for translation in translated_words]

remove_articles function:

def remove_articles(translation, articles):
    for article in articles:
        # Case-insensitive check for a leading article (trailing space included).
        if translation.lower().startswith(article):
            return translation[len(article):]
    return translation
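
For example (hypothetical calls, assuming the helper above):

remove_articles("The Apple", english_articles)  # "Apple"
remove_articles("Banana", english_articles)     # "Banana" (unchanged)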

Results: (screenshot)

shashank-iitbhu avatar Mar 16 '24 14:03 shashank-iitbhu

We could also consider writing a separate script to remove articles, running it after the translation process is complete. First, we'll need to query Wikidata to retrieve the articles for the Scribe languages. I'll look into this.

@andrewtavis, what do you think would be the best approach here?

shashank-iitbhu avatar Mar 16 '24 14:03 shashank-iitbhu

I'd say getting the articles from Wikidata for each of the languages makes more sense so it's easier for us to add new languages later on, @shashank-iitbhu :) Thanks for your consideration here!

andrewtavis avatar Mar 17 '24 14:03 andrewtavis

And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than running "a " through as the thing we compare, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.
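
Roughly, as a sketch of what I mean (illustrative only, not final code):

def remove_articles(translation, articles):
    for article in articles:
        # Add the space at comparison time so "an" doesn't match "another".
        if translation.lower().startswith(article + " "):
            return translation[len(article) + 1:]
    return translation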

andrewtavis avatar Mar 17 '24 14:03 andrewtavis

Another thing to think about here @shashank-iitbhu is that some of the values we're getting out are capitalized... This would 100% be a different issue, but maybe we can look into it once we have the lexeme IDs in the scripts and can check whether a word is a proper noun or not. Maybe we should add whether it's a proper noun to the noun queries, now that I think of it? 🤔 This would allow us to lowercase the regular noun outputs (except for German, as all nouns in German are capitalized).
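
As a sketch of the lowercasing idea (is_proper_noun here is a hypothetical flag we'd get from the queries):

def normalize_noun(noun, is_proper_noun, language):
    # German capitalizes all nouns, and proper nouns stay capitalized everywhere.
    if language == "German" or is_proper_noun:
        return noun
    return noun.lower()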

andrewtavis avatar Mar 17 '24 14:03 andrewtavis

I'll assign this to you, @shashank-iitbhu, and please let me know if further information is needed!

andrewtavis avatar Mar 17 '24 14:03 andrewtavis

> And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than running "a " through as the thing we compare, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.

I believe the optimal approach would be to split the translations into words, and then check for articles in the first word of the split. Afterward, we can concatenate the words back together. This way, we can avoid storing articles with an added space (e.g., "a ") and can directly use the articles we obtain from Wikidata.

def remove_articles(translation, articles):
    words = translation.split()
    # Articles are stored bare (no trailing space), so compare the first word directly.
    if words and words[0].lower() in articles:
        return " ".join(words[1:])
    return translation
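
For instance (hypothetical values, with the articles coming from Wikidata as bare words):

articles = {"a", "an", "the"}
remove_articles("the apple", articles)  # "apple"
remove_articles("pineapple", articles)  # "pineapple" (unchanged)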

shashank-iitbhu avatar Mar 17 '24 15:03 shashank-iitbhu

Makes total sense, @shashank-iitbhu 😊 Thanks for the suggestion!

andrewtavis avatar Mar 17 '24 15:03 andrewtavis

Via the sync, would be good to add in a query for articles for this. Happy to support, @shashank-iitbhu!

andrewtavis avatar Apr 20 '24 14:04 andrewtavis

Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

andrewtavis avatar Jul 09 '24 12:07 andrewtavis

> Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

Yes, please assign to someone else. 😄

shashank-iitbhu avatar Jul 09 '24 15:07 shashank-iitbhu

Thanks @andrewtavis, I think there are two ways it can be fixed:

  1. The hardcoded way (" a", "a ", " a "): we basically search for all the articles in a line and, if found, delete them.
  2. Use spaCy during translation (see the sketch below).

(screenshot)
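
A rough sketch of option 2, assuming the English spaCy model is installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_articles_spacy(translation):
    doc = nlp(translation)
    # spaCy tags articles as determiners (DET); note this also matches
    # words like "this", so a lemma check may be needed in practice.
    if len(doc) > 0 and doc[0].pos_ == "DET":
        return doc[1:].text
    return translation

print(remove_articles_spacy("the book"))  # book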

Which one do you think is better?

axif0 avatar Jul 10 '24 06:07 axif0

Hey @axif0 👋 My initial inclination here would be to use Wikidata to get the articles 🤔 But then using spaCy might be a better idea, as the library would handle things and we wouldn't be making unneeded API calls for something that a package can handle. The question is, though: what's spaCy's language coverage for this feature? As we're trying to cover a lot of languages, with many not being "common" within NLP tooling, it might make sense to leverage Wikidata :)

Here's an idea for the process:

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_language_articles(language):
    # SPARQL query template (braces are doubled where .format() should leave them as-is).
    query_language_template = """
    # tool: scribe-data
    SELECT
        ?article

    WHERE {{
      VALUES ?language {{ wd:{} }}
      ?lexeme dct:language ?language ;
          wikibase:lexicalCategory wd:Q2865743 ;
          wikibase:lemma ?article .
    }}
    """

    # Replace {} in the query template with the language QID.
    query = query_language_template.format(language)

    sparql.setQuery(query)
    results = sparql.query().convert()

    return results["results"]

We'd also need to include Q3813849 along with Q2865743 (the former is for indefinite articles; the latter is for definite articles, which are already queried). I could help you finalize the results a bit, but generally we'd pass a language QID like Q1860 for English, and it would return a list of indefinite and definite articles that we could then remove from translations by checking for their presence and handling the surrounding whitespace :)

Your thoughts would be appreciated, @axif0!

andrewtavis avatar Jul 10 '24 14:07 andrewtavis

Hello @andrewtavis , Thank you for your kind reply.

from SPARQLWrapper import JSON

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_articles(qid):
    query = f"""
    SELECT DISTINCT ?article WHERE {{
      VALUES ?language {{ wd:{qid} }}
      ?lexeme dct:language ?language ;
              wikibase:lexicalCategory ?category .
      VALUES ?category {{ wd:Q2865743 wd:Q3813849 }}  # Definite and indefinite articles

      # Include both lemmas and forms.
      {{
        ?lexeme wikibase:lemma ?article .
      }} UNION {{
        ?lexeme ontolex:lexicalForm ?form .
        ?form ontolex:representation ?article .
      }}
    }}
    """

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    return [result["article"]["value"] for result in results["results"]["bindings"]]

qid = "Q1860"
articles = get_all_articles(qid)
for article in articles:
    print(article)

The output for English is a, an, the. We can get the QID from language_metadata.json.
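
As an end-to-end sketch (hypothetical wiring, reusing the remove_articles helper from earlier in the thread):

articles = set(get_all_articles("Q1860"))  # e.g. {"a", "an", "the"} for English

translations = ["the book", "an apple", "water"]
translations = [remove_articles(t, articles) for t in translations]
print(translations)  # ["book", "apple", "water"]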

One question: do we also consider partitive articles (QID Q576670)?

axif0 avatar Jul 10 '24 22:07 axif0

Hey @axif0! Great work so far! And yes, including Partitive Articles is a great thought 😊 Let's definitely include those :)

Quick note: it's always good to add syntax highlighting to your code on GitHub, which can be done by adding the language or file type (python or py, for instance) to the first set of backticks :) So like this:

    ```py
    # This would have Python syntax highlighting, but it's within backticks :)
    from scribe_data import *
    ```

andrewtavis avatar Jul 11 '24 15:07 andrewtavis

Feel free to apply the function you have there to all of the machine translation files, and then we should be good for a PR! 🚀 Thanks for this :) :)

andrewtavis avatar Jul 11 '24 15:07 andrewtavis

Closed via #175 😊 Thanks for all the efforts here, @axif0! 🚀 There's doubtless still a bit left to do, but we can do the final touches in #70 when we actually run the translation process :)

andrewtavis avatar Jul 25 '24 19:07 andrewtavis