
Add Indonesian stopwords

Open hirokokinoshita opened this issue 5 years ago • 15 comments

hirokokinoshita avatar Oct 22 '20 01:10 hirokokinoshita

Thank you @hirokokinoshita for the PR. I checked the list and noticed that there are many duplicated elements. Can you remove redundant words?

> lis <- yaml::read_yaml("yaml/stopwords_id.yml")
> v <- unlist(lis)
> v[duplicated(v)]
    pronoun.possesive1     pronoun.possesive4     pronoun.possesive5     pronoun.possesive7     pronoun.possesive8 
                "saya"                 "kami"                 "kita"                 "kamu"              "saudara" 
    pronoun.possesive9    pronoun.possesive10    pronoun.possesive11    pronoun.possesive12    pronoun.possesive13 
                 "kau"               "engkau"                  "dia"                   "ia"               "mereka" 
 pronoun.interogative2 pronoun.interogative12 pronoun.interogative14 pronoun.interogative22   pronoun.contraction1 
                "yang"                "kapan"                "kalau"               "dimana"                 "saya" 
  pronoun.contraction5   pronoun.contraction8   pronoun.contraction9  pronoun.contraction10  pronoun.contraction32 
                 "ini"                 "kita"                 "kami"               "mereka"            "saya akan" 
 pronoun.contraction34  pronoun.contraction35  pronoun.contraction36  pronoun.contraction37  pronoun.contraction38 
           "kamu akan"            "anda akan"              "ia akan"             "dia akan"            "kami akan" 
 pronoun.contraction39  pronoun.contraction40            verb.basic4           verb.basic11           verb.basic14 
           "kita akan"          "mereka akan"                 "dulu"                  "ada"             "memiliki" 
          verb.basic15            verb.modal5           verb.modal14           verb.modal15           verb.modal17 
           "mempunyai"            "sebaiknya"           "seharusnya"                "patut"                 "akan" 
          verb.modal20      verb.contraction2      verb.contraction3      verb.contraction4      verb.contraction7 
             "mungkin"                "tidak"                "tidak"                "tidak"          "tidak punya" 
     verb.contraction8      verb.contraction9     verb.contraction13     verb.contraction14     verb.contraction25 
         "tidak punya"                "tidak"           "tidak akan"               "jangan"               "adalah" 
    verb.contraction26     verb.contraction27     verb.contraction28     verb.contraction29       verb.reporting13 
               "kapan"               "dimana"               "kenapa"            "bagaimana"                 "kata" 
      verb.reporting18               article4          conjunction12          conjunction14          conjunction15 
          "mengatakan"                  "itu"                 "jika"                "kalau"                 "bila" 
         conjunction16          conjunction20          conjunction33                adverb2                adverb3 
             "apabila"                "sebab"                "waktu"               "karena"               "ketika" 
               adverb4                adverb8               adverb10               adverb17               adverb25 
              "betapa"             "sehingga"               "sekali"                  "ini"                 "sang" 
              adverb33               adverb51               adverb55               adverb59               adverb65 
          "siapa saja"               "tambah"                 "amat"             "beberapa"              "seperti" 
              adverb68               adverb70               adverb71               adverb73               adverb84 
              "begitu"               "sekali"                "tidak"                "tidak"                 "maka" 
              adverb91               adverb92               adverb93               adverb97               adverb99 
                "juga"                  "pun"               "sangat"               "sekali"                  "jua" 
             adverb100              adverb101              adverb102           preposition4           preposition9 
              "sangat"                 "amat"               "begitu"               "dengan"               "dengan" 
         preposition22          preposition27          preposition28          preposition30          preposition33 
             "menjadi"               "sampai"               "selama"            "sementara"              "sebelum" 
         preposition45          preposition46          preposition52          preposition53          preposition57 
                  "ke"                 "dari"             "di bawah"                "turun"                   "di" 
         preposition59          preposition61          preposition63          preposition64          preposition70 
                "pada"               "dengan"                "untuk"                 "lagi"                "lebih" 
         preposition71          preposition72          preposition73          preposition77          preposition78 
             "di atas"              "ke atas"                 "lagi"             "di bawah"                "bawah" 
         preposition79          preposition82          preposition86          preposition87            adjectives1 
                "lagi"                 "pula"                 "dari"             "daripada"              "sendiri" 
                 time5                  time6                 time18                 time31 
                "saat"                "waktu"                "april"                  "mei" 

koheiw avatar Oct 22 '20 04:10 koheiw

Thanks. I updated the PR with the file you sent me by email, but there are still many duplicates. They are easy to check with these R commands:

lis <- yaml::read_yaml("yaml/stopwords_id.yml")
v <- unlist(lis)
v[duplicated(v)]
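
The check above shows only the second and later occurrences of each word. When deciding which copy to keep, it can help to see every category a duplicated word appears in. The following is a minimal sketch, not part of the original thread: the small nested list `lis` is a made-up stand-in for what `yaml::read_yaml()` returns, and the category names in it are illustrative only.

```r
# Hypothetical miniature of the nested list returned by yaml::read_yaml();
# the real file has many more categories and words.
lis <- list(
  pronoun = list(
    possesive   = c("saya", "kami"),
    contraction = c("saya", "ini")
  ),
  adverb = c("ini", "sangat", "sangat")
)

# Flatten to a named character vector, as in the thread.
v <- unlist(lis)

# Words that occur more than once (names dropped so each word is listed once).
dup_words <- unique(unname(v[duplicated(v)]))

# For each duplicated word, collect the names of all entries that hold it.
where <- lapply(setNames(dup_words, dup_words), function(w) names(v)[v == w])

print(where)
```

Each element of `where` is named after a duplicated word and lists all entries that contain it, including the first occurrence that `duplicated()` does not flag.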

koheiw avatar Nov 02 '20 20:11 koheiw

I checked with the R commands you taught me and deleted some of the duplicates, but not all of them, in order to maintain the grammatical structure. I pushed the file via R.

hirokokinoshita avatar Nov 04 '20 03:11 hirokokinoshita

I agree with you that it is difficult to classify words into a single functional category, but I think we should do this based on the most common usage of the words to avoid duplicates. Is it possible?
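
Choosing the most common usage of a word is a linguistic judgment that code cannot make. Once the categories are ordered by priority, though, the mechanical part (keeping each word only in the first category that lists it) could be sketched as below. The function name `dedupe_across` and the sample data are hypothetical, and the sketch assumes a flat list of character vectors rather than the nested structure of the real file.

```r
# Keep each word only in the first category that lists it.
# Assumes `lis` is a flat, named list of character vectors ordered by priority.
dedupe_across <- function(lis) {
  seen <- character(0)
  for (nm in names(lis)) {
    words <- unique(lis[[nm]])            # drop repeats within the category
    lis[[nm]] <- words[!words %in% seen]  # drop words kept by an earlier category
    seen <- c(seen, lis[[nm]])
  }
  lis
}

# Made-up sample data for illustration.
lis <- list(
  pronoun = c("saya", "ini"),
  adverb  = c("ini", "sangat", "sangat"),
  time    = c("saat", "waktu")
)

out <- dedupe_across(lis)
print(out)

# After de-duplication the original check returns nothing.
v <- unlist(out)
print(v[duplicated(v)])
```

Reordering the categories changes which copy survives, so the order stands in for the "most common usage" decision and should be reviewed by someone who knows the language.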

koheiw avatar Nov 05 '20 11:11 koheiw

By the way, did you push your commits? It seems you did not, because nothing appears after my commits in https://github.com/koheiw/marimo/pull/7/commits.

koheiw avatar Nov 05 '20 11:11 koheiw

I have just pushed my commits. It seems to have been successful, I suppose.

hirokokinoshita avatar Nov 06 '20 05:11 hirokokinoshita

Great that you understand how to use Git! I added the original English words as comments to make it easier to work on the file.

There are a few issues that need to be solved before merging.

  • We still have duplicates.
stopwords.verb.contraction14 stopwords.verb.contraction15 stopwords.verb.contraction16 
                    "adalah"                      "kapan"                     "dimana" 
stopwords.verb.contraction17 stopwords.verb.contraction18           stopwords.article4 
                    "kenapa"                  "bagaimana"                        "itu" 
      stopwords.conjunction8      stopwords.conjunction10            stopwords.adverb5 
                      "jika"                       "bila"                   "sehingga" 
          stopwords.adverb27           stopwords.adverb31       stopwords.preposition3 
                    "tambah"                       "amat"                     "dengan" 
      stopwords.preposition8      stopwords.preposition19      stopwords.preposition33 
                    "dengan"                    "menjadi"                      "lebih" 
     stopwords.preposition36        stopwords.adjectives1              stopwords.time5 
                      "pula"                    "sendiri"                       "saat" 
             stopwords.time6 
                     "waktu" 
  • Contractions are shortened forms, but [inilah, itulah, sudah, punya, akan, ingin] do not look like contractions. We can omit the category if there are no contractions in the language.
  • Do you really need "disini adalah" as a phrase? If you write them as separate words (disini, adalah), you can remove "disini adalah". The same applies to all the phrases.
  • Don't you have modal verbs in Indonesian?
  • Can you add the days of the week and numbers in ordinal form?
  • Please check that the order of the translation roughly matches the original English. Feel free to segment the lists if they are too long (see stopwords_ar.yml).

koheiw avatar Nov 06 '20 22:11 koheiw

Thank you for adding the original English words! It helped me to brush up the dictionary.

  • Contractions are shortened forms, but [inilah, itulah, sudah, punya, akan, ingin] do not look like contractions. We can omit the category if there are no contractions in the language.
    I agree. There is no suitable word or phrase equivalent to English contractions. I just deleted the words.
  • Do you really need "disini adalah" as a phrase? If you write them as separate words (disini, adalah), you can remove "disini adalah". The same applies to all the phrases.
    I removed "disini adalah".
  • Don't you have modal verbs in Indonesian?
    It seems that I accidentally removed the words. I added the modal verbs again.
  • Can you add the days of the week and numbers in ordinal form?
    I added them :-)
  • Please check that the order of the translation roughly matches the original English. Feel free to segment the lists if they are too long (see stopwords_ar.yml).
    I inserted several segments.

Thanks !

hirokokinoshita avatar Nov 11 '20 07:11 hirokokinoshita

Sorry for not replying earlier. As far as I can tell using Google Translate, "tdk dapat" and "tdk bisa" seem to be equivalents of contractions, but "tidak" and "bukan" are simple negations equivalent to "no" or "not".

If this is true, please move the simple negations to adverb, where "no" is placed in English. Then remove the entries that consist of a simple negation and a modal verb, like "tidak bisa" and "tidak akan". We can also add "tdk" to adjectives and remove "tdk dapat" and "tdk bisa" if it does not make the patterns too ambiguous.

koheiw avatar Nov 15 '20 23:11 koheiw

Sorry that I did something complicated :-(. The word "bukan" is a negation used only when negating a noun, so we can't use "bukan" when negating a verb. "tidak" is a simple negation, but it can be used in combination with verbs etc.

hirokokinoshita avatar Nov 19 '20 07:11 hirokokinoshita

Thank you for explaining.

If "bukan" is used only to negate nouns, let's include it as a single word along with "tidak" in adverb.

"akan" and "bisa" should be single words because multi-word expressions are uncommon in stopwords lists.

koheiw avatar Nov 19 '20 12:11 koheiw

OK. Then I will add "bukan" along with "tidak" in the adverb section. "akan" means "will" and "bisa" means "can", so if these words should be listed as single words, we can't provide equivalents of contractions such as "won't" and "can't" in Indonesian. Should I remove "tidak akan" and "tidak bisa" from the list?

hirokokinoshita avatar Nov 20 '20 05:11 hirokokinoshita

Please remove "tidak akan" and "tidak bisa". "akan" and "bisa" are correctly placed in the modal verbs category. If there are no contractions in the language, you can remove the category.

koheiw avatar Nov 20 '20 05:11 koheiw

Thank you Kohei! I removed the contraction category.

hirokokinoshita avatar Nov 20 '20 05:11 hirokokinoshita

I found a few more multi-word entries. You are almost done!

"milik saya" "di antara"  "ke dalam"   "di atas"    "di bawah"
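
Entries like these can be found by searching for a space. The sketch below is illustrative only: the vector `v` is made-up sample data standing in for `unlist()` of the YAML list. It also checks the idea raised earlier in the thread, that a phrase can be dropped when its component words are already listed individually.

```r
# Hypothetical flat vector of entries (stand-in for unlist() of the YAML list).
v <- c("milik saya", "di antara", "saya", "di", "atas", "di atas", "dari")

# Entries that contain a space are multi-word expressions.
multi <- v[grepl(" ", v, fixed = TRUE)]

# Everything else is a single-word entry.
singles <- setdiff(v, multi)

# For each phrase, TRUE if every component word is already a single-word entry.
covered <- vapply(
  strsplit(multi, " ", fixed = TRUE),
  function(parts) all(parts %in% singles),
  logical(1)
)
names(covered) <- multi

print(multi)
print(covered)
```

A phrase marked `TRUE` can be removed without losing coverage. A phrase marked `FALSE` has at least one component word that would need to be added as a single entry first.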

koheiw avatar Dec 11 '20 10:12 koheiw