Add Indonesian stopwords
Thank you @hirokokinoshita for the PR. I checked the list and noticed that there are many duplicated elements. Can you remove redundant words?
> lis <- yaml::read_yaml("yaml/stopwords_id.yml")
> v <- unlist(lis)
> v[duplicated(v)]
pronoun.possesive1 pronoun.possesive4 pronoun.possesive5 pronoun.possesive7 pronoun.possesive8
"saya" "kami" "kita" "kamu" "saudara"
pronoun.possesive9 pronoun.possesive10 pronoun.possesive11 pronoun.possesive12 pronoun.possesive13
"kau" "engkau" "dia" "ia" "mereka"
pronoun.interogative2 pronoun.interogative12 pronoun.interogative14 pronoun.interogative22 pronoun.contraction1
"yang" "kapan" "kalau" "dimana" "saya"
pronoun.contraction5 pronoun.contraction8 pronoun.contraction9 pronoun.contraction10 pronoun.contraction32
"ini" "kita" "kami" "mereka" "saya akan"
pronoun.contraction34 pronoun.contraction35 pronoun.contraction36 pronoun.contraction37 pronoun.contraction38
"kamu akan" "anda akan" "ia akan" "dia akan" "kami akan"
pronoun.contraction39 pronoun.contraction40 verb.basic4 verb.basic11 verb.basic14
"kita akan" "mereka akan" "dulu" "ada" "memiliki"
verb.basic15 verb.modal5 verb.modal14 verb.modal15 verb.modal17
"mempunyai" "sebaiknya" "seharusnya" "patut" "akan"
verb.modal20 verb.contraction2 verb.contraction3 verb.contraction4 verb.contraction7
"mungkin" "tidak" "tidak" "tidak" "tidak punya"
verb.contraction8 verb.contraction9 verb.contraction13 verb.contraction14 verb.contraction25
"tidak punya" "tidak" "tidak akan" "jangan" "adalah"
verb.contraction26 verb.contraction27 verb.contraction28 verb.contraction29 verb.reporting13
"kapan" "dimana" "kenapa" "bagaimana" "kata"
verb.reporting18 article4 conjunction12 conjunction14 conjunction15
"mengatakan" "itu" "jika" "kalau" "bila"
conjunction16 conjunction20 conjunction33 adverb2 adverb3
"apabila" "sebab" "waktu" "karena" "ketika"
adverb4 adverb8 adverb10 adverb17 adverb25
"betapa" "sehingga" "sekali" "ini" "sang"
adverb33 adverb51 adverb55 adverb59 adverb65
"siapa saja" "tambah" "amat" "beberapa" "seperti"
adverb68 adverb70 adverb71 adverb73 adverb84
"begitu" "sekali" "tidak" "tidak" "maka"
adverb91 adverb92 adverb93 adverb97 adverb99
"juga" "pun" "sangat" "sekali" "jua"
adverb100 adverb101 adverb102 preposition4 preposition9
"sangat" "amat" "begitu" "dengan" "dengan"
preposition22 preposition27 preposition28 preposition30 preposition33
"menjadi" "sampai" "selama" "sementara" "sebelum"
preposition45 preposition46 preposition52 preposition53 preposition57
"ke" "dari" "di bawah" "turun" "di"
preposition59 preposition61 preposition63 preposition64 preposition70
"pada" "dengan" "untuk" "lagi" "lebih"
preposition71 preposition72 preposition73 preposition77 preposition78
"di atas" "ke atas" "lagi" "di bawah" "bawah"
preposition79 preposition82 preposition86 preposition87 adjectives1
"lagi" "pula" "dari" "daripada" "sendiri"
time5 time6 time18 time31
"saat" "waktu" "april" "mei"
Thanks. I updated it with the file you sent me via email, but there are still many duplicates. They are easy to check with R commands:
lis <- yaml::read_yaml("yaml/stopwords_id.yml")
v <- unlist(lis)
v[duplicated(v)]
I checked with the R commands you taught me and deleted some duplicates, but not all of them, in order to maintain the grammatical structure. I pushed the file via R.
I agree with you that it is difficult to classify words into a single functional category, but I think we should do this based on the most common usage of the words to avoid duplicates. Is it possible?
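To decide which category should keep each word, it may help to see every place a duplicate appears. The following is only a sketch: the small `lis` object is a made-up stand-in for the result of `yaml::read_yaml("yaml/stopwords_id.yml")`, and the `where` variable is a name introduced here for illustration.

```r
# Hypothetical miniature of the list; in practice use
# lis <- yaml::read_yaml("yaml/stopwords_id.yml")
lis <- list(
  pronoun = list(
    possesive = c("saya", "kami"),
    contraction = c("saya", "ini")
  ),
  adverb = c("ini", "sangat", "sangat")
)

# Flatten to a named vector; names encode the category path and position
v <- unlist(lis)

# Words that occur more than once anywhere in the list
dup <- unique(v[duplicated(v)])

# For each duplicated word, collect the names of all entries that hold it
where <- lapply(setNames(dup, dup), function(w) names(v)[v == w])
print(where)
```

Each element of `where` names all the entries that share one word, so the entry under the less common usage can be deleted by hand.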
By the way, did you push your commits? It seems you did not, because nothing appears after my commits in https://github.com/koheiw/marimo/pull/7/commits.
I have just pushed my commits. It seems to have been successful, I suppose.
Great that you understand how to use Git! I added the original English words as comments to make it easy to work on the file.
There are a few issues that need to be solved before merging.
- We still have duplicates.
stopwords.verb.contraction14 stopwords.verb.contraction15 stopwords.verb.contraction16
"adalah" "kapan" "dimana"
stopwords.verb.contraction17 stopwords.verb.contraction18 stopwords.article4
"kenapa" "bagaimana" "itu"
stopwords.conjunction8 stopwords.conjunction10 stopwords.adverb5
"jika" "bila" "sehingga"
stopwords.adverb27 stopwords.adverb31 stopwords.preposition3
"tambah" "amat" "dengan"
stopwords.preposition8 stopwords.preposition19 stopwords.preposition33
"dengan" "menjadi" "lebih"
stopwords.preposition36 stopwords.adjectives1 stopwords.time5
"pula" "sendiri" "saat"
stopwords.time6
"waktu"
- Contractions are shortened forms, but [inilah, itulah, sudah, punya, akan, ingin] do not look like contractions. We can omit the category if there are no contractions in the language.
- Do you really need "disini adalah" as a phrase? If you write them as separate words (disini, adalah), you can remove "disini adalah". The same applies to all the phrases.
- Don't you have modal verbs in Indonesian?
- Can you add the days of the week and numbers in ordinal form?
- Please check that the order of the translation roughly matches the original English. Feel free to segment the lists if they are too long (see stopwords_ar.yml)
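As a rough way to find phrases that can be dropped, one could list the multi-word entries and check whether each of their words already appears on its own. This is a sketch on a made-up vector, not the actual file; in practice `v` would be `unlist(yaml::read_yaml("yaml/stopwords_id.yml"))`, and `covered` is a name introduced here.

```r
# Hypothetical sample entries standing in for the flattened stopword list
v <- c("disini", "adalah", "disini adalah", "tidak", "tidak punya")

# Entries containing a space are treated as multi-word phrases
is_phrase <- grepl(" ", v, fixed = TRUE)
singles <- v[!is_phrase]
phrases <- v[is_phrase]

# TRUE when every word of the phrase is already listed as a single word
covered <- vapply(
  strsplit(phrases, " ", fixed = TRUE),
  function(parts) all(parts %in% singles),
  logical(1)
)
print(data.frame(phrase = phrases, covered = covered))
```

Phrases marked as covered can be removed without losing any word; the others need their parts added as single words first, if those parts are stopwords at all.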
Thank you for adding the original English words! It helped me to brush up the dictionary.
- Contractions are shortened forms, but [inilah, itulah, sudah, punya, akan, ingin] do not look like contractions. We can omit the category if there are no contractions in the language.
I agree. There is no suitable word or phrase equivalent to contractions in English. I just deleted the words.
- Do you really need "disini adalah" as a phrase? If you write them as separate words (disini, adalah), you can remove "disini adalah". The same applies to all the phrases.
I removed "disini adalah".
- Don't you have modal verbs in Indonesian?
It seems that I accidentally removed the words. I added the modal verbs again.
- Can you add the days of the week and numbers in ordinal form?
I added them :-)
- Please check that the order of the translation roughly matches the original English. Feel free to segment the lists if they are too long (see stopwords_ar.yml)
I inserted several segments.
Thanks !
Sorry for not replying earlier. As far as I can tell using Google Translate, "tdk dapat" and "tdk bisa" seem like equivalents of contractions, but "tidak" and "bukan" are simple negations equivalent to "no" or "not".
If this is true, please move the simple negations to adverb, where "no" is placed in English. Then, remove the entries composed of simple negations and modal verbs, like "tidak bisa" and "tidak akan". We can also add "tdk" to adjectives and remove "tdk dapat" and "tdk bisa" if it does not make the patterns too ambiguous.
Sorry that I did something complicated :-(. The word "bukan" is a negation used only when denying a noun, so we can't use "bukan" when denying a verb. "tidak" is a simple negation, but it can be used in combination with verbs etc.
Thank you for explaining.
If "bukan" is used only to negate nouns, let's include it as a single word along with "tidak" in adverb.
"akan" and "bisa" should be single words because multi-word expressions are uncommon in stopwords lists.
OK. Then I will add "bukan" along with "tidak" in the adverb section. "akan" means "will" and "bisa" means "can", so if these words should be listed as single words, we can't provide equivalents of contracted expressions such as "won't" and "can't" in Indonesian. Should I remove "tidak akan" and "tidak bisa" from the list?
Please remove "tidak akan" and "tidak bisa". "akan" and "bisa" are correctly in the modal verbs category. If there are no contractions in the language, you can remove the category.
Thank you Kohei! I removed the contraction category.
I found a few more multi-word entries. You are almost done!
"milik saya" "di antara" "ke dalam" "di atas" "di bawah"