Retrain model

Open 1653100 opened this issue 5 years ago • 2 comments

Hello, I have used your package and it is very useful. But my data is in Unicode (Vietnamese), and it is not working well. So can I use your code to retrain a new model on my own Vietnamese data? If yes, can you please help me? Thank you a lot. For a Unicode example: "Số nhà 25, ngõ 294 Kim Mã, Phường Kim Mã, Quận Ba Đình, Thành phố Hà Nội". "street" is now "ngõ", "state" is now "Quận", ... Sorry for my bad English. Looking forward to hearing from you soon.

1653100 avatar Feb 13 '20 08:02 1653100

Your English is completely fine, don't worry!

This model is trained only on Australian address data, so it will not work at all for Vietnamese addresses, and it will probably have a lot of problems with addresses from any other country.

The model itself is quite simple, so you can retrain it. You can see from my answer in issue #10 that the model produces one class per character. Since Vietnamese text uses many Unicode characters beyond the standard English alphabet (e.g. ă, â, đ, ê, ô, ơ), there are many more possible input characters. So, you have a choice:

  1. expand the number of possible characters ("vocabulary") to be bigger
  2. find a method to reduce the characters with accent marks back to their base character, e.g. ă, â -> a
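
Both options can be sketched in a few lines of Python. This is only an illustration using the standard library, not code from this package; the function names `strip_accents` and `build_vocabulary` are made up for the example. One detail worth knowing: đ/Đ has no Unicode decomposition, so it needs an explicit mapping on top of the usual accent stripping.

```python
import unicodedata

# "đ"/"Đ" have no canonical decomposition, so NFD alone leaves them unchanged.
EXTRA_MAP = str.maketrans({"đ": "d", "Đ": "D"})


def strip_accents(text: str) -> str:
    """Option 2: reduce accented letters to their base character."""
    # NFD splits e.g. "ố" into "o" plus two combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.translate(EXTRA_MAP)


def build_vocabulary(addresses):
    """Option 1: give every character seen in the data its own integer id.

    Id 0 is left free here on the assumption that it is used for padding.
    """
    chars = sorted(
        {ch for address in addresses
         for ch in unicodedata.normalize("NFC", address)}
    )
    return {ch: i + 1 for i, ch in enumerate(chars)}


sample = "Số nhà 25, ngõ 294 Kim Mã, Phường Kim Mã, Quận Ba Đình, Thành phố Hà Nội"
print(strip_accents(sample))
print(len(build_vocabulary([sample])), "distinct characters in the sample")
```

Option 2 keeps the vocabulary small but throws away tone and vowel information, so distinct words can collapse into the same string; option 1 keeps everything at the cost of a larger input layer.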

Once you have decided how you will approach the problem, you need to find a structured database of addresses. You can use this to automatically generate labelled training data.
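
As a rough sketch of what "automatically generate labelled training data" could look like for a one-class-per-character model: join the fields of a structured record into one string and record which field each character came from. The class names, the `label_address` helper and the example record below are hypothetical and are not taken from this package's training code.

```python
# Hypothetical label set; a real one depends on the fields in your database.
CLASSES = ["separator", "house_number", "street", "ward", "district", "city"]
CLASS_ID = {name: i for i, name in enumerate(CLASSES)}


def label_address(record, order, separator=", "):
    """Join the fields of one record and emit one class id per character."""
    text_parts, labels = [], []
    for i, field in enumerate(order):
        value = record[field]
        if i > 0:
            # Characters between fields get their own "separator" class.
            text_parts.append(separator)
            labels.extend([CLASS_ID["separator"]] * len(separator))
        text_parts.append(value)
        labels.extend([CLASS_ID[field]] * len(value))
    return "".join(text_parts), labels


# Illustrative record, split by hand from the address quoted in this issue.
record = {
    "house_number": "Số nhà 25",
    "street": "ngõ 294 Kim Mã",
    "ward": "Phường Kim Mã",
    "district": "Quận Ba Đình",
    "city": "Thành phố Hà Nội",
}
text, labels = label_address(
    record, ["house_number", "street", "ward", "district", "city"]
)
print(text)
print(labels[:12])
```

Varying the field order, separators, abbreviations and casing when generating the strings would probably help the model cope with the different ways people write the same address.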

jasonrig avatar Feb 21 '20 04:02 jasonrig

Thank you so much. Your answer helped me a lot. I have rebuilt the model using Keras, and it runs well, although it doesn't work as well as yours: the predictions still mislabel some characters. By the way, could I have your model outline, such as the order of the layers, the number of layers, etc.? Once again, thank you a lot. ^^ Have a nice day.

1653100 avatar Mar 14 '20 16:03 1653100