faker
Weird email and username in Chinese locale package
Describe the bug
Email addresses and usernames should not use Chinese characters, even in the Chinese locale package. No one uses Chinese in an email address or username, even in China.
Reproduction
code
// import { faker } from '@faker-js/faker';
import { faker } from '@faker-js/faker/locale/zh_CN'
export const USERS: User[] = []
export function createRandomUser(): User {
  return {
    userId: faker.datatype.uuid(),
    username: faker.internet.userName(),
    email: faker.internet.email(),
    avatar: faker.image.avatar(),
    password: faker.internet.password(),
    birthdate: faker.date.birthdate(),
    registeredAt: faker.date.past(),
  }
}
Array.from({ length: 1 }).forEach(() => {
  USERS.push(createRandomUser())
})
console.log(USERS)
output
[
  {
    userId: '88d30bb6-c783-4e56-8ffc-6778ec6e1c0a',
    username: '钰轩.侯68',
    email: '明杰_彭@gmail.com',
    avatar: 'https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/765.jpg',
    password: 'UdVxsDkMWFajEId',
    birthdate: 1964-10-12T19:43:31.378Z,
    registeredAt: 2022-04-27T11:56:33.741Z
  }
]
Additional Info
No response
https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔
@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?
Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).
> https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

That is not the case. In theory, and according to the standards, Chinese can be used in domains, usernames, and email addresses, but in practice it is not. Chinese is much harder to input than other languages.
> @shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?
> Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

Because most of the time it is not possible to use Chinese in an email address or username on any site. Sites do not accept it because Chinese is relatively difficult to handle and parse on the technical side. It is also much easier to type in English, which can be entered directly from the keyboard, one character at a time.
In the Chinese-speaking world, there are many ways to transcribe a Chinese name into English. In Hong Kong, we use our English name or our Cantonese phonetic name on our ID card. For example, the surname `陳` is `Chan`, the surname `張` is `Cheung`, and the first name `恒` is `Hang`. So someone called `張恒` might use "Cheung Hang" as his English name, "cheunghang" as his username, and "[email protected]" as his email.
Many of us also have a real English name that we chose ourselves, like `Peter` or `Simon`. So if `張恒` takes `Peter` as his English name, he might use `Peter Cheung` as his English display name, and use it for his username and email as well.
In Mainland China, Taiwan and other Mandarin-speaking places like Singapore and Malaysia, people use Pinyin (Mandarin phonetics). For example, the surname `陳` (Traditional Chinese) or `陈` (Simplified Chinese) is `Chen`, the surname `張` or `张` is `Zhang`, and the first name `恒` is `Heng`. So someone called `張恒` might use "Zhang Heng" as his English name, "zhangheng" as his username, and "[email protected]" as his email.
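To make the two romanization schemes above concrete, here is a minimal sketch. The lookup tables are hand-written and hypothetical, and they only contain the characters used as examples in this comment. This is not faker code, and a real solution would need a full dictionary plus handling for characters with several readings.

```typescript
// Hypothetical romanization tables, limited to the examples from this thread.
type RomanizationTable = Readonly<Record<string, string>>;

const PINYIN: RomanizationTable = {
  '陳': 'chen',
  '陈': 'chen',
  '張': 'zhang',
  '张': 'zhang',
  '恒': 'heng',
};

const CANTONESE: RomanizationTable = {
  '陳': 'chan',
  '張': 'cheung',
  '恒': 'hang',
};

// Romanizes a name character by character and joins the syllables into a
// lowercase username. Returns undefined if any character is not in the table.
function romanizeName(chineseName: string, table: RomanizationTable): string | undefined {
  const syllables: string[] = [];
  // split('') works per UTF-16 code unit, which is enough for the
  // Basic Multilingual Plane characters in these sample tables.
  for (const char of chineseName.split('')) {
    const latin = table[char];
    if (latin === undefined) {
      return undefined;
    }
    syllables.push(latin);
  }
  return syllables.join('');
}

console.log(romanizeName('張恒', PINYIN));
console.log(romanizeName('張恒', CANTONESE));
```

The same name gives a different username depending on the table, which is why a single built-in conversion would not fit every Chinese-speaking region.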
Let's take a look at Douyin (抖音), the Chinese version of TikTok: https://www.douyin.com/user/MS4wLjABAAAAvOpuhpSOPCAvoa6Slgg54m1DtiTBR4ac003SlM86yoxlmMF3AnnF2c8LzHEocAMj
![image](https://user-images.githubusercontent.com/8020099/175759228-72ce9a98-9607-45d8-8977-41f1f077431c.png)
https://www.douyin.com/user/MS4wLjABAAAApDszKVp0whQtJRUaaDmKnrshCmZ5gwZwcXXnvYsAUFE
This user picked `wobushixumengjie` as her username. Her displayed Chinese name is `洁梦徐`, but in Chinese the surname comes first, so her real Chinese name should be `徐梦洁`; she just entered it in reverse. The Pinyin of `徐梦洁` is `Xu Meng Jie`, which is part of her username. `wobushi` is the Pinyin (`Wo Bu Shi`) of `我不是`, meaning "I am not".
Hope this helps faker be more `fake`.
Just my opinion and idea:
I feel like this is out of scope for faker itself. Right now it uses a simple algorithm that just inserts a first name and a last name into the email. Faker is not a converter library that converts Chinese names to English ones.
So my proposal (and we can freely discuss it) would be:
Create or use a package to convert Chinese names to their English counterparts and pass them into faker's email function.
IMO we could probably add a locale like `en_CN` that contains some Chinese-sounding (first?/)last names, so it is possible to generate `Peter Cheung` as the "English" version of the Chinese name, which will then be used to generate the email.
However, the user would have to select this locale explicitly, because technically it is not Chinese anymore, and phonetically converting the text probably takes more than 50 lines of code. Some users might also explicitly want Chinese usernames and email addresses, because they have to verify that their system works with those as well. (In Germany, it is possible to use the umlauts `äöüß` in email addresses. Yes, it is rare, but some people prefer them over the ASCII-converted variants (ae, oe, ue, sz).)
export function createRandomUser(): User {
  return {
    userId: fakerZH.datatype.uuid(),
    username: fakerEN_CN.internet.userName(),
    email: fakerEN_CN.internet.email(),
    avatar: fakerZH.image.avatar(),
    password: fakerZH.internet.password(),
    birthdate: fakerZH.date.birthdate(),
    registeredAt: fakerZH.date.past(),
  }
}
If we add some kind of internal workaround, to delegate to the English Faker ourselves, then we won't be able to split faker into individual locale modules anymore.
@shtse8 What do you think about the `en_CN` locale approach?
> @shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?
> Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

There is a romanization system for Chinese characters called "pinyin" as @shtse8 said, but I'm not sure if there's an easy way to transliterate characters into it. I'll look into it.
Edit: Problem is, some Chinese characters have multiple ways to pronounce them based on context :/
And just one Google search away, typing in `pinyin npm`, the first result is:
https://www.npmjs.com/package/pinyin
There are even alternative packages, so I think this is the best workaround for now.
According to this answer on Stack Overflow: https://stackoverflow.com/a/760151/6897682 we might want to think about an option to allow/disallow non-English letters and switch strategy based on that. I would not like to have a special case just for Chinese in our code base.
Today another "affected" method and locale showed up: `internet.domainWord()`
https://discord.com/channels/929487054990110771/929544565348777984/990970477138833428
We might have to add an option `onlyAscii` or similar to some of the internet methods.
Especially with `internet.domainWord()` (or `internet.domain()` for that matter) it's kind of annoying, because it makes our CI fail over and over again (we validate domain inputs), and always because of the word `jalapeño`, which appears randomly.
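Until something like `onlyAscii` exists, a user-side workaround could look roughly like the sketch below. Both `isAscii` and `generateAscii` are hypothetical helper names, not faker APIs, and the retry approach simply assumes that the generator returns an ASCII-only value often enough.

```typescript
// Hypothetical helper: true if the string contains only 7-bit ASCII characters.
function isAscii(value: string): boolean {
  return /^[\x00-\x7F]*$/.test(value);
}

// Hypothetical helper: calls `generate` until it returns an ASCII-only value,
// and gives up with an error after `maxAttempts` tries.
function generateAscii(generate: () => string, maxAttempts: number = 100): string {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = generate();
    if (isAscii(candidate)) {
      return candidate;
    }
  }
  throw new Error('No ASCII-only value after ' + maxAttempts + ' attempts');
}

// Stand-in generator for the example; it yields a non-ASCII word first.
const sampleWords = ['jalapeño', 'pepper'];
let nextIndex = 0;
const sampleGenerator = (): string => sampleWords[nextIndex++ % sampleWords.length] as string;

console.log(generateAscii(sampleGenerator));
```

With faker, the generator argument could be something like `() => faker.internet.domainWord()`. This only hides the problem for the caller; it does not help locales whose word lists contain no ASCII entries at all.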
From what I understand, not all TLDs are even accepting internationalized domain names (wiki), so I think it is out of scope for faker to determine which are and keep track of that. Imo, domain words should just not include non-ASCII chars to keep it simple.
Perhaps locales which aren't in ASCII script should optionally be able to provide an alternative set of ASCII first names and last names to be used in contexts that require ASCII, like email addresses? For example `zh_CN`, `ar`, `el`.
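As a rough sketch of that idea: the `LocaleNames` shape and the `namesForEmail` function below are hypothetical and do not match how faker's locale definitions are actually structured. They only show the fallback logic, using the `张恒` / `Zhang Heng` example from earlier in the thread as sample data.

```typescript
// Hypothetical locale shape with optional ASCII alternatives.
interface LocaleNames {
  firstNames: string[];
  lastNames: string[];
  asciiFirstNames?: string[];
  asciiLastNames?: string[];
}

// Picks one entry from a list; injected so the caller controls randomness.
type Picker = (values: string[]) => string;

// Prefers the ASCII name lists when the locale provides them,
// otherwise falls back to the regular (possibly non-ASCII) lists.
function namesForEmail(locale: LocaleNames, pick: Picker): { firstName: string; lastName: string } {
  const firstPool =
    locale.asciiFirstNames && locale.asciiFirstNames.length > 0
      ? locale.asciiFirstNames
      : locale.firstNames;
  const lastPool =
    locale.asciiLastNames && locale.asciiLastNames.length > 0
      ? locale.asciiLastNames
      : locale.lastNames;
  return { firstName: pick(firstPool), lastName: pick(lastPool) };
}

// Sample data only; not taken from faker's zh_CN locale files.
const zhCnSample: LocaleNames = {
  firstNames: ['恒'],
  lastNames: ['张'],
  asciiFirstNames: ['Heng'],
  asciiLastNames: ['Zhang'],
};
```

Display names would keep using `firstNames` and `lastNames`, so only the ASCII-sensitive methods would need to change.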
Sample output for
Object.keys(faker.locales).forEach(locale=>{faker.setLocale(locale); console.log(`${locale}: ${faker.internet.email()}`)})
af_ZA: [email protected]
ar: [email protected]
az: [email protected]
cz: [email protected]
de: [email protected]
de_AT: [email protected]
de_CH: [email protected]
el: [email protected]
en: [email protected]
en_AU: [email protected]
en_AU_ocker: [email protected]
en_BORK: [email protected]
en_CA: [email protected]
en_GB: [email protected]
en_GH: [email protected]
en_IE: [email protected]
en_IN: [email protected]
en_NG: [email protected]
en_US: [email protected]
en_ZA: [email protected]
es: [email protected]
es_MX: [email protected]
fa: [email protected]
fi: [email protected]
fr: [email protected]
fr_BE: [email protected]
fr_CA: [email protected]
fr_CH: [email protected]
ge: [email protected]
he: [email protected]
hr: [email protected]
hu: [email protected]
hy: [email protected]
id_ID: [email protected]
it: [email protected]
ja: 太一.中村[email protected]
ko: [email protected]
lv: [email protected]
mk: [email protected]
nb_NO: [email protected]
ne: [email protected]
nl: [email protected]
nl_BE: [email protected]
pl: [email protected]
pt_BR: [email protected]
pt_PT: [email protected]
ro: [email protected]
ru: [email protected]
sk: [email protected]
sv: [email protected]
tr: [email protected]
uk: [email protected]
ur: [email protected]
vi: [email protected]
zh_CN: 鑫鹏_宋@gmail.com
zh_TW: 樂駒[email protected]
zu_ZA: [email protected]
I note there are two groups of locales with slightly different problems:
`zh_CN`, `zh_TW` and `ja` contain unstripped non-ASCII characters.
`ar`, `el`, `fa`, `ge`, `he`, `hy`, `ko`, `mk`, `ur` are stripped down and generally only contain `_.01234567890`, often giving an invalid address like [email protected]
The difference seems to come down to the fact that faker.helpers.slugify has some exceptions for Japanese and Chinese characters
https://github.com/faker-js/faker/blame/next/src/modules/helpers/index.ts#L37
slugify(string: string = ''): string {
  return string
    .replace(/ /g, '-')
    .replace(/[^\一-龠\ぁ-ゔ\ァ-ヴー\w\.\-]+/g, '');
}
Note the Chinese and Japanese characters here are not stripped but Cyrillic, Arabic, Korean are:
faker.helpers.slugify("ABCD123 靖琪 結衣 용환.예 Саве.Панговски زینہ81") //'ABCD123-靖琪-結衣-.-.-81'
... and that was originally introduced here: https://github.com/faker-js/faker/commit/0d3809d4c83f9f5c29d99040df84b7353fe32255
It seems to have caused more problems than it solved, so perhaps that could be reverted, and a more general solution found for all the non-ascii-ish locales.
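For illustration, a slugify without the Chinese/Japanese exception might look like the sketch below. `slugifyAscii` is a hypothetical name, and the Unicode normalization step is my own addition rather than anything in faker: it turns accented Latin letters into their base letters, and everything else that is not ASCII is dropped.

```typescript
// Hypothetical ASCII-only variant of slugify (not the faker implementation).
function slugifyAscii(input: string = ''): string {
  return input
    // Decompose accented letters into base letter + combining mark.
    .normalize('NFKD')
    // Drop the combining marks, so "ñ" becomes "n" and "ü" becomes "u".
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/ /g, '-')
    // Without the `u` flag, \w only matches [A-Za-z0-9_], so any remaining
    // non-ASCII character (Chinese, Cyrillic, Arabic, ...) is removed.
    .replace(/[^\w.\-]+/g, '');
}

console.log(slugifyAscii('jalapeño'));
console.log(slugifyAscii('Max Müller'));
```

Names written entirely in a non-Latin script still come out empty, so this would have to be combined with some fallback for those locales.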
I don't think that `@example.com` is any more useful than `<InsertChineseCharactersHere>@example.com`.
As a simple solution, in non-ASCII locales you could just generate a purely random local part for email addresses, like two letters followed by 5-8 numbers, e.g.
... at least it would be a valid email address.
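A minimal sketch of that fallback, assuming lowercase letters and plain `Math.random()`. The function names are hypothetical, and a real implementation inside faker would presumably use faker's own seeded random source instead.

```typescript
// Returns a random integer between min and max, both inclusive.
function randomInt(min: number, max: number): number {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Hypothetical fallback: two lowercase letters followed by 5 to 8 digits.
function randomAsciiLocalPart(): string {
  const letters = 'abcdefghijklmnopqrstuvwxyz';
  let localPart = '';
  for (let i = 0; i < 2; i++) {
    localPart += letters.charAt(randomInt(0, letters.length - 1));
  }
  const digitCount = randomInt(5, 8);
  for (let i = 0; i < digitCount; i++) {
    localPart += String(randomInt(0, 9));
  }
  return localPart;
}

console.log(randomAsciiLocalPart() + '@example.com');
```

The result has no relation to the generated person's name, which is the trade-off of this approach.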
I created #1554 as a tentative solution for this. I'm not sure it is the best long-term solution, but it at least means that all locales return valid, ASCII email addresses.
> Email addresses and usernames should not use Chinese characters, even in the Chinese locale package. No one uses Chinese in an email address or username, even in China.

At least for email addresses, the same goes for Japan. (If you enter a Japanese email address, it will be rejected by validation, even on most systems used in Japan.)
> as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers

I think this fix will help!
Thanks @kz-d, good to get a Japanese opinion too :) I guess the #1554 PR will help with #1437 as well.