Weird email and username in Chinese locale package

Open shtse8 opened this issue 2 years ago • 9 comments

Describe the bug

Emails and usernames should not use Chinese characters, even in the Chinese locale package. No one uses Chinese in an email address or username, even among Chinese speakers.

Reproduction

code

// import { faker } from '@faker-js/faker';
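import { faker } from '@faker-js/faker/locale/zh_CN'

// Shape of the generated records (not in the original snippet; inferred from the fields below)
interface User {
  userId: string
  username: string
  email: string
  avatar: string
  password: string
  birthdate: Date
  registeredAt: Date
}

export const USERS: User[] = []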

export function createRandomUser(): User {
  return {
    userId: faker.datatype.uuid(),
    username: faker.internet.userName(),
    email: faker.internet.email(),
    avatar: faker.image.avatar(),
    password: faker.internet.password(),
    birthdate: faker.date.birthdate(),
    registeredAt: faker.date.past(),
  }
}

Array.from({ length: 1 }).forEach(() => {
  USERS.push(createRandomUser())
})

console.log(USERS)

output

[
  {
    userId: '88d30bb6-c783-4e56-8ffc-6778ec6e1c0a',
    username: '钰轩.侯68',
    email: '明杰_彭@gmail.com',
    avatar: 'https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/765.jpg',
    password: 'UdVxsDkMWFajEId',
    birthdate: 1964-10-12T19:43:31.378Z,
    registeredAt: 2022-04-27T11:56:33.741Z
  }
]

Additional Info

No response

Jun 24 '22 01:06 shtse8

https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

Jun 24 '22 07:06 Shinigami92

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

Jun 24 '22 10:06 ST-DDT

https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

That is not the case here. In theory, and per the standard, Chinese can be supported in domains, usernames, and emails, but it is not practical. Chinese is much harder to input than other languages.

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

Because most sites do not allow Chinese in emails and usernames: it is hard to handle technically, and parsing Chinese is relatively difficult. It is also much easier to type in English, which can be entered directly from the keyboard, one character at a time.

In the Chinese-speaking world, there are many ways to transcribe a Chinese name into English. In Hong Kong, we use our English name or the Cantonese phonetic name on our ID card. For example, the surname 陳 is Chan, the surname 張 is Cheung, and the first name 恒 is Hang. So someone called 張恒 might use "Cheung Hang" as his English name, "cheunghang" as his username, and "[email protected]" as his email.

Many of us also have an English name that we chose ourselves, like Peter or Simon. So if 張恒 takes the English name Peter, he might use Peter Cheung as his English display name and use it for his username and email.

In Mainland China, Taiwan, and other Mandarin-speaking places like Singapore and Malaysia, people use Pinyin (the Mandarin phonetic system). For example, the surname 陳 (Traditional Chinese) or 陈 (Simplified Chinese) is Chen, the surname 張 or 张 is Zhang, and the first name 恒 is Heng. So someone called 張恒 might use "Zhang Heng" as his English name, "zhangheng" as his username, and "[email protected]" as his email.
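To make the two conventions above concrete, here is a minimal sketch that derives a username from a Chinese name. The lookup table only contains the characters mentioned in this comment, and the names `ROMANIZED`, `romanizeName`, and `toUsername` are made up for illustration; none of this is part of faker.

```typescript
// Hypothetical helper, not a faker API. The table is hand-written from the
// examples in this comment and is nowhere near complete.
type Romanization = 'cantonese' | 'pinyin';

const ROMANIZED: Record<string, Record<Romanization, string>> = {
  '陳': { cantonese: 'Chan', pinyin: 'Chen' },
  '陈': { cantonese: 'Chan', pinyin: 'Chen' },
  '張': { cantonese: 'Cheung', pinyin: 'Zhang' },
  '张': { cantonese: 'Cheung', pinyin: 'Zhang' },
  '恒': { cantonese: 'Hang', pinyin: 'Heng' },
};

// Romanize each character of the name using the chosen system.
function romanizeName(name: string, system: Romanization): string[] {
  return Array.from(name).map((ch) => {
    const entry = ROMANIZED[ch];
    if (!entry) {
      throw new Error(`No romanization known for "${ch}"`);
    }
    return entry[system];
  });
}

// Join the romanized parts into a lowercase ASCII username.
function toUsername(name: string, system: Romanization): string {
  return romanizeName(name, system).join('').toLowerCase();
}

console.log(toUsername('張恒', 'cantonese')); // cheunghang
console.log(toUsername('張恒', 'pinyin')); // zhangheng
```

A real implementation would need a full dictionary and would have to deal with characters that have more than one reading.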

Let's take a look on DouYin (抖音) (Chinese version TikTok) https://www.douyin.com/user/MS4wLjABAAAAvOpuhpSOPCAvoa6Slgg54m1DtiTBR4ac003SlM86yoxlmMF3AnnF2c8LzHEocAMj

`抖音号` is the username on the platform. This user picked `Sariel_740399`. I guess `Sariel` is their English name and `740399` is something meaningful to them, like a birthday?

https://www.douyin.com/user/MS4wLjABAAAApDszKVp0whQtJRUaaDmKnrshCmZ5gwZwcXXnvYsAUFE This user picked wobushixumengjie, while her displayed Chinese name is 洁梦徐. The surname comes first in Chinese, so her real Chinese name should be 徐梦洁; she just entered it in reverse. The Pinyin of 徐梦洁 is Xu Meng Jie, which is part of her username. wobushi is the Pinyin (Wo Bu Shi) of 我不是, meaning "I am not".

Hope this helps make faker's fake data more realistic.

Jun 25 '22 05:06 shtse8

Just my opinion and idea:

I feel like this is out of scope for faker itself. Right now it uses a simple algorithm where a first name and last name are just inserted into the email. Faker is not a converter library that converts Chinese names to English ones.

So my proposal (and we can freely discuss it) would be:

Create or use a package to convert Chinese names to their English counterparts and pass them into faker's email function.

Jun 25 '22 07:06 Shinigami92

IMO we could probably add a locale like en_CN that contains some Chinese-sounding (first?/)last names, so it is possible to generate Peter Cheung as the "English" version of the Chinese name, which will then be used to generate the email.

However, it would be up to the user to explicitly select this locale, because technically it is not Chinese anymore, and phonetically converting the text probably takes more than 50 lines of code. Some users might also explicitly want Chinese usernames and email addresses, because they have to verify that their system works with those as well. (In Germany, it is possible to use the umlauts äöüß in email addresses. Yes, it is rare, but some people prefer it over the ASCII-converted variants (ae, oe, ue, sz).)

export function createRandomUser(): User {
  return {
    userId: fakerZH.datatype.uuid(),
    username: fakerEN_CN.internet.userName(),
    email: fakerEN_CN.internet.email(),
    avatar: fakerZH.image.avatar(),
    password: fakerZH.internet.password(),
    birthdate: fakerZH.date.birthdate(),
    registeredAt: fakerZH.date.past(),
  }
}

If we add some kind of internal workaround, to delegate to the English Faker ourselves, then we won't be able to split faker into individual locale modules anymore.

@shtse8 What do you think about the en_CN locale approach?

Jun 25 '22 18:06 ST-DDT

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

There is a romanization system for Chinese characters called "pinyin" as @shtse8 said, but I'm not sure if there's an easy way to transliterate characters into it. I'll look into it.

Edit: Problem is, some Chinese characters have multiple ways to pronounce them based on context :/

Jun 25 '22 18:06 import-brain

And just one Google search away, typing in "pinyin npm", the first result is: https://www.npmjs.com/package/pinyin

There are even alternative packages.

So I think this is the best workaround for now.

According to this answer on Stack Overflow: https://stackoverflow.com/a/760151/6897682 we might want to think about an option to allow/disallow non-English letters and switch strategy based on that. I wouldn't like to have a special case just for Chinese in our code base.

Jun 26 '22 08:06 Shinigami92

Today another "affected" method and locale showed up: internet.domainWord() https://discord.com/channels/929487054990110771/929544565348777984/990970477138833428

We might have to add an option onlyAscii or similar to some of the internet methods.

Jun 27 '22 16:06 ST-DDT

Especially with internet.domainWord() (or internet.domain() for that matter) it's kind of annoying b/c it leads to our CI failing over and over again (as we validate domain inputs) and always b/c of the word jalapeño which is randomly appearing.

From what I understand, not all TLDs are even accepting internationalized domain names (wiki), so I think it is out of scope for faker to determine which are and keep track of that. Imo, domain words should just not include non-ASCII chars to keep it simple.
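Until something like that exists, a user-side workaround could look like the sketch below. `toAsciiDomainWord` is a hypothetical helper name, not a faker method; it strips diacritics through Unicode NFD normalization and then drops whatever is still outside letters, digits, and hyphens.

```typescript
// Hypothetical post-processing helper, not part of faker.
function toAsciiDomainWord(word: string): string {
  return word
    .normalize('NFD') // split "ñ" into "n" + combining tilde
    .replace(/[\u0300-\u036f]/g, '') // remove the combining marks
    .replace(/[^A-Za-z0-9-]/g, '') // drop anything still non-ASCII
    .toLowerCase();
}

console.log(toAsciiDomainWord('jalapeño')); // jalapeno
```

Note that words written entirely in a non-Latin script come back as an empty string, so a caller would still need a fallback for that case.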

Jul 17 '22 23:07 schw4rzlicht

Perhaps locales which aren't in ASCII script should optionally be able to provide an alternative set of ASCII first names and last names to be used in contexts that require ascii like email addresses? For example zh_CN, ar, el

Nov 07 '22 03:11 matthewmayer

Sample output for

    Object.keys(faker.locales).forEach(locale=>{faker.setLocale(locale); console.log(`${locale}: ${faker.internet.email()}`)})

af_ZA: [email protected]
ar: [email protected]
az: [email protected]
cz: [email protected]
de: [email protected]
de_AT: [email protected]
de_CH: [email protected]
el: [email protected]
en: [email protected]
en_AU: [email protected]
en_AU_ocker: [email protected]
en_BORK: [email protected]
en_CA: [email protected]
en_GB: [email protected]
en_GH: [email protected]
en_IE: [email protected]
en_IN: [email protected]
en_NG: [email protected]
en_US: [email protected]
en_ZA: [email protected]
es: [email protected]
es_MX: [email protected]
fa: [email protected]
fi: [email protected]
fr: [email protected]
fr_BE: [email protected]
fr_CA: [email protected]
fr_CH: [email protected]
ge: [email protected]
he: [email protected]
hr: [email protected]
hu: [email protected]
hy: [email protected]
id_ID: [email protected]
it: [email protected]
ja: 太一.中村[email protected]
ko: [email protected]
lv: [email protected]
mk: [email protected]
nb_NO: [email protected]
ne: [email protected]
nl: [email protected]
nl_BE: [email protected]
pl: [email protected]
pt_BR: [email protected]
pt_PT: [email protected]
ro: [email protected]
ru: [email protected]
sk: [email protected]
sv: [email protected]
tr: [email protected]
uk: [email protected]
ur: [email protected]
vi: [email protected]
zh_CN: 鑫鹏_宋@gmail.com
zh_TW: 樂駒[email protected]
zu_ZA: [email protected]

I note there are two groups of locales with slightly different problems:

zh_CN, zh_TW and ja contain unstripped non-ASCII characters.

ar, el, fa, ge, he, hy, ko, mk, ur are stripped down and generally only contain the characters _.01234567890, often giving an invalid address like [email protected]

Nov 12 '22 13:11 matthewmayer

The difference seems to come down to the fact that faker.helpers.slugify has some exceptions for Japanese and Chinese characters

https://github.com/faker-js/faker/blame/next/src/modules/helpers/index.ts#L37

slugify(string: string = ''): string {
    return string
      .replace(/ /g, '-')
      .replace(/[^\一-龠\ぁ-ゔ\ァ-ヴー\w\.\-]+/g, '');
  }

Note the Chinese and Japanese characters here are not stripped but Cyrillic, Arabic, Korean are:

faker.helpers.slugify("ABCD123 靖琪 結衣 용환.예 Саве.Панговски زینہ81") //'ABCD123-靖琪-結衣-.-.-81'
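For comparison, here is a standalone sketch of the same regex next to a variant without the Chinese/Japanese exceptions. Both functions are local copies written for illustration (`slugifyWithCjk` and `slugifyAsciiOnly` are made-up names), so this does not call faker itself.

```typescript
// Local copy of the regex quoted above: CJK ideographs, hiragana and
// katakana are kept, everything else outside \w . - is removed.
function slugifyWithCjk(input: string = ''): string {
  return input
    .replace(/ /g, '-')
    .replace(/[^一-龠ぁ-ゔァ-ヴー\w.\-]+/g, '');
}

// Hypothetical variant without the exceptions: only ASCII word
// characters, dots and hyphens survive.
function slugifyAsciiOnly(input: string = ''): string {
  return input
    .replace(/ /g, '-')
    .replace(/[^\w.\-]+/g, '');
}

const sample = 'ABCD123 靖琪 結衣 용환.예 Саве.Панговски';
console.log(slugifyWithCjk(sample)); // ABCD123-靖琪-結衣-.-.
console.log(slugifyAsciiOnly(sample)); // ABCD123---.-.
```

With the exceptions removed, all non-ASCII locales are treated the same way, but the result is just as empty as it already is for Korean or Cyrillic names.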

Nov 12 '22 13:11 matthewmayer

... and that was originally introduced here: https://github.com/faker-js/faker/commit/0d3809d4c83f9f5c29d99040df84b7353fe32255

It seems to have caused more problems than it solved, so perhaps that could be reverted, and a more general solution found for all the non-ascii-ish locales.

Nov 12 '22 13:11 matthewmayer

I don't think that @example.com is any more useful than <InsertChineseCharactersHere>@example.com.

Nov 12 '22 13:11 ST-DDT

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers, e.g.

[email protected]

... at least it would be a valid email address.
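A rough sketch of that idea, using plain `Math.random` rather than faker's seeded generator; `randomAsciiLocalPart` and `randomInt` are hypothetical names used only here:

```typescript
// Inclusive random integer in [min, max]. Uses Math.random for the sketch,
// so results are not reproducible with a seed.
function randomInt(min: number, max: number): number {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Two lowercase letters followed by 5 to 8 digits.
function randomAsciiLocalPart(): string {
  const letters = 'abcdefghijklmnopqrstuvwxyz';
  let result = '';
  for (let i = 0; i < 2; i++) {
    result += letters[randomInt(0, letters.length - 1)];
  }
  const digitCount = randomInt(5, 8);
  for (let i = 0; i < digitCount; i++) {
    result += String(randomInt(0, 9));
  }
  return result;
}

console.log(`${randomAsciiLocalPart()}@example.com`);
```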

Nov 12 '22 13:11 matthewmayer

I created #1554 as a tentative solution for this. I'm not sure it would be the best long-term solution, but it at least means that all locales return valid ASCII email addresses.

Nov 13 '22 10:11 matthewmayer

email and username should not using Chinese even in Chinese locale package. there is no one using Chinese as an email and username even in Chinese.

At least for email addresses, the same goes for Japan. (If you enter a Japanese email address, it will be rejected by validation, even on most systems used in Japan.)

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers

I think this fix will help!

Nov 20 '22 15:11 kz-d

Thanks @kz-d, good to get a Japanese opinion too :) I guess the #1554 PR will also help with #1437.

Nov 20 '22 16:11 matthewmayer
