faker icon indicating copy to clipboard operation
faker copied to clipboard

test(date): Add Intl-Based Tests for Date Definitions to Ensure Completeness of Weekdays and Months

Open sossost opened this issue 1 year ago • 28 comments

#2904 This PR is about Intl-based tests for the date definitions in Faker.js. These tests ensure that the weekdays and months are correctly defined and returned across different locales, enhancing the accuracy and completeness of the data.

sossost avatar May 18 '24 14:05 sossost

Deploy Preview for fakerjs ready!

Name Link
Latest commit 78c7183a1fbcb70a9210b5fcd3c712016910f513
Latest deploy log https://app.netlify.com/sites/fakerjs/deploys/667202f079e1040008bd04f0
Deploy Preview https://deploy-preview-2912.fakerjs.dev
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar May 18 '24 14:05 netlify[bot]

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.95%. Comparing base (80d5aad) to head (78c7183). Report is 194 commits behind head on next.

Additional details and impacted files
@@            Coverage Diff             @@
##             next    #2912      +/-   ##
==========================================
- Coverage   99.95%   99.95%   -0.01%     
==========================================
  Files        2987     3051      +64     
  Lines      216052   217787    +1735     
  Branches      603      948     +345     
==========================================
+ Hits       215952   217686    +1734     
- Misses        100      101       +1     
Files with missing lines Coverage Δ
src/locales/af_ZA/date/index.ts 100.00% <100.00%> (ø)
src/locales/af_ZA/date/month.ts 100.00% <100.00%> (ø)
src/locales/af_ZA/date/weekday.ts 100.00% <100.00%> (ø)
src/locales/af_ZA/index.ts 100.00% <100.00%> (ø)
src/locales/ar/date/month.ts 100.00% <100.00%> (ø)
src/locales/ar/date/weekday.ts 100.00% <100.00%> (ø)
src/locales/az/date/month.ts 100.00% <100.00%> (ø)
src/locales/az/date/weekday.ts 100.00% <100.00%> (ø)
src/locales/cs_CZ/date/month.ts 100.00% <100.00%> (ø)
src/locales/cs_CZ/date/weekday.ts 100.00% <100.00%> (ø)
... and 143 more

... and 1 file with indirect coverage changes

codecov[bot] avatar May 18 '24 16:05 codecov[bot]

Is this PR ready for review yet? Because I dont see Intl in the tests anywhere.

ST-DDT avatar May 18 '24 16:05 ST-DDT

@ST-DDT If I create tests with JavaScript Intl.DateTimeForamt, most of them will not pass. Did I proceed correctly??

     it('should use Intl.DateTimeFormat to get the month name in the correct locale', () => {
          const date = new Date(2020, 0, 1);
          const intlMonth = new Intl.DateTimeFormat(locale, {
            month: 'long',
          }).format(date);
          expect(localizedFaker.definitions.date.month.wide).toContain(
            intlMonth
          );
      });

      it('should use Intl.DateTimeFormat to get the abbreviated month name in the correct locale', () => {
          const date = new Date(2020, 0, 1);
          const intlMonth = new Intl.DateTimeFormat(locale, {
            month: 'short',
          }).format(date);
          expect(localizedFaker.definitions.date.month.abbr).toContain(
            intlMonth
          );
      });

sossost avatar May 18 '24 17:05 sossost

I'm currently on mobile and thus cannot run the tests myself. The tests look good to me (although all 12 months need to be checked). ~~Also I'm not sure whether the current setting returns only the month or the full date string, for that I need some of the failure messages.~~

ST-DDT avatar May 18 '24 17:05 ST-DDT

@ST-DDT These are some failure messages.

AssertionError: expected [ 'Luni', 'Marți', 'Miercuri', 'Joi', 'Vineri', 'Sâmbătă', 'Duminică' ] to include 'sâmbătă' AssertionError: expected [ 'нед', 'пон', 'вто', 'сре', 'чет', 'пет', 'саб' ] to include 'саб.' AssertionError: expected [ 'Hé', 'Ke', 'Sze', 'Csüt', 'Pé', 'Szo', 'Va' ] to include 'V'

Failure factors can range from missing data to capitalization, dots, etc.

   describe.each(Object.entries(allLocales))(
          'for locale %s',
          (locale, fakerLocale) => {
            let localizedFaker: Faker;
            const weekdays = Array.from(
              { length: 7 },
              (_, i) => new Date(2020, 0, i + 4)
            ); // January 4-10, 2020 are Sunday to Saturday

            beforeAll(() => {
              localizedFaker = new Faker({ locale: [fakerLocale, en, base] });
            });

            it(`should return random value from date.weekday.wide_context array for context option in ${locale}`, () => {
              if (localizedFaker.definitions.date.month.wide_context) {
                const weekday = localizedFaker.date.weekday({ context: true });
                expect(
                  localizedFaker.definitions.date.weekday.wide_context
                ).toContain(weekday);
              }
            });

            it(`should return random value from date.weekday.abbr_context array for abbreviated and context option in ${locale}`, () => {
              if (localizedFaker.definitions.date.month.abbr_context) {
                const weekday = localizedFaker.date.weekday({
                  abbreviated: true,
                  context: true,
                });
                expect(
                  localizedFaker.definitions.date.weekday.abbr_context
                ).toContain(weekday);
              }
            });

            it('should use Intl.DateTimeFormat to get the weekday name in the correct locale', () => {
              for (const date of weekdays) {
                const intlWeekday = new Intl.DateTimeFormat(locale, {
                  weekday: 'long',
                }).format(date);
                expect(localizedFaker.definitions.date.weekday.wide).toContain(
                  intlWeekday
                );
              }
            });

            it('should use Intl.DateTimeFormat to get the abbreviated weekday name in the correct locale', () => {
              for (const date of weekdays) {
                const intlWeekday = new Intl.DateTimeFormat(locale, {
                  weekday: 'short',
                }).format(date);
                expect(localizedFaker.definitions.date.weekday.abbr).toContain(
                  intlWeekday
                );
              }
            });
          }
        );
        

This is my current test code.

sossost avatar May 18 '24 17:05 sossost

Please note: We want the test to ensure that we have the correct values, currently the values might be invalid.

AssertionError: expected [ 'Luni', 'Marți', 'Miercuri', 'Joi', 'Vineri', 'Sâmbătă', 'Duminică' ] to include 'sâmbătă'

If I insert sâmbătă or Sâmbătă into google translate and translate it back and forth, then I get the lowercase variant.

ST-DDT avatar May 18 '24 17:05 ST-DDT

Could you please fix the invalid values and replace it with correct ones?

Maybe using a onetime script like:

for [locale] in allLocales {
  const weekdayPath = resolve("src/locales/${locale}/date/weekday.ts");
  const storedWeekdays = await import(weekdayPath) as DateDefinition["weekday"];
  const long = [];
  const abbr = [];
  for (const i=0;i<7;i++) {
    long.push(new Intl...);
    abbr.push(new Intl....);
  }
  storedWeekdays.long = long;
  storedWeekdays.abbr = abbr;
  writeFileSync(weekdayPath, "export default "+JSON.toString(storedWeekdays));
}

And run pnpm run preflight afterwards.

EDIT: Please post the script here, so that we can use it during the review.

ST-DDT avatar May 18 '24 17:05 ST-DDT

Please merge next into your branch to avoid merge conflicts, because we recently merged #2902.

  • #2902

ST-DDT avatar May 18 '24 18:05 ST-DDT

EDIT: Please post the script here, so that we can use it during the review.

If you post the script here, we can later integrate it in our generate locale script, so that all locales always have the correct values.

Or if you want to do that directly you can add it after generateLocaleFile:

https://github.com/faker-js/faker/blob/7dc8a18f0b906f959670627240a6e5600bf47b75/scripts/generate-locales.ts#L414-L422

ST-DDT avatar May 18 '24 18:05 ST-DDT

@ST-DDT When I run the script that creates the files in the date folder, all tests now pass. However, there are a few things to check. The data previously entered for each locale is also If Intl returns English, it will be converted to English. Also, the 'vd' code was returned by Intl in my local language, so I added an exception condition.

I am a novice developer, so please understand my embarrassing generator code.

sossost avatar May 19 '24 01:05 sossost

Rather than just overwriting the existing data, could we output a table showing where the data produced by Intl differs from what we already have in Faker? That would make it easier to see discrepancies, and loop in native speakers if needed to see why.

matthewmayer avatar May 19 '24 03:05 matthewmayer

@matthewmayer That seems like a good idea. Can someone please help?? I have other things to do, so it might take some time.

sossost avatar May 19 '24 05:05 sossost

Here are the discrepancies i could find, for "wide" data (the abbr/short data has more variation), ignoring case-only differences and ignoring data where Intl doesnt seem to provide correct data eg dv and sr_RS_latin.

data type locale Faker Intl
weekday wide fa پتچشنبه,جمعه,چهارشنبه,دوشنبه,سه شنبه,شنبه,یکشنبه پنجشنبه,جمعه,چهارشنبه,دوشنبه,سه‌شنبه,شنبه,یکشنبه
weekday wide he יום חמישי,יום ראשון,יום רביעי,יום שישי,יום שלישי,יום שני,שבת יום חמישי,יום ראשון,יום רביעי,יום שבת,יום שישי,יום שלישי,יום שני
weekday wide pt_BR Domingo,Quarta,Quinta,Sábado,Segunda,Sexta,Terça domingo,quarta-feira,quinta-feira,sábado,segunda-feira,sexta-feira,terça-feira
weekday wide pt_PT Domingo,Quarta,Quinta,Sábado,Segunda,Sexta,Terça domingo,quarta-feira,quinta-feira,sábado,segunda-feira,sexta-feira,terça-feira
weekday wide ur اتور,بدھ,پیر,جمعرات,جمعہ,منگل,ہفتہ اتوار,بدھ,پیر,جمعرات,جمعہ,منگل,ہفتہ
weekday wide zh_CN 星期二,星期六,星期三,星期四,星期天,星期五,星期一 星期二,星期六,星期日,星期三,星期四,星期五,星期一
weekday wide zh_TW 星期一,星期二,星期三,星期五,星期六,星期天,星期四 星期一,星期二,星期三,星期五,星期六,星期日,星期四
month wide ar آب,آذَار,أَيَّار,أَيْلُول,تِشْرِين ٱلْأَوَّل,تِشْرِين ٱلثَّانِي,تَمُّوز,حَزِيرَان,شُبَاط,كَانُون ٱلْأَوَّل,كَانُون ٱلثَّانِي,نَيْسَان أبريل,أغسطس,أكتوبر,ديسمبر,سبتمبر,فبراير,مارس,مايو,نوفمبر,يناير,يوليو,يونيو
month wide ur اپریل,اکتوبر,اگست,جنوری,جولائ,جون,دسمبر,ستمبر,فروری,مارچ,مئ,نومبر اپریل,اکتوبر,اگست,جنوری,جولائی,جون,دسمبر,ستمبر,فروری,مارچ,مئی,نومبر
month wide uz_UZ_latin Aprel,Avgust,Dekabr,Fevral,Iyul,Iyun,Mart,May,Noyabr,Oktyabr,Sentyabr,Yanvar Aprel,Avgust,Dekabr,Fevral,Iyul,Iyun,Mart,May,Noyabr,Oktabr,Sentabr,Yanvar
month wide vi Tháng Ba,Tháng Bảy,Tháng Chín,Tháng Giêng,Tháng Hai,Tháng Mười,Tháng Mười Hai,Tháng Mười Một,Tháng Năm,Tháng Sáu,Tháng Tám,Tháng Tư Tháng 1,Tháng 10,Tháng 11,Tháng 12,Tháng 2,Tháng 3,Tháng 4,Tháng 5,Tháng 6,Tháng 7,Tháng 8,Tháng 9

i think you really need a language speaker to check these manually. For example, in Chinese which I do know, both 星期天 and 星期日 are both perfectly valid words for "Sunday".

In Vietnamese it looks like we are spelling out the month numbers versus using digits.

In Uzbek there seems to be some debate about the correct transliteration for September and October - see https://daryo.uz/2018/01/15/sentabr-yoziladimi-yoki-sentyabr , perhaps @Mirazyzz can help clarify!

matthewmayer avatar May 19 '24 08:05 matthewmayer

Here are the discrepancies i could find, for "wide" data (the abbr/short data has more variation), ignoring case-only differences and ignoring data where Intl doesnt seem to provide correct data eg dv and sr_RS_latin.

data type locale Faker Intl weekday wide fa پتچشنبه,جمعه,چهارشنبه,دوشنبه,سه شنبه,شنبه,یکشنبه پنجشنبه,جمعه,چهارشنبه,دوشنبه,سه‌شنبه,شنبه,یکشنبه weekday wide he יום חמישי,יום ראשון,יום רביעי,יום שישי,יום שלישי,יום שני,שבת יום חמישי,יום ראשון,יום רביעי,יום שבת,יום שישי,יום שלישי,יום שני weekday wide pt_BR Domingo,Quarta,Quinta,Sábado,Segunda,Sexta,Terça domingo,quarta-feira,quinta-feira,sábado,segunda-feira,sexta-feira,terça-feira weekday wide pt_PT Domingo,Quarta,Quinta,Sábado,Segunda,Sexta,Terça domingo,quarta-feira,quinta-feira,sábado,segunda-feira,sexta-feira,terça-feira weekday wide ur اتور,بدھ,پیر,جمعرات,جمعہ,منگل,ہفتہ اتوار,بدھ,پیر,جمعرات,جمعہ,منگل,ہفتہ weekday wide zh_CN 星期二,星期六,星期三,星期四,星期天,星期五,星期一 星期二,星期六,星期日,星期三,星期四,星期五,星期一 weekday wide zh_TW 星期一,星期二,星期三,星期五,星期六,星期天,星期四 星期一,星期二,星期三,星期五,星期六,星期日,星期四 month wide ar آب,آذَار,أَيَّار,أَيْلُول,تِشْرِين ٱلْأَوَّل,تِشْرِين ٱلثَّانِي,تَمُّوز,حَزِيرَان,شُبَاط,كَانُون ٱلْأَوَّل,كَانُون ٱلثَّانِي,نَيْسَان أبريل,أغسطس,أكتوبر,ديسمبر,سبتمبر,فبراير,مارس,مايو,نوفمبر,يناير,يوليو,يونيو month wide ur اپریل,اکتوبر,اگست,جنوری,جولائ,جون,دسمبر,ستمبر,فروری,مارچ,مئ,نومبر اپریل,اکتوبر,اگست,جنوری,جولائی,جون,دسمبر,ستمبر,فروری,مارچ,مئی,نومبر month wide uz_UZ_latin Aprel,Avgust,Dekabr,Fevral,Iyul,Iyun,Mart,May,Noyabr,Oktyabr,Sentyabr,Yanvar Aprel,Avgust,Dekabr,Fevral,Iyul,Iyun,Mart,May,Noyabr,Oktabr,Sentabr,Yanvar month wide vi Tháng Ba,Tháng Bảy,Tháng Chín,Tháng Giêng,Tháng Hai,Tháng Mười,Tháng Mười Hai,Tháng Mười Một,Tháng Năm,Tháng Sáu,Tháng Tám,Tháng Tư Tháng 1,Tháng 10,Tháng 11,Tháng 12,Tháng 2,Tháng 3,Tháng 4,Tháng 5,Tháng 6,Tháng 7,Tháng 8,Tháng 9 i think you really need a language speaker to check these manually. For example, in Chinese which I do know, both 星期天 and 星期日 are both perfectly valid words for "Sunday".

In Vietnamese it looks like we are spelling out the month numbers versus using digits.

In Uzbek there seems to be some debate about the correct transliteration for September and October - see https://daryo.uz/2018/01/15/sentabr-yoziladimi-yoki-sentyabr , perhaps @Mirazyzz can help clarify!

Month names are derived from Russian, and therefore IMO the proper usage would be "sentYAbr" to reflect the Russian "сентЯбрь." Since there is no direct equivalent of the letter "Я" in our language, we use the combination of "Y" and "A." However, the debate on this matter is still ongoing, and no official approach has been decided yet. Interestingly, some official school books use both variations on different pages.

Given this uncertainty, I suggest we choose the variation that minimizes our workload and simplifies our process.

Mirazyzz avatar May 19 '24 10:05 Mirazyzz

@ST-DDT I have updated the code you reviewed. Thank you very much for your kind and detailed review. Thanks to you, I am learning a lot.

sossost avatar May 19 '24 13:05 sossost

@ST-DDT I have updated it to reflect the parts you reviewed. ynnsuis is my company account.

sossost avatar May 20 '24 12:05 sossost

I reduced the diff a bit, fixed an issue with the date/index.ts generation and generated the missing files for preview. Sorry, for directly interfering with your PR.

ST-DDT avatar May 20 '24 19:05 ST-DDT

  • Creates unnecessary definitions for all descendant locales, e.g. en_AU, en_GB, en_HK when these should probably just inherit from en.
  • Creates definitions for base locale, which it shouldn't

Currently, it does not because Intl uses de-de and we use de_DE. This might have to be changed and then filtered for redundancy though.

@sossost Could you please fix that? (Please pull before changing any code)

  • Doesn't make seperate arrays for wide and wide_context or abbr and abbr_context in locales where these are different e.g. fr or ru - see documentation of "context" parameter at fakerjs.dev/api/date#month

I'm not sure whether I get you correctly here. The context variants arent touched or generated because AFAIK there is no Intl way to do that.

ST-DDT avatar May 20 '24 19:05 ST-DDT

@ST-DDT No! thank you Actually, this is a bit of a difficult issue for me because I still lack a lot. Thank you very much for your direct and indirect help.

sossost avatar May 21 '24 00:05 sossost

  • Creates unnecessary definitions for all descendant locales, e.g. en_AU, en_GB, en_HK when these should probably just inherit from en.
  • Creates definitions for base locale, which it shouldn't

Currently, it does not because Intl uses de-de and we use de_DE. This might have to be changed and then filtered for redundancy though.

@sossost Could you please fix that? (Please pull before changing any code)

  • Doesn't make seperate arrays for wide and wide_context or abbr and abbr_context in locales where these are different e.g. fr or ru - see documentation of "context" parameter at fakerjs.dev/api/date#month

I'm not sure whether I get you correctly here. The context variants arent touched or generated because AFAIK there is no Intl way to do that.

Sorry I didn't understand exactly what to do. Can you give me more hints??

sossost avatar May 22 '24 13:05 sossost

Sorry I didn't understand exactly what to do. Can you give me more hints??

  1. Replace all
new Intl.DateTimeFormat({ locale, ...});

With

new Intl.DateTimeFormat({ locale: locale.replaceAll('_', '-'), ...});
  1. Find all instances where locale.date.weekday/month is the same as the parent locale. E.g. en is the parent of en_US and thus the locale files would be identical so the later should be skipped/not created. Some locales such as uz_UZ_latin dont have parents so they need the additional files.

ST-DDT avatar May 22 '24 13:05 ST-DDT

With a big thanks to @sosost for their work so far, but I'm starting to feel trying to do this as a script that generates everything at once is maybe a mistake.

Instead perhaps we can modify it so it's run manually and you give it a single locale as a parameter, and it attempts to generate the days and months for that locale only.

That way we don't need to code in lots of edge cases, and it's easy to run it for missing locales, and review the diff for one locale at a time.

Once generated, it's not like the days of the week or months of the year will ever change ...

matthewmayer avatar May 24 '24 00:05 matthewmayer

Sorry, for the delay. I will tackle this once I have some free time.

ST-DDT avatar Jun 12 '24 15:06 ST-DDT

Ready for review.

The locales can be more easily reviewed using:

rm -rdf src/locales
git reset next -- src/locales
git restore src/locales

pnpm run preflight

git add src/locales

ST-DDT avatar Jun 18 '24 22:06 ST-DDT

Added to the meeting notes for discussion.

ST-DDT avatar Jun 19 '24 05:06 ST-DDT

I tested with Node 20.0.0 and 22.2.0. There are differences in data for en-AU, eo, es-MX, mk, uk and vi. Mostly minor things like tweaks to the shortened versions, capitalization etc. But enough to cause havoc with people running preflight with different node versions.

matthewmayer avatar Jun 19 '24 07:06 matthewmayer

Team Notice

We talked about it in the team meeting, but we haven't been able to decide anything yet.

ST-DDT avatar Jun 20 '24 16:06 ST-DDT

Team Decision

  • We will not merge this, as it
    • the data changes frequently
    • it breaks our CI tests
    • any variant of "only generating the data" produces unpredictable results depending on the used node version
  • The data are easier to manually insert than describing how to run the custom script

Thanks for your contribution anyway ❤️

ST-DDT avatar Oct 26 '24 14:10 ST-DDT