Update validate-tooling-data to eliminate case-insensitive duplicate languages

Open Vishv04 opened this issue 9 months ago • 4 comments

What kind of change does this PR introduce? Feature - Adds case-insensitive unique validation for language entries

Issue Number:

  • Closes #1443
  • Related to #___
  • Others?

Screenshots/videos: Casing mistakes deliberately introduced in language names (image)

The validator catches the mistake (image)

If relevant, did you update the documentation?

Summary This PR introduces case-insensitive unique validation for language entries in the tooling data to solve several existing problems:

  1. Inconsistent language casing across tools (e.g., "JavaScript" vs "javascript" vs "JAVASCRIPT")
  2. Potential confusion for users seeing the same language listed multiple times

My solution: Implements a custom AJV keyword caseInsensitiveUnique that:

  • Detects and reports case-insensitive duplicates using a Set
  • Provides clear error messages for easy fixes
    ajv.addKeyword({
      keyword: 'caseInsensitiveUnique',
      type: 'array',
      // Named function expression so that `validate.errors` below resolves;
      // Ajv reads custom keyword errors from the validation function itself.
      validate: function validate(schema, data) {
        if (!Array.isArray(data)) return false;

        // Collect every language across all tools, as written and lowercased.
        const languagesSet = new Set();
        const languagesLowercaseSet = new Set();
        data.forEach((tool) => {
          if (tool.languages) {
            tool.languages.forEach((language) => {
              languagesSet.add(language);
              languagesLowercaseSet.add(language.toLowerCase());
            });
          }
        });

        // Differing sizes mean at least two spellings differ only by case.
        if (languagesSet.size !== languagesLowercaseSet.size) {
          console.error('Duplicate languages found');

          // Count the distinct spellings behind each lowercase form.
          const lowercaseMap = new Map();
          languagesSet.forEach((language) => {
            const key = language.toLowerCase();
            lowercaseMap.set(key, (lowercaseMap.get(key) || 0) + 1);
          });

          lowercaseMap.forEach((value, key) => {
            if (value > 1) {
              console.error('Duplicate found for:', key);
            }
          });

          validate.errors = [{
            keyword: 'caseInsensitiveUnique',
            message: 'array contains case-insensitive duplicates',
            params: { keyword: 'caseInsensitiveUnique' }
          }];
          return false;
        }
        return true;
      }
    });
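
As a complement to the keyword above, the following is a minimal standalone sketch of the same duplicate check that reports which spellings collide rather than only the lowercase key. The helper name `findCaseInsensitiveDuplicates` and the sample data are hypothetical and not part of the PR; it assumes the data is an array of tool objects with an optional `languages` array, as the keyword does.

```javascript
// Hypothetical helper, not part of the PR: groups spellings by their
// lowercase form and returns only the groups with more than one spelling.
function findCaseInsensitiveDuplicates(tools) {
  const variantsByKey = new Map();
  for (const tool of tools) {
    for (const language of tool.languages ?? []) {
      if (typeof language !== 'string') continue; // skip malformed entries
      const key = language.toLowerCase();
      if (!variantsByKey.has(key)) variantsByKey.set(key, new Set());
      variantsByKey.get(key).add(language);
    }
  }

  const duplicates = {};
  for (const [key, variants] of variantsByKey) {
    if (variants.size > 1) duplicates[key] = [...variants];
  }
  return duplicates;
}

// Illustrative sample data (assumed shape).
const sample = [
  { name: 'tool-a', languages: ['JavaScript', 'Go'] },
  { name: 'tool-b', languages: ['javascript', 'Go'] },
  { name: 'tool-c' },
];

console.log(findCaseInsensitiveDuplicates(sample));
// → { javascript: [ 'JavaScript', 'javascript' ] }
```

Repeating the exact same spelling ("Go" in two tools) is not flagged, which matches the keyword's behavior: only spellings that differ by case count as duplicates.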

Does this PR introduce a breaking change?
Yes

Impact:
This PR enforces case-insensitive uniqueness for language entries. Any existing tooling data that includes language names with inconsistent casing—such as "JavaScript" and "javascript"—will now fail validation. This change helps eliminate redundancy and confusion caused by duplicate entries with different letter cases.

Who is affected:
Tool maintainers and contributors who have added language entries with varying casing.

Migration Path:
Update your languages arrays to ensure that each language appears only once in a consistent format, preferably matching the casing defined in the schema enum. For example:

# ❌ Before
languages:
  - "JavaScript"
  - "javascript"
  - "Go"
  - "go"

# ✅ After
languages:
  - "JavaScript"
  - "Go"

Vishv04 avatar Mar 14 '25 11:03 Vishv04

built with Refined Cloudflare Pages Action

⚡ Cloudflare Pages Deployment

Name     Status               Preview        Last Commit
website  ✅ Ready (View Log)  Visit Preview  ac51230d289640391c705f9aa13a3e90a70bb551

github-actions[bot] avatar Mar 14 '25 11:03 github-actions[bot]

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (219521e) to head (ac51230).

Additional details and impacted files
@@            Coverage Diff            @@
##              main     #1516   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines          396       396           
  Branches       106       106           
=========================================
  Hits           396       396           

codecov[bot] avatar Mar 14 '25 11:03 codecov[bot]

Hey @Vishv04, can you tell me what change you made to the file 'validate-tooling-data.yml'? Is it necessary to change anything there? The PR is failing because it modifies this unauthorized file.

jagpreetrahi avatar May 09 '25 16:05 jagpreetrahi

Hey @Vishv04, can you tell me what change you made to the file 'validate-tooling-data.yml'? Is it necessary to change anything there? The PR is failing because it modifies this unauthorized file.

Hi @jagpreetrahi, I added a custom caseInsensitiveUnique rule in validate-tooling-data.yml to detect case-insensitive duplicates in the languages array, for example treating "JavaScript" and "javascript" as the same. The logic checks that the input is an array, then builds two sets: one with the original values and one with their lowercase versions. If the set sizes differ, it logs the duplicates and reports a validation error.

Vishv04 avatar May 14 '25 09:05 Vishv04

@jviotti, can I get your help reviewing this solution?

benjagm avatar Jun 14 '25 12:06 benjagm

I don't think there is any reason to add a new keyword here. Why not just use pattern to set a regular expression that only allows lowercase strings?

jviotti avatar Jun 16 '25 12:06 jviotti

Thank you @benjagm @jviotti for your response on this PR.

I don't think there is any reason to add a new keyword here. Why not just use pattern to set a regular expression that only allows lowercase strings?

Yes, using a pattern works, but it feels a bit like we're controlling user input, since users will naturally write "Javascript" instead of "javascript" (just my assumption, it might be wrong). I created the custom keyword to avoid forcing users to change how they enter data. Still, using a pattern could also be a workable option.

Vishv04 avatar Jun 17 '25 19:06 Vishv04

I really suggest just using pattern to avoid the extra complexity of a new keyword (mainly with AJV). The convention can be to just force everybody to write languages in lowercase.

jviotti avatar Jun 17 '25 19:06 jviotti

Okay @jviotti, I will use pattern and fix this issue. Thank you for your help.

Vishv04 avatar Jun 22 '25 17:06 Vishv04
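
The pattern-based approach agreed on in the thread could look roughly like the sketch below. The schema fragment is an assumption, not the project's actual schema: it pairs a `pattern` that rejects ASCII uppercase letters with the standard `uniqueItems` keyword, since lowercase-only entries turn case-insensitive duplicates into exact duplicates. The `checkLanguages` function is a hypothetical stand-in that evaluates only these two keywords so the example runs without a validator library.

```javascript
// Assumed schema fragment for a tool's `languages` array (illustrative only).
const languagesSchema = {
  type: 'array',
  uniqueItems: true,
  items: { type: 'string', pattern: '^[^A-Z]+$' },
};

// Hypothetical stand-in for a JSON Schema validator; it covers only the
// `pattern` and `uniqueItems` keywords used in the fragment above.
function checkLanguages(languages, schema = languagesSchema) {
  const errors = [];
  const pattern = new RegExp(schema.items.pattern);
  languages.forEach((language, index) => {
    if (typeof language !== 'string' || !pattern.test(language)) {
      errors.push(`languages[${index}] must be a lowercase string`);
    }
  });
  if (schema.uniqueItems && new Set(languages).size !== languages.length) {
    errors.push('languages must not contain duplicates');
  }
  return errors;
}

console.log(checkLanguages(['javascript', 'go', 'c++']));
// → []
console.log(checkLanguages(['JavaScript', 'javascript']));
// → [ 'languages[0] must be a lowercase string' ]
```

Note that `^[^A-Z]+$` only excludes ASCII uppercase letters; names containing non-ASCII uppercase characters would need a broader pattern.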