monorepo icon indicating copy to clipboard operation
monorepo copied to clipboard

ids for variants

Open samuelstroschein opened this issue 2 years ago • 7 comments

Emerged in https://github.com/opral/monorepo/issues/1817#issuecomment-1928163241

Problem

Variants have no unique identifier and can't be addressed by the matcher. Among the downsides are:

  1. ambiguous message lint reporting
  2. non-existent CRUD possibilities
  3. merge conflicts

1. Ambiguous message lint reporting

The report function takes a message id and a language tag to report a problem for a variant.

The language tag acts as a substitution for a missing id of a variant. The introduction of multiple variants for one language tag will break message lint rules. Addressing this problem could be the introduction of an additional matcher property:

// type of the report function used for linting messages
function report(args: {
  messageId: Message["id"];
  languageTag: LanguageTag;
+ match
  body: MessageLintReport["body"];
});

Adding a match property will be necessary to support multiple variants per language tag.

The DX of handling three components (message ids, languagetags, matchers) to report a problem for a variant is bad. Furthermore, a variant is not uniquely identifiable by the combination of a language tag and matcher! No guarantee exists that a variant with the same matcher does not exist twice for a given message:

message: {
  variants: [
    { match: [*, female], pattern: "" },
    { match: [*, female], pattern: "" },
  ]
}

Which message is being linted by the report with match: [*,female]?

2. Missing CRUD operations

One might propose a "duplicate variant for matcher" lint rule to combat the problem above. Unfortunately, no inlang app will be able to delete a duplicate that a user selected. The duplicate variants can not be uniquely identified. The message id, matcher, and language are identical.

message: {
  variants: [
      // WHICH VARIANT SHOULD BE DELETED?
-    { match: [*, female], pattern: "" },
-    { match: [*, female], pattern: "" },
  ]
}

3. Merge conflicts

If users create a variant for the same language and matcher, we can't resolve the merge conflict for the user because both variants are "identical".

Proposal

Add UUIDs to variants for unique identification.

Pros

  • no merge conflicts (!)
  • CRUD ops on variants become possible
  • a lint rule like "duplicate variants for the same matcher" becomes possible

Cons

  • N/A
message: {
  variants: [
      { 
+       id: 93j872kkks9328hs, 
        match: [*, female], 
        pattern: "" 
      },
     { 
+       id: k290s828nsllleeee, 
        match: [*, female], 
        pattern: "" 
      },
  ]
}

Why do no merge conflicts occur?

Two distributed creations of a variant with the same language tag and matcher will have two distinct UUIDs. We can merge the variants by UUID. Users can afterward filter out duplicated with a "duplicate variant" lint rule.

Tasks

  • [ ] change message lint rules to report via only message id or message id and variant id
  • [ ] add id to variants

samuelstroschein avatar Feb 06 '24 22:02 samuelstroschein

cc @martin-lysk ids seem necessary. merge conflicts can not be automatically resolved without ids and neither can lint rules that report duplicates work reliably.

samuelstroschein avatar Feb 06 '24 22:02 samuelstroschein

Having a lint rule here doesn't help the user of the sdk. I propose to make your example of duplicate variants a parse time error (like a duplicate property name in an object literate would lead to a json parse error).

This makes the variants uniquely identifiable by its matches (even if they don’t exist! - usefull for lints).

So if you end up with a conflict on a variant you could: A) show versions next to each other and choose how to resolve the conflict like a git merge conflict or B) make the last one win. the merge will be visible in the history anyway

Reasoning: While message format does not state this in the syntax spec:

A well-formed message is considered valid if the following requirements are satisfied:

The number of keys on each variant must be equal to the number of selectors. At least one variant's keys must all be equal to the catch-all key (*).

For the pattern selection (compare ICU) to work - matches need to be sortable by its matches and two equal matches can't get sorted in a predictable way.

A selector need end up with one variant eventually also for our SDK users (ui in fink / parrot/ cli / paraglide...)

  • How does the machine translate should handle the case when the counterpart of a missing variant in the ref language retrurns two messages?
  • how should fink display multiple variant with the same matcher?
  • how should parrot know what variant to switch to if it finds multiple?
  • how should paraglide compile?

A good argument for the id is the fact that it can merge. While this is technically a valid point - I argue that for our usecase its not right choice and not even wanted.

From the usecase of a message everything below locale/language tag should be atomic a and merge conflicts should be solved by choosing one over the other.

While multiple translators may work in parallel on different languages in the same message without directly conflicting with each other - changing one variant within one language usually affect the other variants.

Example 1: Adding a selector changes all variants:

message: {
selectors: [],
  variants: [
    { match: [], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
  ]
}
message: {
selectors: ["n_messages"],
  variants: [
-     { match: [], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
+    { match: ["*"], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
+    { match: ["1"], "languageTag": "de", pattern: "Sie haben eine neue Nachricht" },
  ]
}

Example 2: A Change in the pattern from message to email changes both variants

message: {
selectors: ["n_messages"],
  variants: [
-    { match: ["*"], "languageTag": "de", pattern: "Sie haben {n_messages} neue  Nachrichten" },
+    { match: ["*"], "languageTag": "de",pattern: "Sie haben {n_messages} neue Emails" },
-    { match: ["1"], "languageTag": "de", pattern: "Sie haben eine neue Nachricht" },
+    { match: ["1"], "languageTag": "de", pattern: "Sie haben eine neue Email" },
  ]
}

Important to note here: The identity mutates when a new selector is introduced - but that’s a good thing when we think about the lint rules. If you „mute“ a lint rule, like a missing variant (because you don't want to gender in a specific variant and a fallback is ok). You are actually not muting on an existing variant but on a selector that exists in the ref variant and does not exist in the target language.

German (ref)

message: {
 selectors: ["form", "n_messages"],
  variants: [
    { match: ["familiar", "*"], "languageTag": "de", pattern: "Du hast {n_messages} neue Nachrichten" },
    { match: ["familiar", 1"], "languageTag": "de", pattern: "Du hast eine neue Nachricht" },
    { match: ["*", "*"], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
    { match: ["*", "1"], "languageTag": "de", pattern: "Sie haben eine neue Nachricht" },

    { match: ["*", "*"], "languageTag": "en", pattern: "You have {n_messages} new messages" },
    { match: ["*", "1"], "languageTag": "en", pattern: "You have a new message" },
  ],
  mutedLints: [
      { ruleId: "missing_variant", match: ["familiar", "*"], reason: "not needed in english", mutedBy: "userID", mutedAt: .. },
      { ruleId: "missing_variant", match: ["familiar", "1"], reason: "not needed in english", mutedBy: "userID", mutedAt: .. },
  ]
}

Usually a change on the matches of a message change the meaning of the patterns in a way, that the reason to mute a lint rule in the state before doesn't hold anymore.

Regarding the lint rule function - I would change the api here and discussed this some time ago with @NiklasBuchfink already. Instead of having a huge report for all messages i would turn the api around and ask for lints by message instead - this would also simplify the dx regarding matches here - but this is out of scope of this ticket.

martin-lysk avatar Feb 10 '24 20:02 martin-lysk

@martin-lysk I propose to make your example of duplicate variants a parse time error

Clarification is sought on how users should fix duplicate variants in inlang apps when the SDK yields a parsing error. On the contrary, the linting system provides an existing primitive to fix broken messages that every inlang app supports.

Downroad issues like "how should machine translate behave if a duplicate variant is matched", "paraglide compile" can throw individually or naively match the first match. In any case, give the user the visibility and possibility to fix the issue themselves before something goes wrong. A strong lint like "hey, this will cause bugs" will not leave the user surprised by an error that throws later like "machine translate couldn't be executed due to ambiguous variant [male, *, 4]".

The UI in fink etc is not an issue. The project is fine. Multiple messages are displayed with two variants having a linting error.

From the usecase of a message everything below locale/language tag should be atomic a and merge conflicts should be solved by choosing one over the other.

Yes, lint rules. A user will see the problematic variants via lint rules and be prompted to decide for one.

Another model of resolving such a conflict will require us to build a new primitive that will ultimately achieve the same thing of removing one duplicate. Besides achieving the same thing, I expect that lint rules will provide a better UX: They don't force the user to make an immediate decision once the issue appears aka "your project is broken now, you need to fix it now". No, the project is not broken. Your messages have a lint error that you should fix to avoid downroad problems. You can fix them now, in 10 minutes, tomorrow, or just before you deploy the new version of your app (just like you do with every other lint rule).

Conclusion

  • The linting approach seems superior due to UX and simplicity to implement.
  • We should ship a duplicate variant lint rule by default with new projects.
  • The introduction of IDs for variants will make CRUD ops easier.
  • @martin-lysk if you disagree, please elaborate how the UX of resolving duplicate variants can be reasonable implemented and leads to a better UX than using the known concept of lint rules

Offtopic, no reply needed

I would change the api here and discussed this some time ago with @NiklasBuchfink already.

Please cc me in the issue. I am not sure what you are proposing here. But that does not need to be discussed in this issue.

samuelstroschein avatar Feb 13 '24 02:02 samuelstroschein

I still have some reservation but it's more a gutt feeling for now. Lets agree on the ids.

On thing that comes up with adding an id here:

The question if the order of the variant matter came up some time in the past and more recently also by @jldec Since we have an Id now we could clearify that the order doesn't matter and introduce a map variantId->variant instead of an array like this:


message: {
 selectors: ["form", "n_messages"],
  variants: {
    "3624183d-581a-45bc-9d3d-cafeddaf8585":  { match: ["familiar", "*"], "languageTag": "de", pattern: "Du hast {n_messages} neue Nachrichten" },
    "4b4685c9-f163-4694-ae8c-4b83402a293c":  { match: ["familiar", 1"], "languageTag": "de", pattern: "Du hast eine neue Nachricht" },
    "7a14d8e4-d870-4f33-bfb3-f4337b756e18":  { match: ["*", "*"], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
    { match: ["*", "1"], "languageTag": "de", pattern: "Sie haben eine neue Nachricht" },

    { match: ["*", "*"], "languageTag": "en", pattern: "You have {n_messages} new messages" },
    { match: ["*", "1"], "languageTag": "en", pattern: "You have a new message" },
  ]
}

We could go one step further an give a language its own property:

message: {
selectors: ["form", "n_messages"],
variants: {
    "de" : {
        "3624183d-581a-45bc-9d3d-cafeddaf8585":  { match: ["familiar", "*"], "languageTag": "de", pattern: "Du hast {n_messages} neue Nachrichten" },
          "4b4685c9-f163-4694-ae8c-4b83402a293c":  { match: ["familiar", 1"], "languageTag": "de", pattern: "Du hast eine neue Nachricht" },
          "7a14d8e4-d870-4f33-bfb3-f4337b756e18":  { match: ["*", "*"], "languageTag": "de", pattern: "Sie haben {n_messages} neue Nachrichten" },
    }
    "en": {
          "3624183d-581a-45bc-9d3d-cafeddaf8585":     { match: ["*", "*"], "languageTag": "en", pattern: "You have {n_messages} new messages" },
                   "3124183d-581a-45bc-9d3d-cafeddaf8585": { match: ["*", "1"], "languageTag": "en", pattern: "You have a new message" },
     }



  }
}

Lookups won't have to iterate though the object to access a varaint when the id is known. (while premature optimization) The complexity of the lookups of varaints by language for lint rules for example would fold n times (while n is the number of language tags in a project)...

Just thinking out loud

martin-lysk avatar Feb 13 '24 13:02 martin-lysk

@martin-lysk don't mix types and performance optimization for queries/storage ⚠️

We had a nesting of objects under language tags but refactored it.

The code ergonomics were bad, and a potential migration to a relational DB would be hard. The data structures were not isolated, breaking the mental model of "A message has variants. A variant has a language tag, matcher, and pattern means that devs want to operate on variants directly (my reasoning why IDs should be introduced). If a variant has a dependency on the message type, operating on variants is convoluted.

Having separate types with no nesting aligns with the mental model and desired usage

console.log(`The language tag is {variant.languageTag}`)

await query.variant.delete(variant.id)

Nesting different data entities among each other leads to horrible ergonomics


const potentialLanguageTags = Object.entries(message.variants).filter(([languageTag, variants]) 
   => variants.some((v) => v.id === variant.id))

if (potentialLanguageTags.length > 1){
  throw Error("Couldn't identify the language tag of the variant")
}

console.log(`The language tag is {Object.keys(potentialLanguageTags)[0]}`)

// this code might also not work because the id potentially needs to be retrieved from the message too!
await query.variant.delete(variant.id)

Conclusion

@martin-lysk Do we agree on:

  1. The linting system is a good way to deal with duplicate variants?
  2. Not changing the data structure (and mix entities which each other in nested objects)?
  3. Introducing ids as proposed (to leverage the linting system and ease CRUD ops)?

samuelstroschein avatar Feb 13 '24 14:02 samuelstroschein

@martin-lysk don't mix types and performance optimization for queries/storage ⚠️ sure - but in case of the sdk those are currently the same

potential migration to a relational DB would be hard. Would be important to understand that better for future design decisions. could you draw the table structure you would have in mind atm?

We could also denormalize the property and make the id's avaialble in the variant type as well (compare the json you se in my previous comment).

  1. The linting system is a good way to deal with duplicate variants?

ok with that

  1. Not changing the data structure (and mix entities which each other in nested objects)?

See my comment above - but I am ok with keeping the structure for now and close the discussion for now

  1. Introducing ids as proposed (to leverage the linting system and ease CRUD ops)?

👍

martin-lysk avatar Feb 13 '24 16:02 martin-lysk

I take that we agree.

Another (killer) argument for the linting system is the possibility of adopting inlang for existing projects. We do not have a guarantee that existing projects won't have duplicate variants. Trying to adopt inlang for an existing project with duplicate variants would be a pain with a parsing error. All parsing errors must be resolved before an inlang project provides value. The linting system on the other hand enables adoption "and fix it later".

samuelstroschein avatar Feb 13 '24 19:02 samuelstroschein

closing this as ids for variants will come

samuelstroschein avatar May 21 '24 20:05 samuelstroschein