monorepo SDK persistence of messages in project directory

Context

We need persistence of messages before starting #1585. The storage format will influence the importers and exporters.

Scope

Test different storage persistence formats. Does splitting messages

[ ] implement storage persistence in SDK
- [ ] test persisting files split by message and language
- [ ] test persisting files split by message only
[ ] "degrade" saveMessages and loadMessages to act as "importers" (until #1585 is implemented)

1. loadProject "loads" message from project directory
2. IF loadMessages API defined -> [NOT BLOCK UI] loadProject calls `loadMessages` -> write messages to project directory
3. CRUD operations happen on project directory
4. on CRUD operations, `saveMessages` is called to "export" messages
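
The four steps above could be sketched roughly like this. It is a minimal in-memory sketch, not the SDK implementation: `ProjectDirectory`, `createProject`, `upsert`, and the simplified `Message` shape are assumptions for illustration; only `loadMessages` and `saveMessages` are names from this issue.

```typescript
// Hypothetical, simplified message shape (not the SDK's real Message type).
type Message = { id: string; variants: Record<string, string> };

// Stand-in for the project directory: message id -> serialized file content.
type ProjectDirectory = Map<string, string>;

// Optional "importer"/"exporter" hooks, named after the APIs in this issue.
type Plugin = {
  loadMessages?: () => Promise<Message[]>;
  saveMessages?: (messages: Message[]) => void;
};

function createProject(dir: ProjectDirectory, plugin: Plugin) {
  // Step 1: messages are read from the project directory.
  const messages = (): Message[] =>
    Array.from(dir.values()).map((content) => JSON.parse(content) as Message);

  // Steps 3 and 4: CRUD happens on the directory, then messages are "exported".
  const upsert = (message: Message): void => {
    dir.set(message.id, JSON.stringify(message));
    plugin.saveMessages?.(messages());
  };

  // Step 2: import in the background so the caller (UI) is not blocked.
  const imported: Promise<void> = plugin.loadMessages
    ? plugin.loadMessages().then((loaded) => {
        for (const m of loaded) dir.set(m.id, JSON.stringify(m));
      })
    : Promise.resolve();

  return { messages, upsert, imported };
}
```

In this sketch, messages already in the directory are readable right away, while `imported` resolves once the `loadMessages` result has been written to the directory.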

Dec 07 '23 16:12 samuelstroschein

Most interesting: Does having 60,000 message files work?

git push
pull requests
initial import of a message (is the diff too large?)
will users raise concerns about the approach?

Dec 07 '23 16:12 samuelstroschein

@martin-lysk interesting observation. The per-file splitting for version control would be great, as can be seen in https://www.loom.com/share/6636f233340340d9b7f5e237c922592f. I was able to reconstruct how a message changed based on the diff, given that the format is human-readable.

Maybe a human-readable format would also be good for persisting our messages to enable easier diffing of messages (until we have semantic diffing and diffing visualization in @inlang/lix; @inlang/lix are pings like this one helpful to provide information where requirements are coming from?)

Dec 07 '23 22:12 samuelstroschein

We should push for the one file per message approach to manifest our commitment to Lix.

I had more thoughts overnight. If we have a one-file-per-message approach, we can directly use lix for change control and do not need to build versioning features on inlang's end. Sure, the change control approach is naive with one file per message, but subsequent change control improvements are pushed to Lix's end: "Hey lix, we need better versioning/performance/semantic meaning/whatever." instead of building workarounds in inlang.

Dec 08 '23 15:12 samuelstroschein

@samuelstroschein yes thanks, super helpful! This looks like really great progress. please plan this in a way that we can limit the number of files to 1000 per folder long term. No need to support this straight away, but keep in mind that once we support huge projects we need a folder prefix that limits the files per folder to ~1000. @martin-lysk I changed my mind regarding the namespace folder structure we talked about earlier; limiting the number of files per folder would be a nice side effect of that structure, correct?

Dec 11 '23 23:12 janfjohannes

@janfjohannes please plan this in a way that we can limit the number of files to 1000 per folder long term

Where is this limitation coming from?

@martin-lysk and I want to test the upper limit of files in git. I think it is in the ballpark of 10,000 files (messages per language) per folder, though. That is 10x higher than the 1,000 files limit you mention.

Dec 11 '23 23:12 samuelstroschein

@samuelstroschein the limit comes from most filesystem implementations on OSes: folder performance takes a hit at around 1000 to 2000 files per folder, which is why git's internal folders have the format /folder/hashprefix/hash instead of just dumping all hashed files into one big folder.

Dec 12 '23 12:12 janfjohannes

Should filenames represent the message name or the message id? - yes we have this question again ;-)

Message id and message name are currently the same. @samuelstroschein suggested giving the files no descriptive name, to signal that one should not touch them, and using the message id as the filename. A lint could be written that disallows duplicate message "names".

This would require

to introduce a message "name" as an additional property of the message that stores the names imported by importers/exporters such as JSON, Android, or iOS
to change the SDK API or introduce a query by name that could return multiple messages
to check all places where we use the id, whether they mean the id or the name of a message (e.g. the header column in the editor)
to be able to explain why multiple messages with the same name can exist
a lint rule that fires when we find multiple messages with the same name

How should the sdk handle corrupt message files?

The current system imports all messages or fails. Now the messages get loaded one by one, so multiple messages could fail. Since we watch and react to changes in the file system, errors could even appear after the initial load.

how should we store the messages - what format? We currently store the messages as JSON; we could think of a more git-friendly format than JSON. Are those files meant to never be touched by hand?

how do we split the messages - currently not split by language?

Dec 12 '23 20:12 martin-lysk

Should filenames represent the message name or the message id? - yes we have this question again ;-)

Message id. We will/need to introduce keys/names as part of #1585. This question is therefore unrelated to this issue.

How should the sdk handle corrupt message files?

Push a CorruptMessageError to the project.errors that can be displayed by apps.
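
A minimal sketch of that idea, assuming message files are loaded one by one and collected into something like `project.errors`. Only the name `CorruptMessageError` comes from this comment; the class fields, `loadMessageFiles`, and the simplified `Message` shape are hypothetical.

```typescript
// Hypothetical error class; only its name is taken from this thread.
class CorruptMessageError extends Error {
  path: string;
  constructor(path: string, cause: unknown) {
    super(`Message file "${path}" could not be parsed: ${String(cause)}`);
    this.name = "CorruptMessageError";
    this.path = path;
  }
}

// Hypothetical, simplified message shape.
type Message = { id: string; variants: Record<string, string> };

// Loads files one by one: a file that is not valid JSON produces an error
// entry instead of aborting the whole load. (Schema validation is omitted.)
function loadMessageFiles(files: Map<string, string>) {
  const messages: Message[] = [];
  const errors: CorruptMessageError[] = [];
  files.forEach((content, path) => {
    try {
      messages.push(JSON.parse(content) as Message);
    } catch (cause) {
      errors.push(new CorruptMessageError(path, cause));
    }
  });
  return { messages, errors };
}
```

Apps could then render the `errors` array while still working with the messages that did load.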

how should we store the messages - what format?

Let's go for JSON. Increases interop in the JS ecosystem, the message format will expand based on JS objects (perfect for JSON) and doesn't require a special parser. We will surely not make the 100% perfect choice. Usage will show in the future what requirements we have.

Dec 13 '23 15:12 samuelstroschein

@martin-lysk thoughts after our call:

if we can introduce "aliases" in a simple manner (https://github.com/inlang/monorepo/issues/1585#issuecomment-1860875401), we can gain control over ids and persistence.
if we have control over ids, a lot of follow up problems and migrations are avoided.

Alias proposal

I proposed an alias structure in https://github.com/inlang/monorepo/issues/1585#issuecomment-1860875401 to separate the alias and id/persistence discussion.

Directory structure thoughts

We know how the directories are structured. I would not assume that we need a lot of stat calls and would optimize for never exceeding 1000 files in a directory.
I would split by language to avoid merge conflicts when one language is modified.

- project.inlang
 - messages
   - [part1]
     - [part2]
       - [part3]
         - en.json
         - de.json
   - human
     - laptop
       - sidewalk
         - en.json
         - de.json
       - bird
         - en.json
         - de.json
     - sky
       - food
         - en.json
         - de.json

Dec 18 '23 15:12 samuelstroschein

I have not given it a try, but the following reasons make me doubt we should go down this path now.

TL;DR

git will not perform as smoothly in this structure in the browser as the current setup (saw the reason in the indexStatus handling)
This would be harder to implement in the sdk atm
some fs's may not like so many files (the current implementation of Parrot's Figma fs is one example ;-))
we can solve the merging problem with one file per message as well

git will not perform as smoothly in this structure in the browser as the current setup (saw the reason in the indexStatus handling)

A huge portion of the messages will only have one variant. In most cases the size of the tree entry plus the blob hash would exceed the actual information within the file.
Git stores one entry per tree (folder). This means a lot of overhead when a change takes place in multiple files, since we need to propagate the change up the tree and, the other way around, traverse down the tree to check for changes. The key is a good distribution of files per folder. Rule of thumb: not more than 1000, but as many as possible.
The git index is held per file. Imports and migrations, like the initial import, will be painful: while the current setup is quick for 2200 files, multiply this by 32 languages and it will become a problem, I bet.

This would be harder to implement in the sdk atm

The SDK API doesn't organize by language. A check whether a message is unchanged is currently naively implemented as a JSON.stringify comparison. Splitting by language would be more effort.
a deletion event on a file would mean that all variants of one language are gone, not that a message was deleted
we would need to make a languageMessage first class object in the sdk instead o
TL;DR too many files: 2100 (one per message) compared to 67200 if 32 languages are configured
(Figma's setPluginData performance decreases with the number of keys; 5k are still OK, but beyond that it becomes a bottleneck)

some fs's may not like so many files

The performance of getPluginData and setPluginData decreases with every key. This is not a problem for < 10000 keys, but beyond that it becomes slow.

we can solve the merging problem with one file per message as well

I agree with making the language atomic, and a file per language would be a logical step, but we could also achieve this by taking a look at the storage format. If we dedicate a section per language, like in YAML instead of JSON, git's line-based merge will have no problem automatically resolving parallel edits.

I would therefore suggest we start off with:

- project.inlang
 - messages
   - [part1]
     - [part2]_[part3].json
   - human
     - laptop_sidewalk.json
     - laptop_bird.json
     - sky_food.json

This will allow us to save up to 300,000 messages without hitting the rule-of-thumb limit of 1000 files per folder. I guess if we have to manage 300k messages in one project, we have another topic ;-). With the 200 keys in the first part of a human-id we would have –8 messages stored per folder in a project like cal.com

Dec 18 '23 18:12 martin-lysk

I do not see profound arguments against language splitting in storage.

The arguments you mention regarding git are reasonable, but a) we need to hit a large scale until this becomes a problem b) @inlang/lix should fix those issues. I'd rather build the right thing in inlang and push infra problems to lix than building wrong abstractions in inlang and never get lix to fix those issues.

Furthermore, the SDK does not need to change the runtime type of a message. A Message can stay as is. The persistor in the SDK takes care of splitting and merging languages. I don't expect splitting and merging of languages to leak outside the persistor.
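
To illustrate what "splitting and merging languages" inside the persistor could look like, here is a rough sketch. The `Variant` and `Message` shapes are simplified assumptions, and `splitByLanguage` and `mergeLanguages` are hypothetical helper names, not existing SDK functions.

```typescript
// Hypothetical, simplified shapes; the SDK's real Message type differs.
type Variant = { languageTag: string; pattern: string };
type Message = { id: string; variants: Variant[] };

// Split one runtime message into per-language chunks (e.g. en.json, de.json).
function splitByLanguage(message: Message): Record<string, Variant[]> {
  const byLanguage: Record<string, Variant[]> = {};
  for (const variant of message.variants) {
    const list = byLanguage[variant.languageTag] ?? [];
    list.push(variant);
    byLanguage[variant.languageTag] = list;
  }
  return byLanguage;
}

// Merge the per-language chunks back into the unchanged runtime type.
function mergeLanguages(id: string, byLanguage: Record<string, Variant[]>): Message {
  const variants: Variant[] = [];
  // Sort language tags so the merged result is deterministic.
  for (const languageTag of Object.keys(byLanguage).sort()) {
    variants.push(...(byLanguage[languageTag] ?? []));
  }
  return { id, variants };
}
```

Callers outside the persistor would only ever see `Message`; the per-language record exists solely at the storage boundary.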

Pros

lowers risk for merge conflicts
easy version history per language
push infra problems to @inlang/lix and help lix build the right thing

Cons

persistor logic becomes more complicated because it needs to split and merge languages
maybe short-term infra issues until they are fixed in lix

Dec 18 '23 19:12 samuelstroschein

@martin-lysk summary of the call:

go with your flatter structure but have a v1 at the root to ease migrations

- project.inlang
 - messages
   - v1
     - [part1]
       - [part2_part3].json
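
As a rough illustration of the structure above, a message id could be mapped to its file path like this. The sketch assumes ids consist of exactly three words joined by "_" (e.g. "human_laptop_sidewalk"); that separator and the helper name `messageFilePath` are assumptions, not decided parts of the design.

```typescript
// Assumption: ids have the form part1_part2_part3, e.g. "human_laptop_sidewalk".
function messageFilePath(id: string, version = "v1"): string {
  const parts = id.split("_");
  if (parts.length !== 3 || parts.some((part) => part.length === 0)) {
    throw new Error(`Expected an id of the form part1_part2_part3, got "${id}"`);
  }
  const [part1, part2, part3] = parts;
  // part1 becomes the folder; part2 and part3 form the file name.
  return `project.inlang/messages/${version}/${part1}/${part2}_${part3}.json`;
}
```

Keeping the version segment as a parameter means a later storage layout could live next to `v1` during a migration.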

@inlang/ide-extension and @inlang/editor must show aliases in the UI

Dec 18 '23 23:12 samuelstroschein

The @inlang/ide-extension will not have any problems displaying the alias, we just have to find an additional layer of editing and displaying. The requirement for a full-fledged editor in the @inlang/ide-extension is now even more pressing.

Dec 19 '23 09:12 felixhaeberle

The "Alias" property must also be editable and will probably be used as description text in many cases.

Dec 19 '23 09:12 NiklasBuchfink

@martin-lysk @LorisSigrist Created the issue https://github.com/inlang/monorepo/issues/1920#issue-2048843732

Dec 19 '23 15:12 samuelstroschein

notes of call with @martin-lysk and @felixhaeberle:

introduce "alias.default" to avoid multiple aliases that are all identical
do NOT introduce name property to reduce number of concepts from 3 (id, name, alias) to 2 (id, alias)
documentation needs to educate that "aliases" are legacy for existing projects AND SHOULD NOT be used for new projects
introduce getMessageByAlias() query function. good job for @jldec
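
A possible shape for such a query, as a sketch only. It assumes aliases are stored as a record keyed by app, with "default" as the fallback, and it returns an array because aliases are not guaranteed to be unique; neither assumption is confirmed in these notes, and the signature is hypothetical.

```typescript
// Hypothetical shape: `alias` maps an app key to an alias,
// with "default" as the fallback that works across apps.
type Message = { id: string; alias: Record<string, string> };

// Returns every message whose alias (app-specific if present, else default)
// equals the requested alias. Several messages may share one alias.
function getMessageByAlias(messages: Message[], alias: string, app?: string): Message[] {
  return messages.filter((message) => {
    const appAlias = app !== undefined ? message.alias[app] : undefined;
    const candidate = appAlias ?? message.alias["default"];
    return candidate === alias;
  });
}
```

A lint rule for duplicate aliases could be built on the same query by flagging any alias that returns more than one message.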

Jan 16 '24 19:01 samuelstroschein

What is behind the term alias.default? That people always need to define an alias? I guess we want to document the change in the SDK docs, right? But we also need to think about the message extraction/creation processes. The people will probably not look into the docs for that.

Jan 17 '24 07:01 NilsJacobsen

@NilsJacobsen no, people will not need to define a default alias. But if they do use aliases, they can choose "a default alias" that works across multiple apps.

But we also need to think about the message extraction/creation processes. The people will probably not look into the docs for that.

Messages are extracted with no alias.

Aliases only exist for legacy projects. The docs should contain a warning: "only use aliases for existing projects. Using aliases is flawed for reason 1,2,3"

Jan 17 '24 14:01 samuelstroschein

Is there a time frame for this implementation? I don't want to put pressure on you; I think that needs to be done well, but I wondered if you have a time frame in mind. I don't want to be the bottleneck with manage create, but I also don't want to start too early. @martin-lysk @jldec

Jan 22 '24 09:01 NilsJacobsen

Is there a time frame for this implementation? I don't want to put pressure on you; I think that needs to be done well, but I wondered if you have a time frame in mind. I don't want to be the bottleneck with manage create, but I also don't want to start too early. @martin-lysk @jldec

Great that you ask, @NilsJacobsen. Quick update on this:

After a chat with @janfjohannes and a closer look at Lix, it seems like this will no longer block or be blocked by lazy fetching, since batch fetching will come in the first version. FYI @samuelstroschein

With a closer look at the VS-Code extension together with Felix last week we found some additional requirements:

Add support for aliases to the VS Code extension @felixhaeberle, which also needs a way to query by alias, an additional task for the SDK @jldec - shall we discuss how we move forward with this?
Check how the system behaves with multiple sdk's pointing to the same inlang project, multiple processes watching on the same inlang project folder.

Beyond the VS Code specific things, it boils down to smaller tasks that I plan to tackle this week:

[ ] Update word list from the spreadsheet
[ ] test and fix implementation on Windows
[ ] sort output before entering save plugin

Jan 22 '24 14:01 martin-lysk