Tag posts with Language

Open Nutomic opened this issue 4 years ago • 9 comments

Similar to the way Peertube does it, post creation could have a language dropdown. That would allow users to filter by language and only see the posts they can understand. This would make it easier to have multiple languages in the same server/community.

Reddit doesn't support this at all, so users of other languages have to go to their own subreddits, and all the big ones are completely dominated by English. Honestly, that's what you'd expect from an American company when it comes to language support, so I think we can do much better.

Edit: relevant section in activitypub spec

Nutomic avatar Jan 18 '20 21:01 Nutomic

I'm not saying no to this, I'm still open to it, but here are my concerns:

  • Lemmy, unlike the Twitter / person-follow variants, already has the concept of communities, so it's very likely that non-English communities would work just like on Reddit: communities would implicitly be in a certain language. If you don't speak that language, you probably won't subscribe to that community, and thus it won't clutter up your feed. I.e., there is already a language selection in place, via community subscriptions.
    • There's also the possibility that entire instances will be in different languages, and the same applies: you won't subscribe or sign up unless you speak that language.
  • It would require users now to explicitly list the languages they speak, and whether they want to block from seeing languages they don't speak. Seems overly complicated compared to just unsubbing from ones they don't speak.
  • Multi-lingual things. On some cooler reddit threads, I've seen multiple languages used in a single post.
  • It would require posts to now provide that language, which to me just seems wrong... if I'm writing in Spanish, why do I need to state that language? Those who care will see that it's Spanish.
    • If we apply language standards to posts, why not communities, and comments, which are arguably just as important as posts. If we did implement something like this, I'd rather do it at a community level, rather than the post level.

dessalines avatar Jan 18 '20 21:01 dessalines

Lemmy, unlike the Twitter / person-follow variants, already has the concept of communities, so it's very likely that non-English communities would work just like on Reddit: communities would implicitly be in a certain language. If you don't speak that language, you probably won't subscribe to that community, and thus it won't clutter up your feed. I.e., there is already a language selection in place, via community subscriptions.

I don't think that's good enough. Keep in mind that most people in the world speak more than one language, and there are a lot of regions where more than one language is spoken. As a concrete example, I live in the Basque Country, where Basque and Spanish are both official languages and both are spoken by a lot of people. Making separate Lemmy communities makes little sense in that context, and splits the userbase for no good reason.

There's also the possibility that entire instances will be in different languages, and the same applies: you won't subscribe or sign up unless you speak that language.

Sure, but we will also federate with those instances, and then we need a mechanism to hide those posts from users who don't speak the language.

It would require users now to explicitly list the languages they speak, and whether they want to block from seeing languages they don't speak. Seems overly complicated compared to just unsubbing from ones they don't speak.

Seems easy enough on Peertube

Multi-lingual things. On some cooler reddit threads, I've seen multiple languages used in a single post.

Then we can also add a language to comments. Or even auto-detect the language (Mastodon does that afaik).

It would require posts to now provide that language, which to me just seems wrong... if I'm writing in Spanish, why do I need to state that language? Those who care will see that it's Spanish.

As above, we could use some automated tool to assign a language to every post (considering the languages the user has in their profile).

If we apply language standards to posts, why not communities, and comments, which are arguably just as important as posts. If we did implement something like this, I'd rather do it at a community level, rather than the post level.

Agreed, we will probably need it on all levels at some point. But we can start with a more basic implementation.

Nutomic avatar Jan 18 '20 21:01 Nutomic

By the way, we're running peertube.social with multiple languages and it's working fine. We have a couple of mods who speak different languages, and when a language isn't covered by any of them, I've had good experiences asking people on Mastodon for moderation advice.

Nutomic avatar Jan 18 '20 21:01 Nutomic

As a non-native English speaker who mostly consumes English-language content, I would benefit from some kind of option to at least filter content by language.

For instance, right now I see a lot of Spanish posts on Lemmy's DEV instance, and there is no way for me to understand any of them. Some kind of mechanism to filter those posts out would be helpful.

Tagging a post with a language (or multiple languages) has another benefit compared to language-specific communities: it gives us more control over things like easier translation and better metadata about which language a given post is in.

This wouldn't prevent communities from being started that are focused on a particular language.

richardj avatar Jan 19 '20 10:01 richardj

One thing that would make me happy here, is if the language detection of content was automatic... maybe using something like this: https://github.com/wooorm/franc

But that also scares me, because the language detection should be on the back end, since any future clients wouldn't necessarily be written in JavaScript.

Even aside from that, it means that every major content table now needs a bridge table (since a single post / comment can have multiple languages), like:

post_language : post_id, lang_id
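
A rough sketch of how rows from such a bridge table might be handled in Rust, purely for illustration. The `PostLanguage` struct and the `languages_by_post` helper are hypothetical names, not part of Lemmy's code, and the sketch works on in-memory rows rather than a real database query:

```rust
use std::collections::HashMap;

/// One row of the hypothetical `post_language` bridge table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PostLanguage {
    pub post_id: i32,
    pub lang_id: i32,
}

/// Groups bridge rows by post, so each post id maps to every
/// language id it is tagged with (in row order).
pub fn languages_by_post(rows: &[PostLanguage]) -> HashMap<i32, Vec<i32>> {
    let mut map: HashMap<i32, Vec<i32>> = HashMap::new();
    for row in rows {
        map.entry(row.post_id).or_default().push(row.lang_id);
    }
    map
}
```

A post with two rows in the table ends up with two language ids, which is the many-to-many case the bridge table exists for.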

dessalines avatar Jan 19 '20 14:01 dessalines

Language detection should definitely be on the backend; anything else doesn't make much sense. I found this Rust library which looks good. We can also give it a blacklist/whitelist of languages based on the user profile, which should make detection more accurate.

https://github.com/greyblake/whatlang-rs

I don't think we need to support multiple languages in a single post/comment, at least not for now.

Nutomic avatar Jan 19 '20 15:01 Nutomic

I'd be okay with implementing this, at least on the principle of things only being tagged with one language. Some things I could see this needing:

  • [ ] Test out whatlang-rs, make sure it works decently with a variety of sentences. Choose a confidence threshold below which the language stays unknown (the default). Maybe 70% certainty.
  • [ ] Create a language table, with a list of supported languages, as well as unknown.
  • [ ] Create a DB migration adding a language column to the community, post, and comment tables, defaulting that column to unknown.
  • [ ] Create a one-off job to run whatlang on the current content of those tables to set the language.
  • [ ] Alter the API to identify the language and add that to the insert / update statements.
  • [ ] Add a user_language table for the languages a user speaks. Add this as a user setting on the front end, where a user selects their languages.
  • [ ] Alter all view fetches to inner join with their chosen languages, as well as unknown.
  • [ ] Do extensive testing.
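
To make two of the steps above concrete, here is a minimal sketch in plain Rust, with no database and no detection library. It assumes a detector hands back a guess plus a confidence between 0 and 1; the `Language` enum, `MIN_CONFIDENCE`, `tag_from_detection`, and `visible_posts` are all made-up names for illustration, and the in-memory filter only stands in for what the inner join would do in SQL:

```rust
/// Hypothetical language tag; `Unknown` is the default from the checklist.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Unknown,
    English,
    Spanish,
    Basque,
}

/// The "maybe 70%" figure from the checklist; an assumption to be tuned by testing.
pub const MIN_CONFIDENCE: f64 = 0.7;

/// Keeps a detector's guess only when its confidence reaches the threshold;
/// otherwise the content stays tagged as `Unknown`.
pub fn tag_from_detection(guess: Language, confidence: f64) -> Language {
    if confidence >= MIN_CONFIDENCE {
        guess
    } else {
        Language::Unknown
    }
}

/// Simplified stand-in for a row of the post table with its language column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Post {
    pub id: i32,
    pub language: Language,
}

/// Returns the posts a user should see: those in one of the user's
/// languages, plus anything still tagged `Unknown`.
pub fn visible_posts<'a>(posts: &'a [Post], user_languages: &[Language]) -> Vec<&'a Post> {
    posts
        .iter()
        .filter(|p| p.language == Language::Unknown || user_languages.contains(&p.language))
        .collect()
}
```

Always letting `Unknown` through means content that detection couldn't classify is never silently hidden, which seems like the safer default.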

dessalines avatar Jan 27 '20 17:01 dessalines

Here is a library which might be helpful (in addition to manual language selection).

https://github.com/pemistahl/lingua-rs

Nutomic avatar Nov 16 '20 22:11 Nutomic

I updated the first comment with details on how to implement this. In fact it's not that complicated; it mainly needs some changes in the database code and in the frontend. I wouldn't use automated language detection for the initial version, because in most cases we can already guess the language based on the parent post/comment language (or remember the last selection for posts).
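
As a rough illustration of that guessing rule (not actual Lemmy code; `default_language` and its string language codes are made up for the example), the default could be picked like this:

```rust
/// Picks a default language tag without any automated detection.
/// A reply inherits the language of its parent post/comment if one is set;
/// otherwise the user's last selection is reused; otherwise there is no default.
pub fn default_language(
    parent_language: Option<&str>,
    last_selected: Option<&str>,
) -> Option<String> {
    parent_language.or(last_selected).map(str::to_owned)
}
```

For a new top-level post there is no parent, so `parent_language` would be `None` and the remembered selection wins. The user can still override the default manually in either case.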

Nutomic avatar Mar 29 '21 11:03 Nutomic