
Support setting language on posts

Open chdorner opened this issue 1 year ago • 9 comments

I think this is everything that's needed to implement Mastodon's post language feature.

  • New identity config setting for the preferred language, used when creating local posts whose request has no language key.
  • Support in the Mastodon API, both when creating or updating a post and wherever posts are rendered
  • Support for ActivityPub federation via contentMap
  • Help for screen readers on the web frontend by setting the lang attribute on the post content
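
As a rough illustration of the first and third points, the sketch below picks a post's language (falling back to an identity-level default) and builds a minimal ActivityPub Note that carries `contentMap`. The function names and the `identity_default` parameter are hypothetical and not taken from this PR; only `content` and `contentMap` come from the ActivityStreams vocabulary.

```python
from typing import Optional


def resolve_language(requested: Optional[str], identity_default: Optional[str]) -> Optional[str]:
    """Pick the language for a new local post.

    The language sent with the request wins; otherwise fall back to the
    identity's preferred language (a hypothetical config setting).
    """
    for candidate in (requested, identity_default):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    return None


def build_note(content: str, language: Optional[str]) -> dict:
    """Build a minimal Note; contentMap is added only when a language is known."""
    note = {"type": "Note", "content": content}
    if language:
        # contentMap maps a language tag to the content in that language.
        note["contentMap"] = {language: content}
    return note
```

For example, `build_note("<p>Hallo</p>", resolve_language(None, "de"))` would produce a Note whose `contentMap` has a single `de` entry.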

Note on client compatibility:

  • Ivory doesn't seem to support any of this so far
  • Elk
    • allows setting the language on each post
    • doesn't use the posting:default:language preference, but uses its interface language as the default value when creating a new post
    • offers to translate a post when it is in a different language

chdorner avatar May 15 '23 10:05 chdorner

LGTM

You could also update the content_vector_gin index to use the new language field in the SearchVector's config attribute instead of the default english, as done in the djangoproject.com search document: https://github.com/django/djangoproject.com/blob/main/docs/search.py#L42

pauloxnet avatar May 15 '23 19:05 pauloxnet

@pauloxnet interesting. Just to make sure I'm getting this right: since the Mastodon languages are stored as 2-char strings (ISO 639-1: en, de, etc.), we'd have to store an extra column with the search config, translating these values to the known postgres search configs (looks like there are 28, plus simple), and then recreate the search index pointing the config to that column instead of hardcoding the english config.

chdorner avatar May 15 '23 20:05 chdorner

It seems right.

In the Django Project code I used a dictionary to map two-character ISO language codes to the language names that PostgreSQL uses for its search configs.
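
A minimal sketch of that kind of mapping, for illustration only. The dictionary name, the helper function, and the subset of languages shown are assumptions; this is not the code from djangoproject.com. Anything unmapped falls back to PostgreSQL's `simple` configuration.

```python
from typing import Optional

# Hypothetical mapping from ISO 639-1 codes to PostgreSQL text search
# configuration names. Only a subset of the built-in configs is listed.
PG_SEARCH_CONFIGS = {
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "fr": "french",
    "it": "italian",
    "nl": "dutch",
    "pt": "portuguese",
    "ru": "russian",
    "sv": "swedish",
    "tr": "turkish",
}


def search_config_for(language: Optional[str]) -> str:
    """Return the search config name for a post language, or 'simple' if unknown."""
    if not language:
        return "simple"
    # Keep only the primary subtag so that e.g. "en-US" maps like "en".
    primary = language.strip().lower().split("-")[0]
    return PG_SEARCH_CONFIGS.get(primary, "simple")
```

A lookup like this can run at save time, so the stored value is always a config name PostgreSQL knows.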

Maybe there's a similar way to map language codes to config names without an additional field?

pauloxnet avatar May 15 '23 21:05 pauloxnet

I think the problem with changing the SearchVector language is that it's embedded into the index, is it not? We can't have 20-odd indexes on the content of posts, one for each language, and doing a search query without an index for it sounds painful.

andrewgodwin avatar May 15 '23 22:05 andrewgodwin

> I think the problem with changing the SearchVector language is that it's embedded into the index, is it not?

Actually, I think you can have an index based on the language stored in a column in the same table, but I'd leave the change until after this PR is merged.

pauloxnet avatar May 16 '23 07:05 pauloxnet

Did a bit more research on adding the tsvector config to the index for this. I can get it to work with plain SQL, but I'm having a hard time getting Django to do this for me. Any help would be appreciated!

-- adding a column of type `regconfig` to store each record's tsvector config
ALTER TABLE activities_post ADD COLUMN tsvector_config regconfig DEFAULT 'simple';

-- creating the GIN index
DROP INDEX content_vector_gin;
CREATE INDEX content_vector_gin ON activities_post USING GIN (to_tsvector(tsvector_config, content));

chdorner avatar May 29 '23 15:05 chdorner

The SQL code looks OK to me. What's the issue with generating that same code with the Django ORM? My guess is the regconfig column type?

pauloxnet avatar May 29 '23 23:05 pauloxnet

@chdorner since the language field is no longer nullable, some type hints and checks on None need to be updated.

pauloxnet avatar Jun 14 '23 07:06 pauloxnet

@chdorner @AstraLuma this is an absolutely great feature. Any chance of getting this updated / merged? Happy to do anything I can to help.

alphatownsman avatar Feb 10 '24 15:02 alphatownsman