takahe
Support setting language on posts
I think this is everything that's needed to implement Mastodon's post language feature.
- New identity config setting for the preferred language, which will be used when creating local posts without a `language` key in the request.
- Support for the Mastodon API, either creating/updating a post, or wherever posts are rendered
- Support for ActivityPub federation via `contentMap`
- Helps screenreaders on the web frontend by setting the `lang` attribute on the post content
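As a rough illustration of how the points above fit together, here is a minimal sketch in plain Python. The function names, the `post-content` class, and the dict shapes are hypothetical and are not taken from the Takahē codebase; only the `language` request key, the `contentMap` property, and the `lang` attribute come from the description above.

```python
from html import escape


def resolve_language(request_data: dict, preferred_language: str) -> str:
    """Use the request's "language" key, else the identity's preferred language."""
    return request_data.get("language") or preferred_language


def to_ap_note(content_html: str, language: str) -> dict:
    """Build a minimal ActivityPub Note that carries the language via contentMap."""
    return {
        "type": "Note",
        "content": content_html,
        # contentMap keys are language codes, values are the content in that language
        "contentMap": {language: content_html},
    }


def render_post(content_html: str, language: str) -> str:
    """Wrap already-sanitised post HTML in an element with a lang attribute."""
    # The CSS class name is an assumption made for this example only.
    return f'<div class="post-content" lang="{escape(language, quote=True)}">{content_html}</div>'


# A request without a "language" key falls back to the identity's preference.
language = resolve_language({"status": "Hallo"}, preferred_language="de")
note = to_ap_note("<p>Hallo</p>", language)
html = render_post("<p>Hallo</p>", language)
```

The fallback only applies when the request omits the `language` key (or sends an empty value); an explicit value in the request always wins.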
Note on client compatibility:
- Ivory doesn't seem to support any of this so far
- Elk
  - allows setting the language on each post
  - doesn't use the `posting:default:language` preference, but uses its interface language as the default value when creating a new post
  - when viewing a post in a different language, it offers to translate it
LGTM
You can also update the `content_vector_gin` index to use the new `language` field in the `SearchVector`'s `config` attribute instead of the default `english`, as done in the djangoproject.com search document:
https://github.com/django/djangoproject.com/blob/main/docs/search.py#L42
@pauloxnet interesting. Just to make sure that I'm getting this right: since the Mastodon languages are stored as 2-char strings (ISO-639-1: `en`, `de`, etc.), we'd have to store an extra column with the search config, translating these values to the known Postgres search configs (looks like there are 28, plus `simple`), and then recreate the search index, pointing the config to that column instead of hardcoding the `english` config.
It seems right.
In the Django Project code I used a dictionary to map two-character ISO language codes to language names for the PostgreSQL config.
Maybe there's a similar way to map language codes to config names without an additional field?
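A dictionary-based mapping along those lines could look roughly like the sketch below. The dictionary is deliberately partial and the names `ISO_TO_PG_CONFIG` and `search_config_for` are made up for this example; the real list of available configurations depends on the PostgreSQL installation.

```python
from typing import Optional

# Partial, illustrative mapping from ISO 639-1 codes to PostgreSQL
# text search configuration names. Not the full list.
ISO_TO_PG_CONFIG = {
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "it": "italian",
    "nl": "dutch",
    "pt": "portuguese",
    "ru": "russian",
    "sv": "swedish",
}


def search_config_for(language: Optional[str]) -> str:
    """Return a search config name for a language code, falling back to "simple"."""
    if not language:
        return "simple"
    # Reduce region-qualified codes such as "pt-BR" to their base language.
    code = language.strip().lower().split("-")[0]
    return ISO_TO_PG_CONFIG.get(code, "simple")
```

Falling back to `simple` keeps posts in unmapped or unknown languages searchable, just without stemming or stop-word removal.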
I think the problem with changing the SearchVector language is that it's embedded into the index, is it not? We can't have 20-odd indexes on the content of posts, one for each language, and doing a search query without an index for it sounds painful.
> I think the problem with changing the SearchVector language is that it's embedded into the index, is it not?
Actually, I think you can have an index based on the language stored in a column in the same table, but I'd leave the change until after this PR is merged.
Did a bit more research on adding the `tsvector` config to the index for this. I can get it to work with just SQL, but I'm having a hard time getting Django to do this for me. Any help would be appreciated!
-- adding a column of type `regconfig` to store each record's tsvector config
ALTER TABLE activities_post ADD COLUMN tsvector_config regconfig DEFAULT 'simple';
-- creating the GIN index
DROP INDEX content_vector_gin;
CREATE INDEX content_vector_gin ON activities_post USING GIN (to_tsvector(tsvector_config, content));
The SQL code looks fine to me. What's your issue in generating the same code with the Django ORM? I'd guess it's the `regconfig` column type?
@chdorner since the `language` field is no longer nullable, some type hints and checks on `None` need to be updated.
@chdorner @AstraLuma this is an absolutely great feature. Any chance of getting this updated / merged? Happy to do anything I can to help.