mastodon
Autocomplete hashtags to camel case
Pitch
Have the API send out hashtags in camelcase. For example, it would be better if users autocompleted and searched for #EiffelTower than the current result, #eiffeltower.
Motivation
This is an accessibility issue. Screen readers have much more difficulty parsing hashtags that are not in camel case.
If I remember correctly, the first encountered case variant is considered the “canonical” one by the instance, and can be changed by a server administrator. This is the one used in completions when you have not used the hashtag yourself. I think it's then supposed to stick with whatever you used yourself.
Some good arguments for camel case (aside from that camels are cool) - https://www.picklejarcommunications.com/blog/why-you-should-use-camel-case-for-your-hashtags/
We don't need to be convinced that people should use CamelCase for hashtags! But it's not something we can enforce at the software level: software only knows hashtags as users spell them out.
Is it really infeasible to improve the situation on the software side? For example, autocomplete could always offer a few suggestions that adjust capitalization, based on checks against concatenations of dictionary words.
- Would making that the default be something that can be done on the software side?
@SeaGriff I like that idea, but given the range of languages that Mastodon works with that might be difficult.
Perhaps an easier approach is to simply make a suggestion to users when a hashtag is over, say, 15 characters.
So if someone wrote #digitalgovernment they would get a little popup saying "Consider using camel case to make your hashtag more readable".
But if someone wrote #digitalgov nothing would happen as it is too short. This would be a way to actively remind users, simply based on the length of the text.
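A minimal sketch of that length-based reminder, just to make the idea concrete. The function name, the message wiring, and the choice to count characters without the leading `#` are all assumptions for illustration, not anything Mastodon implements.

```python
from typing import Optional

HINT = "Consider using camel case to make your hashtag more readable"


def camel_case_hint(hashtag: str, min_length: int = 15) -> Optional[str]:
    """Return a reminder for long all-lowercase hashtags, otherwise None.

    Hypothetical helper: the threshold and the decision to ignore the
    leading '#' when measuring length are assumptions.
    """
    body = hashtag.lstrip("#")
    has_letters = any(ch.isalpha() for ch in body)
    # Only nudge when the tag is long and contains no capitals at all.
    if has_letters and len(body) > min_length and body == body.lower():
        return HINT
    return None


print(camel_case_hint("#digitalgovernment"))  # long and all lowercase: hint
print(camel_case_hint("#digitalgov"))         # too short: None
```

This needs no dictionary, so it behaves the same regardless of the post's language.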
@ClearlyClaire It's not about enforcing CamelCase, but simply reminding users that it is a best practice that will make their message more readable for everyone. People should be able to do whatever they want, but...
A middle ground: If the first hashtag used by the user is canonical but lower case entirely, and the user later manually uses (or perhaps, as a form of crowdsourcing, if the federated feed sees) a camel case hashtag in the same grouping (as detected by regex location of capitals, not a dictionary), with the same characters, the new one should overwrite the original as the new preferred one. (I wouldn't apply this to upper case as it might confuse acronyms and abbreviations).
@veale, this is what @ClearlyClaire said is already the case in the first reply, I think?
No, not completely! What I understand from @veale's suggestion is that basically, the first instance of non-all-lowercase spelling of the hashtag encountered by the server becomes the server-wide “canonical” case.
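To make that reading concrete, here is a rough sketch of the rule as I understand it. `CanonicalTags` and `observe` are made-up names and this is not how Mastodon stores hashtags; it only shows the "first mixed-case spelling replaces an all-lowercase canonical" logic, with all-uppercase spellings left out as @veale suggested.

```python
class CanonicalTags:
    """Toy in-memory registry; a hypothetical stand-in for the server's tag table."""

    def __init__(self) -> None:
        # lowercased tag -> spelling currently treated as canonical
        self._canonical = {}

    def observe(self, spelling: str) -> str:
        """Record a spelling seen by the server and return the canonical one."""
        key = spelling.lower()
        current = self._canonical.get(key)
        if current is None:
            # First spelling ever seen for this tag.
            self._canonical[key] = spelling
        elif (
            current == current.lower()        # canonical is still all lowercase
            and spelling != spelling.lower()  # new spelling has capitals
            and not spelling.isupper()        # but is not ALL CAPS
        ):
            self._canonical[key] = spelling
        return self._canonical[key]


tags = CanonicalTags()
tags.observe("eiffeltower")
tags.observe("EIFFELTOWER")          # ignored: all uppercase
print(tags.observe("EiffelTower"))   # EiffelTower
print(tags.observe("eiffelTower"))   # still EiffelTower
```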
What is the feasibility of tracking the number of times various spellings are used? The most popular, say, two not-all-lowercase spellings could both be canonical cases, with purely lowercase spellings as fallbacks.
Not sure we need a perfect solution, just a better one.
Would be useful to know that the hashtags are case-insensitive when it comes to aggregating feeds. That #CaseSensitivePotato, #CaseSensitivepotato & #casesensitivepotato will all show up in the same feed.
Yet we want to encourage the use of #CaseSensitivePotato vs all of the others (as it is the most readable).
Could we just pick the instance with the most instances of upper case characters? Do we have to worry about #CaSeSeNsiTivePoTatO? Probably not.. But maybe..
Do we have to worry about #CaSeSeNsiTivePoTatO?
I think we do need to worry about #CaseSensitivePOtato, though. And #CASESENSITIVEPOTATO.
@erbridge There is indeed a fair amount to worry about. At least:
- Problems with locking in the 'first you see' approach:
- #miSpelling > #mispelling (single words incorrectly capitalised) With a 'first you see becomes canonical' approach on a server level, people could propagate errors like this. I don't think it's really a big concern if this happens locally (esp if people can edit their hashtags).
- No reliable way to rank camel case-esque results correctly with regex approaches. You could try and penalise consecutive capitals but that would create the following issues
- #PartyInTheUSA < #PartyInTheUsa (acronyms/abbreviations)
- #WhatAUIMess < #WhatAuiMess
Measuring on popularity might help compensate for errors but is likely a trickier engineering task.
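For illustration, a naive regex-only score along these lines reproduces the rankings listed above. The scoring rule (one point per capital that starts a word, minus one per capital that directly follows another capital) is an assumption chosen to show the failure mode, not a proposed implementation.

```python
import re


def naive_camel_score(tag: str) -> int:
    """Regex-only score: reward word-initial capitals, penalise consecutive ones."""
    # Capitals not preceded by another capital look like word starts.
    word_starts = len(re.findall(r"(?<![A-Z])[A-Z]", tag))
    # Capitals directly after another capital are penalised.
    consecutive = len(re.findall(r"(?<=[A-Z])[A-Z]", tag))
    return word_starts - consecutive


for tag in ["PartyInTheUSA", "PartyInTheUsa", "WhatAUIMess", "WhatAuiMess",
            "miSpelling", "mispelling"]:
    print(tag, naive_camel_score(tag))
```

With this rule #PartyInTheUsa outscores #PartyInTheUSA and #miSpelling outscores #mispelling, which is exactly the problem: without a dictionary or usage data the score cannot tell an acronym from sloppy capitalisation.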
Avoiding exactly these issues is what I had in mind when suggesting measuring on popularity. In particular, I think offering at least the two most popular capitalizations should make weird corner cases self-correcting. Is it much trickier to track how often a hashtag has been used than whether it has been used?
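As a rough sketch of what I mean, counting spellings is not much more than a counter per case-insensitive tag. `SpellingStats`, `record` and `suggestions` are hypothetical names, and a real server would persist this in its database rather than in memory.

```python
from collections import Counter, defaultdict
from typing import List


class SpellingStats:
    """Toy in-memory usage counts per spelling, keyed by lowercased tag."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def record(self, spelling: str) -> None:
        """Count one use of this exact spelling."""
        self._counts[spelling.lower()][spelling] += 1

    def suggestions(self, typed: str, limit: int = 2) -> List[str]:
        """Most-used not-all-lowercase spellings, lowercase as the fallback."""
        counts = self._counts.get(typed.lower(), Counter())
        cased = [s for s, _ in counts.most_common() if s != s.lower()]
        return cased[:limit] or [typed.lower()]


stats = SpellingStats()
for s in ["casesensitivepotato"] * 5 + ["CaseSensitivePotato"] * 3 + ["CaseSensitivePOtato"]:
    stats.record(s)
print(stats.suggestions("casesens" + "itivepotato"))
```

The all-lowercase spelling can be the most used overall and still never be suggested while any cased spelling exists.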
Why can't it simply be set to accept the capitalization that the user has used? I came here because I was typing #LanguageLearning and mastodon wanted to change it to #languagelearning. Why not simply treat the two as the same and leave my preferred capitalization alone when acknowledging I am using the hashtag the way I prefer? Isn't this easier than the software having to decide which is correct? The problem now is that it is "correcting" me towards an option I don't want.
What about tags that might be split into words more than one way, such as #PenIsland?
@kerim because sometimes people will have a hashtag they have never written before autopredicted, and it's a good idea to prefer an accessible autoprediction over an inaccessible one.
Could you use the number of instances of each capitalisation of a hashtag to predict which is the best? E.g. if ONLY lowercase and CamelCase (without consecutive capitals) versions exist, then it's fairly reasonable to use the camel case version.
If there are 3 versions of a hashtag (#PartyInTheUSA, #partyintheusa, #partyInTheUsa), then once the hashtag has more than 3 uses it's almost certainly going to be the first version that has the most uses, in which case just use that.
This would avoid the problem of #gOOfYcAsE to a large degree, because it's unlikely that two versions of that would have exactly the same capitalisation.
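A small sketch of that decision rule, assuming `counts` is a non-empty mapping from each observed spelling to its number of uses. `pick_preferred` is a made-up name and the tie handling is arbitrary.

```python
import re
from collections import Counter


def pick_preferred(counts: Counter) -> str:
    """Pick a display spelling from per-spelling usage counts (assumed non-empty)."""
    cased = [s for s in counts if s != s.lower()]
    # "Clean" camel case here means: has capitals, but never two in a row.
    clean = [s for s in cased if not re.search(r"[A-Z]{2}", s)]
    if len(cased) == 1 and len(clean) == 1:
        # Only lowercase plus a single clean CamelCase variant exist.
        return clean[0]
    # Otherwise fall back to whichever spelling is used most.
    return counts.most_common(1)[0][0]


print(pick_preferred(Counter({"languagelearning": 10, "LanguageLearning": 2})))
print(pick_preferred(Counter({"partyintheusa": 1, "partyInTheUsa": 1, "PartyInTheUSA": 7})))
```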
There may also be some inference you can make from the initial letter's capitalisation? not sure.
Also, please change the issue capitalisation! :stuck_out_tongue_closed_eyes:
To change the existing collection of hashtags
- Set up a central reference instance for hashtags in CamelCase writing.
- Federated instances periodically check their hashtag collection against the reference.
- Zero risk if the reference server is down.
To find the correct Capitalization
- Interested users can post their suggestion to the reference instance.
- The reference instance counts out the winner.
- This would cover the language problem as well.
And yes, please change "Camelcase" to "CamelCase" in the title of the current issue :o)
See also #19692.
Scoring shouldn't be too hard.
Splitting on case change boundaries is how my check-spelling project works.
For language tagged content it's then not particularly painful to check to see if a given split item is in the dictionary.
You definitely want to give extra points to each word, negative points for each non-word, points for each time a given casing is used (this might need to be weighted). There's probably an argument for a minimum word length for points/non-points. I moved the check-spelling project from 2 to 3 letters for the shortest word a while ago.
This is an interesting case: #miSpelling > #mispelling
The former should win, since the proper word is misspelling. For check-spelling, it would flag #mispelling and be absolutely silent on #miSpelling.
| Hash tag | tokens | subscores | score |
|---|---|---|---|
| #miSpelling | mi, spelling | 0, 1 | 1 |
| #mispelling | mispelling | -1 | -1 |
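The table can be reproduced with a short sketch like the one below. This is not the check-spelling code: the regex, the toy `WORDS` set and the function names are assumptions for illustration, and the per-usage points mentioned above are left out.

```python
import re
from typing import List

# Toy dictionary (assumption); a real one would depend on the post's language.
WORDS = {"spelling", "misspelling", "eiffel", "tower"}

# Split on case-change boundaries, keeping runs of capitals together.
TOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def subscores(tag: str, min_len: int = 3) -> List[int]:
    """+1 per dictionary word, -1 per non-word, 0 for tokens shorter than min_len."""
    scores = []
    for token in TOKEN_RE.findall(tag):
        if len(token) < min_len:
            scores.append(0)
        elif token.lower() in WORDS:
            scores.append(1)
        else:
            scores.append(-1)
    return scores


def score(tag: str) -> int:
    """Total score for one capitalisation of a hashtag."""
    return sum(subscores(tag))


print(subscores("miSpelling"), score("miSpelling"))  # [0, 1] 1
print(subscores("mispelling"), score("mispelling"))  # [-1] -1
```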
For the USA case, it's possible to detect runs of uppercase and have a dictionary that's aware of uppercase words. (Again, speaking from experience.)
Do people generally only put the uppercase word at the end of hashtags?
In code, we'll get things like TLSInterface and my code parses it as TLS + Interface.
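For anyone curious, that kind of split can be sketched with a single regular expression. This is an illustrative pattern, not the one check-spelling actually uses, and `split_tag` is a made-up name.

```python
import re
from typing import List

# A run of capitals ends before a capital that starts a lowercase word,
# so "TLSInterface" yields "TLS" and "Interface".
TOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_tag(tag: str) -> List[str]:
    """Split a hashtag body into word-like tokens on case boundaries."""
    return TOKEN_RE.findall(tag)


print(split_tag("TLSInterface"))   # ['TLS', 'Interface']
print(split_tag("PartyInTheUSA"))  # ['Party', 'In', 'The', 'USA']
print(split_tag("WhatAUIMess"))    # ['What', 'AUI', 'Mess']
```

The last example shows the limit of a purely regex-based split: it cannot know the intended reading was "A UI Mess", which is where an uppercase-aware dictionary would have to come in.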
Do people generally only put the uppercase word at the end of hashtags?
Since people aren't robots there will never be something like "generally".
We have "PascalCase" and "camelCase". The latter is often used for both by non-techies (like in the current discussion). For our purpose the difference does not really matter. An algorithm could consider them identical when counting the capitalizations used.
I'm asking about #GoUSAGo or #MessiFTWinWCS or something like that.
Scoring #TitleCase and #camelCase together in terms of favoring them collectively over #caseless seems reasonable.
My naive preference for a tie breaker in case the counts for #TitleCase and #camelCase are the same would be the former -- roughly speaking my English hat trumps my programming hat.
OK, now I understand what you are talking about.
For the same reason I think there is no common rule. And I am not sure whether the different screen readers share a common ruleset for how to divide these hashtags. From my perspective these are edge cases, far less important than the main problem.
Basically, follow the advice folks have been giving to educate social media managers on how to build accessible hashtags for other social media sites:
- https://www.abilitynet.org.uk/news-blogs/5-ways-make-your-tweets-accessible
- https://aem.cast.org/create/creating-accessible-social-media
- https://ecampusontario.pressbooks.pub/accessibledigitalcontenttraining/chapter/accessible-use-of-camelcase-and-structuring-posts/
- https://www.torontomu.ca/accessibility/guides-resources/social-media/
- https://www.rnib.org.uk/living-with-sight-loss/assistive-aids-and-technology/everyday-tech/navigation-and-communication/guide-to-accessible-social-media/
- https://accessibility.princeton.edu/guidelines/social-media
- https://averment.medium.com/why-does-writing-your-hashtags-in-camel-case-make-them-more-accessible-and-what-are-the-benefits-9e3b8e13e920
@kerim @veale Actually, it does accept what the user has typed. If I ignore the autosuggestion and keep typing, it leaves my capitalization intact.
Steps:
- Type #AltText (capital A, capital T)
Desired result: Autosuggest suggests #AltText (capital A, capital T)
Actual result: Autosuggest suggests #alttext (all lower case)
- Type space
Desired and actual results: The text "#AltText" (capital A, capital T) appears in my toot edit field.
That said, it is a distraction.
@ChasBelov It accepts it, but the issue is about preferring it, to encourage use.
This will help everyone quickly decipher a hashtag.
Would it be a quick win on this front to have an admin or user option (enabled by default?) to disallow all-lowercase hashtags when posts are submitted, to force use of CamelCase?