mastodon Autocomplete hashtags to camel case

Pitch

Have the API send out hashtags in camelcase. For example, it would be better if users autocompleted and searched for #EiffelTower than the current result, #eiffeltower.

Motivation

this is an accessibility issue. screenreaders have much more difficulty parsing hashtags that are not in camelcase.

Nov 09 '22 19:11 SeaGriff

If I remember correctly, the first encountered case variant is considered the “canonical” one by the instance, and can be changed by a server administrator. This is the one that is used in completions when you have not used the hashtag yourself. I think it's then supposed to stick with what you yourself used.

Nov 10 '22 10:11 ClearlyClaire

Some good arguments for camel case (aside from that camels are cool) - https://www.picklejarcommunications.com/blog/why-you-should-use-camel-case-for-your-hashtags/

Nov 10 '22 12:11 mgifford

We don't need to be convinced that people should use CamelCase for hashtags! But it's not something we can enforce at the software level: software only knows hashtags as users spell them out.

Nov 10 '22 12:11 ClearlyClaire

is it really infeasible to improve the situation on the software side? for example, autocomplete could always offer a few suggestions that adjust capitalization, based on checks against concatenations of dictionary words

would making that the default be something that can be done on the software side?

Nov 10 '22 13:11 SeaGriff

@SeaGriff I like that idea, but given the range of languages that Mastodon works with that might be difficult.

Perhaps an easier way to consider this is to simply make a suggestion to users when a hashtag is over say 15 characters.

So if someone wrote #digitalgovernment they would get a little popup saying "Consider using camel case to make your hashtag more readable".

But if someone wrote #digitalgov nothing would happen as it is too short. This would be a way to actively remind users, simply based on the length of the text.
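A minimal Python sketch of that length-based reminder, under stated assumptions: the function name is hypothetical, the 15-character threshold is the "say 15" from above, and the `#` is not counted towards the length.

```python
def should_suggest_camel_case(hashtag, min_length=15):
    """Return True when a long hashtag with no capitals might deserve a hint."""
    body = hashtag.lstrip("#")
    # Only nudge when the tag is longer than the threshold and has no capitals.
    return len(body) > min_length and body == body.lower()


if __name__ == "__main__":
    for tag in ("#digitalgovernment", "#digitalgov", "#DigitalGovernment"):
        print(tag, should_suggest_camel_case(tag))
```

Only the first of the three tags would trigger the popup; the check never looks at a dictionary, so it works the same way for any language.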

@ClearlyClaire It's not about enforcing CamelCase, but simply reminding users that it is a best practice that will make their message more readable for everyone. People should be able to do whatever they want, but...

Nov 10 '22 16:11 mgifford

A middle ground: if the canonical hashtag (the first one used) is entirely lower case, and the user later manually uses a camel case hashtag with the same characters (or perhaps, as a form of crowdsourcing, the federated feed sees one), the new one should overwrite the original as the preferred spelling. Camel case would be detected by a regex locating the capitals, not by a dictionary. (I wouldn't apply this to all-upper-case spellings, as that might confuse acronyms and abbreviations.)
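Roughly, that rule could look like the Python sketch below. The function and pattern names are hypothetical, this is not how Mastodon stores canonical spellings, and the regex only covers ASCII letters.

```python
import re

# Assumed test for "camel case": at least one capital and one lowercase
# letter, so all-upper-case spellings never replace the canonical one.
MIXED_CASE = re.compile(r"(?=.*[A-Z])(?=.*[a-z])")


def update_canonical(current, seen):
    """Return the spelling to prefer after `seen` has been encountered."""
    same_tag = current.lower() == seen.lower()
    current_is_lowercase = current == current.lower()
    if same_tag and current_is_lowercase and MIXED_CASE.match(seen):
        return seen
    return current


if __name__ == "__main__":
    print(update_canonical("eiffeltower", "EiffelTower"))
    print(update_canonical("eiffeltower", "EIFFELTOWER"))
```

Once a mixed-case spelling is canonical, later variants do not displace it in this sketch, which is the "first non-lowercase spelling wins" behaviour.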

Nov 10 '22 16:11 veale

@veale, this is what @ClearlyClaire said is already the case in the first reply, I think?

Nov 10 '22 21:11 Cassolotl

No, not completely! What I understand from @veale's suggestion is that basically, the first instance of non-all-lowercase spelling of the hashtag encountered by the server becomes the server-wide “canonical” case.

Nov 10 '22 21:11 ClearlyClaire

what is the feasibility of tracking the number of times various spellings are used? the most popular, say, two not-all-lowercase spellings could both be canonical cases, with purely lowercase spellings as fallbacks
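To make the question concrete, here is a small in-memory Python sketch of counting spellings. `SpellingTracker` is a hypothetical helper; a real server would presumably keep these counts in its database.

```python
from collections import Counter, defaultdict


class SpellingTracker:
    """Counts how often each exact spelling of a hashtag is seen."""

    def __init__(self):
        # Key: lowercased tag. Value: Counter of the exact spellings seen.
        self.counts = defaultdict(Counter)

    def record(self, tag):
        self.counts[tag.lower()][tag] += 1

    def suggestions(self, tag, limit=2):
        key = tag.lower()
        seen = self.counts.get(key, Counter())
        # Most-used spellings first, skipping the all-lowercase one.
        ranked = [s for s, _ in seen.most_common() if s != key]
        # Fall back to the lowercase spelling when nothing better is known.
        return ranked[:limit] or [key]
```

Autocomplete would call `record` for every hashtag it sees and `suggestions` when offering completions, so the difference from tracking "has it been used" is one counter per spelling.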

Nov 10 '22 23:11 SeaGriff

Not sure we need a perfect solution, just a better one.

Would be useful to know that the hashtags are case-insensitive when it comes to aggregating feeds. That #CaseSensitivePotato, #CaseSensitivepotato & #casesensitivepotato will all show up in the same feed.

Yet we want to encourage the use of #CaseSensitivePotato vs all of the others (as it is the most readable).

Could we just pick the instance with the most instances of upper case characters? Do we have to worry about #CaSeSeNsiTivePoTatO? Probably not.. But maybe..

Nov 11 '22 01:11 mgifford

> Do we have to worry about #CaSeSeNsiTivePoTatO?

I think we do need to worry about #CaseSensitivePOtato, though. And #CASESENSITIVEPOTATO.

Nov 11 '22 21:11 erbridge

@erbridge There is indeed a fair amount to worry about. At least:

Problems with locking in the 'first you see' approach:

#miSpelling > #mispelling (single words incorrectly capitalised): with a 'first you see becomes canonical' approach at the server level, people could propagate errors like this. I don't think it's really a big concern if this happens locally (especially if people can edit their hashtags).

No reliable way to rank camel case-esque results correctly with regex approaches. You could try to penalise consecutive capitals, but that would create the following issues:

#PartyInTheUSA < #PartyInTheUsa (acronyms/abbreviations)
#WhatAUIMess < #WhatAuiMess
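A short Python sketch of such a penalty shows the problem. The function name is hypothetical and only ASCII capitals are considered.

```python
import re


def consecutive_capital_penalty(tag):
    """Count capitals that directly follow another capital."""
    return len(re.findall(r"(?<=[A-Z])[A-Z]", tag))


if __name__ == "__main__":
    for tag in ("#PartyInTheUSA", "#PartyInTheUsa", "#WhatAUIMess", "#WhatAuiMess"):
        print(tag, consecutive_capital_penalty(tag))
```

The spellings with a correctly capitalised acronym collect the penalty while the less readable ones score zero, so a ranking based on this measure alone would prefer the wrong variant.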

Measuring on popularity might help compensate for errors but is likely a trickier engineering task.

Nov 12 '22 10:11 veale

avoiding exactly these issues is what I had in mind when suggesting measuring on popularity. in particular I think offering at least the two most popular capitalizations should make weird corner cases self correcting. is it much trickier to track how often a hashtag has been used than whether it has been used?

Nov 12 '22 22:11 SeaGriff

Why can't it simply be set to accept the capitalization that the user has used? I came here because I was typing #LanguageLearning and mastodon wanted to change it to #languagelearning. Why not simply treat the two as the same and leave my preferred capitalization alone when acknowledging I am using the hashtag the way I prefer? Isn't this easier than the software having to decide which is correct? The problem now is that it is "correcting" me towards an option I don't want.

Nov 21 '22 01:11 kerim

What about tags that might be split into words more than one way, such as #PenIsland?

Nov 21 '22 05:11 eternaldensity

@kerim because sometimes people will have a hashtag they have never written before autopredicted, and it's a good idea to prefer an accessible autoprediction over an inaccessible one.

Nov 21 '22 10:11 veale

Could you use the number of instances of a hashtag's capitalisation to predict which is the best? e.g. if ONLY lowercase and CamelCase (without consecutive capitals) versions exist, then it's fairly reasonable to use the camel case version.

If there are 3 versions of #PartyInTheUSA (#partyintheusa, #partyInTheUsa), then if the hashtag has more than 3 uses it's almost certainly going to be the first version that has the most uses, in which case just use that.

This would avoid the problem of #gOOfYcAsE to a large degree, because it's unlikely that two versions of that would have exactly the same capitalisation.

There may also be some inference you can make from the initial letter's capitalisation? not sure.

Dec 01 '22 00:12 naught101

Also, please change the issue capitalisation! :stuck_out_tongue_closed_eyes:

Dec 01 '22 00:12 naught101

To change the existing collection of hashtags

Set up a central reference instance for hashtags in CamelCase writing.
Federated instances periodically check their hashtag collection against the reference.
Zero risk if the reference server is down.

To find the correct Capitalization

Interested users can post their suggestion to the reference instance.
The reference instance counts out the winner.
This would cover the language problem as well.

And yes, please change "Camelcase" to "CamelCase" in the title of the current issue :o)

Dec 20 '22 19:12 SmallBlueElephant

See also #19692.

Dec 21 '22 14:12 lpar

Scoring shouldn't be too hard.

Splitting on case change boundaries is how my check-spelling project works.

For language tagged content it's then not particularly painful to check to see if a given split item is in the dictionary.

You definitely want to give extra points to each word, negative points for each non-word, points for each time a given casing is used (this might need to be weighted). There's probably an argument for a minimum word length for points/non-points. I moved the check-spelling project from 2 to 3 letters for the shortest word a while ago.

This is an interesting case: #miSpelling > #mispelling

The former should win, since the proper word is misspelling. For check-spelling, it would flag #mispelling and be absolutely silent on #miSpelling.

| Hash tag | Tokens | Subscores | Score |
| --- | --- | --- | --- |
| `#miSpelling` | `mi`, `spelling` | 0, 1 | 1 |
| `#mispelling` | `mispelling` | -1 | -1 |
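The scoring in the table can be sketched in Python as follows. This is not the check-spelling implementation: the word list, the function names and the ASCII-only regex are assumptions, and the per-use weighting mentioned above is left out.

```python
import re

# Tiny stand-in word list (an assumption); a real check would load a
# dictionary for the post's language.
WORDS = {"spelling", "misspelling"}
MIN_WORD_LENGTH = 3  # tokens shorter than this earn no points either way


def split_on_case(tag):
    """Split a hashtag at case-change boundaries, e.g. miSpelling -> mi, Spelling."""
    # Acronym runs get no special handling in this simple pattern.
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", tag.lstrip("#"))


def score(tag):
    """+1 per dictionary word, -1 per non-word, 0 for very short tokens."""
    total = 0
    for token in split_on_case(tag):
        if len(token) < MIN_WORD_LENGTH:
            continue
        total += 1 if token.lower() in WORDS else -1
    return total


if __name__ == "__main__":
    print(score("#miSpelling"), score("#mispelling"))
```

With the three-letter minimum, `mi` contributes nothing, `spelling` earns a point, and `mispelling` loses one, which reproduces the two rows of the table.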

Dec 21 '22 15:12 jsoref

For the USA case, it's possible to detect runs of uppercase and have a dictionary that's aware of uppercase words. (Again, speaking from experience.)

Do people generally only put the uppercase word at the end of hashtags?

In code, we'll get things like TLSInterface and my code parses it as TLS+Interface
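A rough Python equivalent of that split, as a sketch only: the pattern below is an assumption rather than the check-spelling code, and it handles ASCII letters and digits only.

```python
import re

# A run of capitals counts as one token as long as it is not followed by a
# lowercase letter; the last capital before a lowercase letter starts the
# next word instead.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_tokens(tag):
    """Split a hashtag into words, keeping uppercase runs together."""
    return TOKEN.findall(tag.lstrip("#"))


if __name__ == "__main__":
    print(split_tokens("TLSInterface"))
    print(split_tokens("#PartyInTheUSA"))
```

Each token can then be looked up in a dictionary that knows uppercase words such as USA.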

Dec 21 '22 15:12 jsoref

> Do people generally only put the uppercase word at the end of hashtags?

Since people aren't robots there will never be something like "generally".

We have "PascalCase" and "camelCase". The latter is often used for both by non-techies (like in the current discussion). For our purpose the difference does not really matter. An algorithm could consider them as identical when counting out used capitalization.

Dec 21 '22 16:12 SmallBlueElephant

I'm asking about #GoUSAGo or #MessiFTWinWCS or something like that.

Scoring #TitleCase and #camelCase together in terms of favoring them collectively over #caseless seems reasonable.

My naive preference for a tie breaker in case the counts for #TitleCase and #camelCase are the same would be the former -- roughly speaking my English hat trumps my programming hat.
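As a Python sketch of that preference (hypothetical function name; the usage counts are assumed to be collected elsewhere, and tags are passed without the leading `#`):

```python
from collections import Counter


def preferred(usage):
    """Pick a spelling from a non-empty Counter of exact spellings."""
    # Any spelling with a capital beats the caseless one, regardless of count.
    cased = {s: n for s, n in usage.items() if s != s.lower()}
    pool = cased or dict(usage)
    # Highest count wins; on a tie, a leading capital (TitleCase) wins.
    return max(pool, key=lambda s: (pool[s], s[0].isupper()))


if __name__ == "__main__":
    print(preferred(Counter({"titlecase": 9, "TitleCase": 2, "titleCase": 2})))
```

Here the caseless spelling is only used when no cased spelling has been seen at all.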

Dec 21 '22 16:12 jsoref

OK, now I understand what you are talking about.

For the same reason I think there is no common rule. And I am not sure whether the different screen readers have a common ruleset for how to divide these hashtags. From my perspective these are edge cases, far less important than the main problem.

Dec 21 '22 18:12 SmallBlueElephant

Basically follow the advice folks have been giving to educate social media managers how to build accessible hashtags for other social media sites:

https://www.abilitynet.org.uk/news-blogs/5-ways-make-your-tweets-accessible
https://aem.cast.org/create/creating-accessible-social-media
https://ecampusontario.pressbooks.pub/accessibledigitalcontenttraining/chapter/accessible-use-of-camelcase-and-structuring-posts/
https://www.torontomu.ca/accessibility/guides-resources/social-media/
https://www.rnib.org.uk/living-with-sight-loss/assistive-aids-and-technology/everyday-tech/navigation-and-communication/guide-to-accessible-social-media/
https://accessibility.princeton.edu/guidelines/social-media
https://averment.medium.com/why-does-writing-your-hashtags-in-camel-case-make-them-more-accessible-and-what-are-the-benefits-9e3b8e13e920

Dec 21 '22 19:12 mgifford

@kerim @veale Actually, it does accept what the user has typed. If I ignore the autosuggestion and keep typing, it leaves my capitalization intact.

Steps:

Type #AltText (capital A, capital T)

Desired result: Autosuggest suggests #AltText (capital A, capital T)
Actual result: Autosuggest suggests #alttext (all lower case)

Type space

Desired and actual results: The text "#AltText" (capital A, capital T) appears in my toot edit field.

That said, it is a distraction.

Dec 31 '22 21:12 ChasBelov

@ChasBelov It accepts it, but the issue is about preferring it, to encourage use.

Jan 03 '23 19:01 veale

This will help everyone quickly decipher a hashtag.

Feb 07 '23 18:02 johnsamuelwrites

Would it be a quick win on this front to have an admin or user option (enabled by default?) to disallow all-lowercase hashtags when posts are submitted, to force use of CamelCase?

Mar 08 '23 08:03 Floppy
