Enable searching of all toots
Pitch
Currently, searching on Mastodon is possible by linking an Elasticsearch instance. When enabled, users are able to search the content of posts they have made or interacted with. Posts the user has no direct connection with are filtered out of the search results.
From other issues discussing it, the potential for searching to be abused is stated as an important reason this should never be implemented. In my view this is disproportionate and likely counter-productive. For example, I am an admin on my server and I am unable to use built-in search functions to identify violating content, so I cannot be proactive in identifying abuse.
As search is backed by Elasticsearch, I'm free to write my own frontend to search content without using the Mastodon interface. This would be easy enough to do, but it fractures the user experience and creates security issues, particularly in exposing search functionality to other users.
Motivation
As much as the potential to proactively identify abuse, the ability to search for content would be a big bonus for any user.
I think the idea that it shouldn't be implemented because abuse might happen is patronising. Search functionality is already optional, and it would be better to fully support it within the application than lose control of user data through external integrations. If a server does experience abuse as a result of allowing search it should be able to deactivate it easily, or restrict the ability of individual user accounts to search content that isn't their own.
From https://mastodon.social/@Gargron/4947733 :
If text search is ever implemented, it should be limited to your home timeline/mentions only. Lack of full-text search on general content is intentional, due to negative social dynamics of it in other networks
Giving an admin the ability to search their own instance for content should be okay even though not explicitly mentioned (user safety etc). It'll probably be better as a plugin/extra app server admins can install alongside the main code which reads the PostgreSQL db and just chucks it into ElasticSearch.
I'd read the quote before, but I don't think we should have to take it as gospel, and I disagree with it on a number of levels. If implementing it is controversial then it should be made opt-in. Trolls are going to find ways to troll; making them easier to identify makes them easier to ban.
Overlaps with #9529.
I agree this should be an option. For example, on https://social.network.europa.eu/ , all posts are public information (they'd even be subject to access requests) and most aren't posted by individuals. The original justification for not allowing search across all posts simply doesn't apply here.
The seemingly intentional limits on searching are currently the most limiting factor for Mastodon adoption and versatility.
I wish to discover whether someone else is having a similar thought, problem or solution, and to share complex ideas, but by design you can't unless you luck out and get noticed in a wall of spam. You may manage it if you can sum the idea up in a single hashtag, but hashtags are VERY limited. A fuzzy or advanced indexed search feature is necessary for connecting people together in a large crowd.
Lacking search encourages:
- Hashtag spamming to be discovered. This makes posts cluttered and more difficult to read.
- Repeatedly tooting the same thing to be discovered, or copy-pasting others' material.
- Joining the wall of unreadable spam that local can be to spot something in a crowd.
- Smaller instances with narrow focus.
- Simpler word usage and basic concepts.
Lacking search discourages:
- Finding similar concepts without spamming (or understanding) hashtags.
- Specific problem solving. Looking up if someone shares your plight.
- Larger instances with wider focus.
- Discovery via searching.
- Looking up ongoing topics, emergencies or events lacking a hashtag.
- Sharing complicated ideas with multiple keywords that only make sense together.
I can see how some are concerned about this feature, fearing abuse and harassment, but I don't think the negatives outweigh the positives. It should at least be an opt-in or opt-out feature, and something each instance could set its default value for. Technical or development focused instances should have this on by default. Try to imagine GitHub or Slack without searching.
Many wish to be discovered through their posts, and being given that choice yourself is empowering. Not having that choice at all is a missed opportunity to find another way to connect.
The statuses are all in ES; we just need to add something like searchable_by: "everyone" to include them in ES results. Why is it not an account preference? For example, bot accounts are likely to want to be searchable by everyone, as are organizational posts.
Also, we need status creation dates in ES, so that some form of ordering can be done with the queries.
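A rough sketch of what such a query might look like, written in Python as a plain Elasticsearch request body. The searchable_by field and the "everyone" value come from the suggestion above; the text and created_at field names and the build_status_query helper are assumptions for illustration, not Mastodon's actual index mapping.

```python
def build_status_query(term, account_id, size=20):
    """Build an Elasticsearch request body for a status search.

    Hypothetical helper: the field names "text", "searchable_by" and
    "created_at" are assumed, not taken from Mastodon's real mapping.
    """
    return {
        "query": {
            "bool": {
                # Full-text match on the status content.
                "must": [{"match": {"text": term}}],
                # Keep statuses the searcher is connected to, plus any
                # status whose author opted in with "everyone".
                "filter": [
                    {"terms": {"searchable_by": [str(account_id), "everyone"]}}
                ],
            }
        },
        # Newest first; this requires the creation date to be indexed.
        "sort": [{"created_at": {"order": "desc"}}],
        "size": size,
    }
```

The existing per-account filtering stays in place; the only change is one extra accepted value in the terms filter, plus a sort that depends on the creation date being stored in the index.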
My thoughts re search scope & user expectations from a UX perspective rather than abuse.
I appreciate search is historically controversial but it doesn't help understanding if people just downvote. If there is a nuanced problem to be solved it needs better definition or it's very hard to solve in a way that works for everyone.
Should be searchable:
- Statuses opted into discovery should be searchable (All public toots)
- Users' own content should be searchable (Already the case)
- Content user has interacted with (Already the case)
- Content user is following should be searchable (even unlisted & only mentioned) where this would have been shown on the home TL
- Currently I can only search DMs I've sent, so I can only search my side of a conversation. This is like having an email client that does not let me search my inbox, only my outbox.
- This is already possible but bad UX via scrolling forever and ctrl+f.
Should not be searchable by a user:
- Statuses a user is not able to see:
  - unlisted and 'followers only' statuses by accounts they don't follow
  - other people's DMs
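The two lists above can be read as a single visibility check. Here is a minimal sketch in Python, assuming a simplified Status record and a can_search helper that are both hypothetical; the visibility names mirror Mastodon's public, unlisted, private (followers only) and direct levels.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    # Simplified, hypothetical status record for illustration only.
    author_id: int
    visibility: str  # "public", "unlisted", "private" or "direct"
    mentioned_ids: set = field(default_factory=set)
    interacted_by: set = field(default_factory=set)


def can_search(viewer_id, following_ids, status):
    """Return True if the viewer should be able to find this status."""
    # Users' own content is always searchable.
    if status.author_id == viewer_id:
        return True
    # DMs: only participants, never other people's DMs.
    if status.visibility == "direct":
        return viewer_id in status.mentioned_ids
    # Public statuses are treated as opted into discovery.
    if status.visibility == "public":
        return True
    # Content the viewer has already interacted with.
    if viewer_id in status.interacted_by:
        return True
    # Unlisted / followers-only: only from accounts the viewer follows,
    # i.e. what would have appeared on the home timeline.
    return status.author_id in following_ids
```

For example, a DM from account 1 that mentions account 2 is searchable by account 2 but not by account 3, even if account 3 follows account 1.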
Another potential compromise to reduce the number of hashtags would be a checkbox when posting that makes the whole post searchable. I would prefer it to be an instance wide setting but this is another option.
This way it would respect the users wishes of being either publicly searchable or not while avoiding hashtag spam that makes posts ugly and difficult to read.
Avoiding this:
#This is an #example #post #about #something that I #wish should be #both #readable, #searchable and #easy to find on #Mastodon.
And instead you get this:
This is an example post about something that I wish should be both readable, searchable and easy to find on Mastodon.
[✅️] Make everything in toot searchable
I would really love to see this in Mastodon. My instance is primarily focussed on an online community with a lot of art, and searching would make it so much easier to find related posts or subjects.
People are not used to hashtags anymore, and likely do not want to start using them again either. I understand this is potentially abusable and allows for finding "bad" posts, but it should be an opt-in if you ask me.
I think an appropriate extension of the ActivityStreams specification (e.g. introducing something like X-Robots-Tag with enums such as all and noindex, though not limited to these examples), combined with limiting the search to public toots, would be an ethically reasonable implementation of this.
If the user explicitly states "please, index my toot" using the UI (that's what we currently have, in the form of meta header injection), I don't think there is any reason to forbid indexing and exposing those toots to the public.
Related issue: #4640
Also, as @smiba says, nowadays people in general do not like hashtagging their whole post, and most people are terrible at tagging their posts with appropriate hashtags (which greatly reduces their account's discoverability).
Per @unknownconstant
- Users' own content should be searchable (Already the case)
I came here to ask how to search my own content (or add a feature to do that). So if this is possible, how exactly do you do that?
Additionally, https://github.com/w3c/activitystreams/issues/426 suggests using a custom extension of ActivityStreams for flagging toots to be indexed / not indexed (or someone interested in this topic could get it into the W3C spec first).
I think an appropriate extension of the ActivityStreams specification (e.g. introducing something like X-Robots-Tag with enums such as all and noindex, though not limited to these examples), combined with limiting the search to public toots, would be an ethically reasonable implementation of this.
I would love to see this; it came up during work on extended search (VyrCossont/mastodon#2 and VyrCossont/mastodon#5) that discoverable federates but noindex does not. Federating noindex and surfacing it in Mastodon client APIs would help propagate user intent to both in-Mastodon and external search. Standardizing that might be a bit of a pain given the "we're not doing it" attitude in w3c/activitypub#221, but that was years ago, and there's a lot more demand for search now.
it would be possible to consider toot:discoverable as the official signal that the user has opted into discovery features. this isn't per-post, but you can maybe consider deferring to attributedTo.discoverable if discoverable isn't set directly? fwiw part of #18212 is setting discoverable on posts as well. (and also #12178)
really the only contentious part of that is that it is unclear what the scope of such a "discovery" preference includes. historically, it referred to the profile directory. later, it was expanded to trends. i propose extending it to search, as well as public timelines (as i see the public timelines as yet another "discovery" mechanism). in the longer term, we might develop a more granular framework than just a single all-encompassing "discovery" preference, but this requires clearly defining the expected policies and how to express them. i've seen some suggestions of using ODRL for this, which i'm unsure about but it's probably worth mentioning its existence
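As a sketch of the fallback described above (use discoverable on the post if it is set, otherwise defer to the author's flag), here is a small Python example. The effective_discoverable helper, the simplified dict shapes, and the choice to treat a missing flag as "not opted in" are all assumptions; real ActivityStreams objects can carry attributedTo as an embedded object or a list, which this ignores.

```python
def effective_discoverable(post, actors):
    """Resolve whether a post should be treated as opted into discovery.

    post:   dict for the post object (simplified, hypothetical shape)
    actors: mapping of actor id -> actor dict, standing in for a lookup
    """
    # A flag set directly on the post wins.
    if "discoverable" in post:
        return bool(post["discoverable"])
    # Otherwise defer to the author's account-level preference.
    # Assumes attributedTo is a single actor id string.
    actor = actors.get(post.get("attributedTo"), {})
    # Assumption: no flag anywhere means no opt-in.
    return bool(actor.get("discoverable", False))


actors = {"https://example.social/users/alice": {"discoverable": True}}
post = {"attributedTo": "https://example.social/users/alice"}
print(effective_discoverable(post, actors))
```

A post that sets discoverable to false would stay out of discovery even if its author is discoverable, which is what makes a per-post override useful.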
@trwnh that's what I ended up doing for VyrCossont/mastodon#5 because it seemed like the current best option for sites that don't want to make every public post searchable. Thanks for the tip on a potential future extension to the post level.
I was surprised to see that Mastodon's AP extension docs describe discoverable as intended to cover search engines as well:
Mastodon allows users to opt-in or opt-out of discoverability features like the profile directory. This flag may also be used as an indicator of the user’s preferences toward being included in external discovery services, such as search engines or other indexing tools. If you are implementing such a tool, it is recommended that you respect this property if it is present.
But maybe this is really a docs/help text bug? As a new user, I would probably assume that the checkbox that activates noindex was going to opt me out of search, not specifically only Google-like web search crawlers, and that the one that activates discoverable opts me into the profile directory only, but since this isn't actually the case, perhaps the way to go is to update the help text to reflect its intended usage.
ODRL seems like colossal overkill, but I hadn't heard of it before today and appreciate the reference.
But maybe this is really a docs/help text bug?
not really a "bug", but more a disconnect between the property definition and the UI copy. on a federation level, the simplest definition of discoverable is "a preference to be discovered or included in discovery features". this MAY include search engines or indexers -- the exact definition of "discoverability features" or "external discovery services" is not specified. but as far as mastodon is concerned, it only uses this preference in limited ways. mastodon's "discovery features" have slowly expanded over time, and the application of this preference has been somewhat inconsistent. the UI copy focuses more on what mastodon does with the preference, and less on what the preference actually is outside of the context of any features. consequently, the checkbox gets presented as "suggest account to others" and previously as "include in the profile directory", both of which are indirections on what actually gets federated out ("this account wants to be discoverable"). i know @ClearlyClaire has expressed concerns about the unclear messaging regarding this preference, and on the challenges of (re-)obtaining consent in a federated setting.
As a new user, I would probably assume that the checkbox that activates noindex was going to opt me out of search, not specifically only Google-like web search crawlers
this is ambiguous due to orthogonal concerns. in UI copy it is presented as "opt out of search engine indexing", but in the backend all it does is set a robots meta-tag when the web frontend renders your profile/post permalinks. i'm not sure why it's a separate checkbox. the only reason it got added to the API at all was because of the 4.0 switch to having the webapp render permalinks (deprecating the separate HAML-controlled static HTML permalinks). afaik, it was never intended as a general preference to control all kinds of search; it was only ever for permalinks. for the activitystreams documents, indexing is generally not applicable. or, well, you could say that because of the way federation currently works, remote user-agents are almost necessarily "archiving" the resource and also "indexing" it in their databases. if you took a literal definition of "noindex" then you would not be able to lookup a post on any remote instance -- your profile would not show up and neither would your posts. this is why i think it is better to actually clearly define expectations and policies and how they are expressed. (yes, ODRL does seem like colossal overkill, but it is at least an existing standard i suppose. whether it is applicable to this issue is another matter entirely. a preliminary look at the ODRL "actions" list would suggest that perhaps we could refer to aggregate, archive, distribute, and index actions)
really the only contentious part of that is that it is unclear what the scope of such a "discovery" preference includes. historically, it referred to the profile directory. later, it was expanded to trends. i propose extending it to search, as well as public timelines (as i see the public timelines as yet another "discovery" mechanism)
That is one of the main issues (and I've seen people say they would want some of these settings not to be correlated), but another issue is how unreliable any update in the account-wide setting would be. Per-post Delete/Update propagation is already not perfect, but it is orders of magnitude more likely for a server to have fetched a post at some point and not be aware of any change in the associated account's flags.
Per-post Delete/Update propagation is already not perfect, but it is orders of magnitude more likely for a server to have fetched a post at some point and not be aware of any change in the associated account's flags.
How about keeping it per-post, but inheriting from the account-level flag at the time the post is written? The per-post flag would be permanent, but would be consistent with post visibility.
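A minimal Python sketch of that idea, assuming hypothetical Account and Post types and a searchable flag name: the post copies the account's setting once, at creation, so a later change to the account does not alter posts that other servers may already have fetched.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical account-level preference.
    searchable: bool = False


@dataclass(frozen=True)
class Post:
    # frozen=True: the per-post flag is permanent once written.
    text: str
    searchable: bool


def write_post(account, text, searchable=None):
    """Create a post, inheriting the account flag unless overridden."""
    flag = account.searchable if searchable is None else searchable
    return Post(text=text, searchable=flag)


account = Account(searchable=True)
first = write_post(account, "hello")
account.searchable = False           # the user later opts out
second = write_post(account, "world")
print(first.searchable, second.searchable)
```

Only posts written after the change pick up the new value, so nothing needs to propagate retroactively.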
That is one of the reasons for my proposal in #23808
Question: are search operators going to be added in this update? If so then awesome, and https://github.com/mastodon/mastodon/issues/6287 and https://github.com/mastodon/mastodon/issues/21778 can be closed.
The new full-text search allows searching all content from accounts that have opted into being indexable (and are known to your instance).