Data Explorer: Search across all columns for keywords
Discussed in https://github.com/posit-dev/positron-beta/discussions/203
Originally posted by mikemahoney218 May 29, 2024
We've got a number of workflows that involve telling users to open a data frame in RStudio's View()er and searching for keywords in the data frame:
I don't see an obvious way to do this in Positron. Is there one?
Also requested by another beta user: https://github.com/posit-dev/positron-beta/discussions/203#discussioncomment-9596158
Probably a better example from the discussion:
There's nothing in the existing protocol that would let us do this since filtering is currently a per-column construct, but the user is looking for something that just lets them search across all the columns at once, like RStudio has.
I agree this is a nice "quick filter", but quite different from our existing implementation.
Taking some notes for the future on this.
In RStudio this is:
- Global search
- Immediately applies a filter operation
This would be straightforward to implement at the data explorer protocol level, but there are a couple of challenges:
- Since the search is text-based, columns that aren't already strings have to be converted to strings first, which increases the cost
- For very large data frames, globally searching every column in the table could be arbitrarily expensive
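As a rough illustration of the operation being discussed, here is a minimal pandas sketch of a global search. The function name `global_search_mask` and the sample data are made up for this example, and this is not how any Positron backend actually implements filtering.

```python
import pandas as pd


def global_search_mask(df: pd.DataFrame, term: str) -> pd.Series:
    """Return a boolean mask of rows where any column contains `term`.

    Hypothetical helper for illustration only.
    """
    mask = pd.Series(False, index=df.index)
    for _, col in df.items():
        # Non-string columns are converted to text before matching.
        # Note: astype(str) renders missing values as text (e.g. "nan"),
        # so a real implementation would want to handle nulls separately.
        as_text = col.astype(str)
        # Case-insensitive literal substring match (no regex).
        mask |= as_text.str.contains(term, case=False, regex=False)
    return mask


df = pd.DataFrame(
    {"id": [144, 205, 310], "city": ["Boston", "Route 44", "Denver"]}
)
print(df[global_search_mask(df, "44")])
```

The per-column `astype(str)` conversion is where the extra cost for non-string columns comes from, and the loop touches every cell in the table.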
I think probably the way to deal with this would be to have a cap on the number of "cells" that this feature is willing to search. Consider searching a column of numbers for values that have the pattern 44 (on my x86 laptop):
```
In [16]: s = pd.Series(np.random.randn(1000000))
In [17]: %time s.map(str).str.contains('44')
CPU times: user 428 ms, sys: 29.9 ms, total: 458 ms
Wall time: 458 ms
```
So searching on the order of 1 million "cells" would cost roughly half a second, and more on slower computers. If a dataset has more than 1M cells (or whatever the upper limit is), we could do one of the following:
- Provide a UI option for the user to opt in to the exhaustive search, and error / refuse to search when the dataset is large and this option is not selected. Opting in is useful when an exhaustive search is definitely what you want, but as the default it could leave the UI frozen pending an expensive global search
- For large datasets, only display matches in the first X rows of the dataset (where X is the overall cell limit L divided by the number of columns)
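The two options above could be sketched roughly as follows. The names `capped_search`, `max_cells`, and `exhaustive` are assumptions made for this example, not existing protocol fields. The row limit is the cell cap divided by the number of columns, and the function reports whether the search was truncated.

```python
import numpy as np
import pandas as pd


def capped_search(df, term, max_cells=1_000_000, exhaustive=False):
    """Search at most `max_cells` cells unless `exhaustive` is set.

    Returns (matching row positions, whether the search was truncated).
    Hypothetical helper for illustration only.
    """
    n_rows, n_cols = df.shape
    if exhaustive or n_rows * n_cols <= max_cells:
        row_limit = n_rows
    else:
        # X = L / number of columns: how many full rows fit under the cap.
        row_limit = max_cells // n_cols

    head = df.iloc[:row_limit]
    hits = pd.Series(False, index=head.index)
    for _, col in head.items():
        hits |= col.astype(str).str.contains(term, case=False, regex=False)

    positions = np.flatnonzero(hits.to_numpy(dtype=bool)).tolist()
    truncated = row_limit < n_rows
    return positions, truncated


df = pd.DataFrame({"a": [1, 44, 3, 44], "b": ["x", "y", "z44", "w"]})

# With a cap of 4 cells and 2 columns, only the first 2 rows are searched.
print(capped_search(df, "44", max_cells=4))
# Opting in to the exhaustive search covers every row.
print(capped_search(df, "44", max_cells=4, exhaustive=True))
```

Returning the truncation flag alongside the matches gives the UI enough information to show a warning or offer the exhaustive option.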
Requested again on the discussion forum here: https://github.com/posit-dev/positron/discussions/4661
This shouldn't be too difficult to implement in the backend, so whenever there is enough bandwidth to add the frontend UX for this, we could definitely make it happen.
Also requested in: https://github.com/posit-dev/positron/discussions/4746
Hello!
I am really enjoying the enhancements and contributions to the data explorer experience in Positron. I am curious whether there are any plans to include this "quick filter" across columns before the first official Positron release. Thanks for a great product!
We are excited to invest more in the Data Explorer @anbrav0 🎉 but this particular type of feature is likely going to need to wait until later in the year, after the initial non-beta release.
@jthomasmock I took a closer look at this today to come up with an implementation plan with reasonable time investment.
- A search box can be added to the upper right of the data explorer like in RStudio
- "Global search term" can be implemented in backends as a new type of row filter. The search can apply to ALL columns or to a SUBSET of columns if a column filter is active (there is no need to search columns that are not visible to the user). The filter only determines which rows are selected. The idea is that when the UI assembles the row filters to send to the backend, it would combine the column-specific data filters with the global search filter (attaching the currently visible column indices, if relevant) and send everything to the backend to determine the visible rows.
- Once the global search filter has been applied, the front end can match search terms (case insensitive) in JavaScript and apply CSS highlighting like in RStudio to the matching data cells.
Implementing the backend pieces would be fairly straightforward (Claude Code would be happy to oblige). There is one quirk in that row filters expect there to be an associated column, but we could make the column parameter optional since the global search applies to multiple columns.
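To make the plan above a bit more concrete, here is a minimal sketch of a row filter with an optional column, combined with a column-specific filter. The `RowFilter` shape, the `"equals"` and `"global_search"` type names, and the `column_indices` parameter are all assumptions made for this example and do not reflect the actual data explorer protocol definitions.

```python
from dataclasses import dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class RowFilter:
    # Hypothetical shape, not the real protocol type.
    filter_type: str                     # "equals" or "global_search" here
    column_index: Optional[int] = None   # optional: a global search spans columns
    params: dict = field(default_factory=dict)


def apply_filters(df: pd.DataFrame, filters: list) -> pd.Series:
    """AND together all row filters and return the visible-row mask."""
    mask = pd.Series(True, index=df.index)
    for f in filters:
        if f.filter_type == "global_search":
            # Search only the visible columns if the UI attached them.
            indices = f.params.get("column_indices")
            if indices is None:
                indices = range(df.shape[1])
            hit = pd.Series(False, index=df.index)
            for i in indices:
                text = df.iloc[:, i].astype(str)
                hit |= text.str.contains(f.params["term"], case=False, regex=False)
            mask &= hit
        elif f.filter_type == "equals":
            mask &= df.iloc[:, f.column_index] == f.params["value"]
        else:
            raise ValueError(f"unsupported filter type: {f.filter_type}")
    return mask


df = pd.DataFrame(
    {
        "species": ["cat", "dog", "cat"],
        "name": ["Tom", "Catherine", "Rex"],
        "note": ["x", "y", "cat toy"],
    }
)
filters = [
    RowFilter("equals", column_index=0, params={"value": "cat"}),
    RowFilter("global_search", params={"term": "cat", "column_indices": [1, 2]}),
]
print(df[apply_filters(df, filters)])
```

In this sketch the global search filter simply leaves `column_index` unset, which mirrors the idea of making the column parameter optional rather than introducing a separate request type.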
Since there was some discussion above about performance concerns, I would suggest that we implement a max_data_cells cap at some arbitrary value (e.g. 5M or 10M data cells) and by default do not search more than that, to avoid the data explorer freezing for long periods of time on very large datasets. The backend would be able to indicate via the BackendState whether the search was truncated by this parameter. We would need some kind of UI treatment to opt into an exhaustive search on the dataset (which could take arbitrarily long for very large data frames).
In terms of time estimates to develop this:
- Backend portion: add new row filter type, ~1 day with help of coding agent (Claude Code)
- Frontend portion: add a search box with a UI indicator that displays a warning tooltip if the search was truncated because the dataset was too large, plus CSS highlighting of matching data cells. I am not the best person to judge, but given the work that's already been done to add the search box for column filtering, perhaps this could be achieved in ~1 week or less?