FAQ: How much data can OpenRefine deal with?
I'm always frustrated when OpenRefine cannot deal with my data.
Proposed solution
Based on the thread "Measuring scale/limits of OpenRefine" (1853 views) at https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ?pli=1, add a FAQ item that answers where OpenRefine's limits come from, what sacrifices are required to lift them, what efforts are being made in this direction, and how to track, measure, and join these efforts.
If the full answer turns out to be too big for a FAQ, maybe it is possible to create a blog post that sums up how the CZI grant was spent with regard to the scaling objective, so that other funders can decide whether they want to join and support the initiative. It could then explain how https://github.com/OpenRefine/OpenRefine/projects?type=classic relates to the goal (how much measurable improvement is expected) and how 4.0 changes the situation (which issues are most critical).
It would also be interesting to read about the technical details: how OpenRefine runs out of memory, the fact that available memory is consumed not only by OpenRefine itself but also by the browser, and whether OpenRefine can detect when memory goes into swap and productivity drops. It is especially important to see the list of features that need to be ported to support the new dataflow model: which features would need to be sacrificed, which will gain speed, which will lift memory limits, and which will suffer.
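On the memory question specifically: OpenRefine runs on the JVM, so its hard limit is the configured heap size (the `-Xmx` value, exposed through the memory setting in `refine.ini`) rather than the machine's total RAM, and the browser tab rendering the grid consumes memory on top of that. The JVM cannot observe OS-level swapping directly, but it can observe its own heap headroom. A minimal sketch of what such monitoring could look like, using the standard `java.lang.management` API; this is an illustration, not OpenRefine's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapHeadroom {
    public static void main(String[] args) {
        // Heap usage as seen by the JVM; the ceiling is the configured heap size,
        // not total system RAM, and it says nothing about OS swap activity.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        long usedMb = heap.getUsed() / (1024 * 1024);
        long maxMb = heap.getMax() / (1024 * 1024);
        double usedRatio = (double) heap.getUsed() / heap.getMax();

        System.out.printf("Heap: %d MB used of %d MB max (%.0f%%)%n",
                usedMb, maxMb, usedRatio * 100);

        // Hypothetical warning threshold: once the heap is nearly full, the JVM
        // spends more and more time in garbage collection and productivity drops.
        if (usedRatio > 0.9) {
            System.out.println("Warning: less than 10% heap headroom left.");
        }
    }
}
```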
Alternatives considered
I would be glad to learn about any.
Additional context
We were discussing the absence of convenient interfaces for data wrangling and data preparation tools for big data, and whether it is easier to rewrite something from scratch than to try to enhance OpenRefine.
Thanks for voicing this frustration! Your suggestion of a blog post is a good one. Working on releasing 4.0, documenting and optimizing its performance is something I have been wanting to do for a long time. I have been unfortunately held back by other tasks. I hope to free myself from some of those to make space for that important work this autumn and winter.
Important issue with useful links to existing info and current efforts. I'd also add the Refine on Spark issues.
We were discussing the absence of convenient interfaces for data wrangling and data preparation tools for big data
@abitrolly Who is the "we" in this context?
I'd turn the question around and ask what your needs are. Benchmarks are always built with certain sets of assumptions, and it's much better if those assumptions match customers' needs. Having a defined benchmark (or set of benchmarks) makes performance evaluation tractable, rather than attempting to fully explore an N-dimensional space, which is, of course, impossible. Given a set of requirements, one could evaluate whether the existing OpenRefine, some future OpenRefine, or neither would meet your needs. Wide vs. tall tables, mixes of data types, and mixes of operations used all influence the performance characteristics of the system. Also, a new backend won't address performance limitations in other areas such as the web grid display.
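To make the "wide vs. tall" point concrete, a defined benchmark usually starts from reproducible synthetic datasets whose shape is parameterized. A rough sketch of such a generator; the row/column counts, file names, and value mix here are arbitrary assumptions, not an agreed-upon OpenRefine benchmark:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class SyntheticCsv {
    // Writes a CSV with the given shape so import/facet/transform timings can be
    // compared across "wide" (many columns) and "tall" (many rows) datasets.
    static void write(String path, int rows, int cols, long seed) throws IOException {
        Random random = new Random(seed);
        try (PrintWriter out = new PrintWriter(path)) {
            for (int c = 0; c < cols; c++) {
                out.print((c == 0 ? "" : ",") + "col_" + c);
            }
            out.println();
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    // Mix of data types: low-cardinality string, integer, float.
                    String value;
                    if (c % 3 == 0) {
                        value = "category_" + random.nextInt(20);
                    } else if (c % 3 == 1) {
                        value = Integer.toString(random.nextInt(1_000_000));
                    } else {
                        value = Double.toString(random.nextDouble());
                    }
                    out.print((c == 0 ? "" : ",") + value);
                }
                out.println();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        write("tall.csv", 5_000_000, 10, 42); // many rows, few columns
        write("wide.csv", 50_000, 1_000, 42); // few rows, many columns
    }
}
```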
@tfmorris unfortunately, I can't find the context for "we". As I don't have a job or any projects generating income, it could be anything from Observable forums to Open Data Telegram channels, Google open source groups, or a random Medium article discussing data wrangling.
I would say the problem with tidying data was relevant for Observable, as it is quite hard to do facets in plain JavaScript, but that is not really about big data, and things are gradually improving there with DuckDB integration and user-developed interfaces. So I can't give you a specific user story other than "I've heard that OpenRefine is not for tidying big data".
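For context, a text facet is essentially a count of each distinct value in a column; the computation itself is simple, and the hard part is doing it interactively over large data. A minimal sketch of that computation, written in Java (OpenRefine's implementation language) purely for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TextFacetSketch {
    // Count how many rows carry each distinct value in one column.
    static Map<String, Long> facet(List<String> column) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : column) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> column = List.of("apple", "pear", "apple", "apple", "plum");
        // Prints {apple=3, pear=1, plum=1}
        System.out.println(facet(column));
    }
}
```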