Document which operations need a lot of RAM in the new architecture
In the new architecture (4.0 branch), some operations and importers have been optimized so that they do not require a lot of RAM even when the dataset is large. However, as a user, it is not necessarily clear which operations can safely be used in pipelines meant to run on large datasets. In some cases the user might be able to carry out a transformation in another way (or even do it outside OpenRefine), but for that to be possible they must be able to tell which operations should be avoided.
This applies not only to operations, but also to importers and exporters.
A given operation / importer / exporter can be efficient with some settings and inefficient with others. For instance, the CSV importer is efficient if the multiLine option is set to true, but inefficient if multiLine is set to false and escaping is enabled.
Proposed solution
- Document the scalability of each of these components in the official manual
- Consider adding warnings to the UI before triggering an operation which might require a lot of RAM (perhaps not so easy?); see the sketch after this list for what the underlying metadata could look like
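As a rough illustration of that idea (this is not existing OpenRefine code, and all names below are invented), the scalability information could be declared once per importer or operation and then reused both to generate the manual and to decide when the UI should show a warning. Only the multiLine/escaping case comes from the CSV example above.

```typescript
// Hypothetical sketch: per-importer metadata describing when a configuration
// is memory-hungry, usable both for documentation and for UI warnings.
// None of these names exist in OpenRefine today.

interface ImporterOptions {
  multiLine?: boolean;
  escaping?: boolean;
}

interface ScalabilityRule {
  // Human-readable note for the manual / UI footnote.
  description: string;
  // Returns true when the chosen options may require holding the dataset in RAM.
  isMemoryIntensive: (options: ImporterOptions) => boolean;
}

// Example rule mirroring the CSV case described above.
const csvScalability: ScalabilityRule = {
  description:
    "Parsing with multiLine disabled and escaping enabled cannot be streamed efficiently.",
  isMemoryIntensive: (options) =>
    options.multiLine === false && options.escaping === true,
};

// The UI could call this before starting an import to decide whether to warn.
function shouldWarn(rule: ScalabilityRule, options: ImporterOptions): boolean {
  return rule.isMemoryIntensive(options);
}
```

Keeping the documentation text and the warning condition in the same place would also help the manual and the UI stay in sync.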
Alternatives considered
Not sure?
The UI warning might be easy enough depending on how we choose to display things.
When trying a few options in the CSV, line-based, and fixed-width importers, I could definitely see the effect of some of the options I chose.
Perhaps we could add a footnote element? Like a "(!)" next to any option that might have an impact, and then towards the bottom a small description saying "(!) This option might require more RAM. You might consider handling the file outside OpenRefine or visiting our Wiki page here for more tips"?
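To make the footnote idea a bit more concrete, here is a rough sketch of how the "(!)" marker and the explanatory note could be added next to an option in the import dialog. This is purely illustrative; the current front-end is jQuery-based, and the class names and wording here are invented.

```typescript
// Hypothetical sketch of the "(!)" footnote idea for the import options panel.
// Element names and wording are invented for illustration only.

function markMemoryIntensiveOption(
  optionLabel: HTMLElement,
  footnoteContainer: HTMLElement
): void {
  // Add the "(!)" marker next to the option's label.
  const marker = document.createElement("span");
  marker.textContent = " (!)";
  marker.title = "This option might require more RAM.";
  optionLabel.appendChild(marker);

  // Add the explanatory footnote once, at the bottom of the panel.
  if (!footnoteContainer.querySelector(".ram-footnote")) {
    const note = document.createElement("p");
    note.className = "ram-footnote";
    note.textContent =
      "(!) This option might require more RAM. You might consider handling the file outside OpenRefine or visiting our Wiki page for more tips.";
    footnoteContainer.appendChild(note);
  }
}
```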