Document which operations need a lot of RAM in the new architecture
In the new architecture (4.0 branch), some operations and importers have been optimized so that they do not require a lot of RAM even when the dataset is large. However, as a user, it is not necessarily clear which operations can safely be used in pipelines meant to run on large datasets. In some cases the user might be able to carry out a transformation in another way (or even do it outside OpenRefine), but for that to be possible they must be able to tell which operations should be avoided.
This applies not only to operations, but also to importers and exporters.
A given operation / importer / exporter can be efficient with some settings and inefficient with others. For instance, the CSV importer is efficient if the multiLine option is set to true, but inefficient if multiLine is set to false and escaping is enabled.
Proposed solution
- Document the scalability of each of these components in the official manual
- Consider adding warnings to the UI before triggering an operation which might require a lot of RAM (perhaps not so easy?); see the sketch after this list for what the underlying metadata could look like
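As a rough illustration of that idea (this is not existing OpenRefine code, and all names below are invented), the scalability information could be declared once per importer or operation and then reused both to generate the manual and to decide when the UI should show a warning. Only the multiLine/escaping case comes from the CSV example above.

```typescript
// Hypothetical sketch: per-importer metadata describing when a configuration
// is memory-hungry, usable both for documentation and for UI warnings.
// None of these names exist in OpenRefine today.

interface ImporterOptions {
  multiLine?: boolean;
  escaping?: boolean;
}

interface ScalabilityRule {
  // Human-readable note for the manual / UI footnote.
  description: string;
  // Returns true when the chosen options may require holding the dataset in RAM.
  isMemoryIntensive: (options: ImporterOptions) => boolean;
}

// Example rule mirroring the CSV case described above.
const csvScalability: ScalabilityRule = {
  description:
    "Parsing with multiLine disabled and escaping enabled cannot be streamed efficiently.",
  isMemoryIntensive: (options) =>
    options.multiLine === false && options.escaping === true,
};

// The UI could call this before starting an import to decide whether to warn.
function shouldWarn(rule: ScalabilityRule, options: ImporterOptions): boolean {
  return rule.isMemoryIntensive(options);
}
```

Keeping the documentation text and the warning condition in the same place would also help the manual and the UI stay in sync.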
Alternatives considered
Not sure?
The UI warning might be easy enough depending on how we choose to display things.
When trying a few options in the CSV, line-based, and fixed-width importers, I could definitely see the effect of some of the options I chose.
Perhaps we could add a footnote element? Like a "(!)" next to any option that might have an impact, and then towards the bottom a small description saying "(!) This option might require more RAM. You might consider handling the file outside OpenRefine or visiting our Wiki page here for more tips"?
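To make the footnote idea a bit more concrete, here is a rough sketch of how the "(!)" marker and the explanatory note could be added next to an option in the import dialog. This is purely illustrative; the current front-end is jQuery-based, and the class names and wording here are invented.

```typescript
// Hypothetical sketch of the "(!)" footnote idea for the import options panel.
// Element names and wording are invented for illustration only.

function markMemoryIntensiveOption(
  optionLabel: HTMLElement,
  footnoteContainer: HTMLElement
): void {
  // Add the "(!)" marker next to the option's label.
  const marker = document.createElement("span");
  marker.textContent = " (!)";
  marker.title = "This option might require more RAM.";
  optionLabel.appendChild(marker);

  // Add the explanatory footnote once, at the bottom of the panel.
  if (!footnoteContainer.querySelector(".ram-footnote")) {
    const note = document.createElement("p");
    note.className = "ram-footnote";
    note.textContent =
      "(!) This option might require more RAM. You might consider handling the file outside OpenRefine or visiting our Wiki page for more tips.";
    footnoteContainer.appendChild(note);
  }
}
```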