Table lookup: Support fully loading table into memory/cache

Open kaspersorensen opened this issue 8 years ago • 6 comments

In the Table lookup there is a good feature for enabling caching of the lookup results. This works well if you're anticipating that the same foreign key lookups are going to happen again and again. But it also seems to me that you would sometimes know in advance that the foreign table is quite small and could easily fit into memory (this will usually be possible for up to some thousands of records, if not more).

Loading the full table into memory would have quite a lot of benefits in those scenarios. The lookups could essentially be HashMap-like fetches and the foreign database would only be queried once. As such it may even be a usable workaround for cases where the connection is a scarce resource.
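
To make the idea concrete, here is a rough sketch of what "HashMap-like fetches" could look like. It is not DataCleaner's actual Table lookup code: the class name `PreloadedLookup`, its constructor, and the row layout (key columns first, output columns after) are all assumptions made for illustration, and `rows` stands in for the result of a single full-table query.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of DataCleaner: holds a whole lookup table in memory.
final class PreloadedLookup {
    // Key values (as a list, so equals/hashCode compare element by element) -> output values.
    private final Map<List<Object>, Object[]> rowsByKey = new HashMap<>();

    // 'rows' stands in for the result of one full-table query.
    // Assumed layout: the first keyColumnCount values are the key, the rest are outputs.
    PreloadedLookup(Iterable<Object[]> rows, int keyColumnCount) {
        for (Object[] row : rows) {
            List<Object> key = Arrays.asList(Arrays.copyOfRange(row, 0, keyColumnCount));
            Object[] outputs = Arrays.copyOfRange(row, keyColumnCount, row.length);
            // Keep the first row seen for each key; later duplicates are ignored.
            rowsByKey.putIfAbsent(key, outputs);
        }
    }

    // Returns the output values for the given key, or null when there is no match.
    Object[] lookup(Object... keyValues) {
        return rowsByKey.get(Arrays.asList(keyValues));
    }
}
```

After construction every lookup is a plain in-memory map fetch, so the database connection is only needed while the table is being read.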

kaspersorensen avatar Jun 01 '16 04:06 kaspersorensen

This is a great idea! We will need a limit on the amount of data to make sure we don't send the program tumbling. Could be a configurable option.

I'm not actually sure what would be sensible, maybe 100000/selected columns? It's not ideal, since a column is not just a column, but otherwise we need to look at the actual contents.

LosD avatar Jun 01 '16 06:06 LosD

I actually wouldn't add a limit, unless it's because you want it to automatically pick a caching strategy based on it. But if you ask the user to make the choice then I think you shouldn't block it. After all, some people really believe in buying more and more and more memory to make stuff faster. This could be a feature that leveraged a lot of memory for huge lookup jobs (although not the primary use-case IMO).

kaspersorensen avatar Jun 01 '16 16:06 kaspersorensen

I'd ALWAYS add a (configurable) limit that at least throws a warning first. We don't want users to kill an application by accident simply by enabling something. Those who need it will be capable of changing the default; those who don't will not realize that they were the ones who broke it, but will (rightly) blame DC.
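
Something along these lines is what I have in mind, as a sketch only. `BoundedPreload`, `loadUpTo` and `DEFAULT_MAX_ROWS` are made-up names, not existing DataCleaner API, and the 100000 default is just the placeholder figure mentioned earlier in the thread. When the limit is hit, it warns and returns an empty `Optional` so the caller can fall back to ordinary per-key lookups.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not part of DataCleaner: preload rows, but give up past a row limit.
final class BoundedPreload {
    // Placeholder default; a real setting would be user-configurable.
    static final int DEFAULT_MAX_ROWS = 100_000;

    // Reads rows from 'source' into memory. If the source holds more than maxRows rows,
    // prints a warning and returns Optional.empty() instead of continuing to load.
    static Optional<List<Object[]>> loadUpTo(Iterator<Object[]> source, int maxRows) {
        List<Object[]> loaded = new ArrayList<>();
        while (source.hasNext()) {
            if (loaded.size() >= maxRows) {
                System.err.println("WARN: lookup table exceeds " + maxRows
                        + " rows; not loading it fully into memory");
                return Optional.empty();
            }
            loaded.add(source.next());
        }
        return Optional.of(loaded);
    }
}
```

Users who really want to spend the memory can raise the limit, and everyone else gets a warning instead of an out-of-memory crash.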

LosD avatar Jun 01 '16 17:06 LosD

Up to you and fine by me. I only meant to say that it wasn't important to me. I think there are plenty of ways you can make DC go out of memory already, and there will always be that in any ETL-like tool I think :)

kaspersorensen avatar Jun 01 '16 23:06 kaspersorensen

Yeah, it's not exactly hard, but let's not make it easier :-)

LosD avatar Jun 02 '16 04:06 LosD

Whoops, closed wrong issue.

LosD avatar Jul 06 '16 06:07 LosD