DataCleaner
Table lookup: Support fully loading table into memory/cache
The Table lookup has a good feature for caching lookup results. This works well if you anticipate that the same foreign key lookups will happen again and again. But it seems to me that you would sometimes also know that the foreign table is quite small and could easily fit into memory (usually feasible for at least a few thousand records, if not more).
Loading the full table into memory would have quite a lot of benefits in those scenarios. The lookups could essentially be HashMap-like fetches and the foreign database would only be queried once. As such it may even be a usable workaround for cases where the connection is a scarce resource.
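To make the idea concrete, here is a minimal sketch of what a preloaded lookup could look like. It is only an illustration under assumptions: `PreloadedLookup` is a hypothetical name, not an existing DataCleaner class, and the rows are assumed to arrive as `Object[]` arrays with the key in the first position (in practice they would come from one full-table query).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of DataCleaner's API.
final class PreloadedLookup {

    // Maps each lookup key to the remaining column values of its row.
    private final Map<Object, Object[]> rowsByKey = new HashMap<>();

    // Reads every row exactly once. Assumption: row[0] is the lookup key
    // and the rest of the array holds the output columns.
    PreloadedLookup(Iterable<Object[]> rows) {
        for (Object[] row : rows) {
            Object key = row[0];
            Object[] values = Arrays.copyOfRange(row, 1, row.length);
            // On duplicate keys, keep the first row that was seen.
            rowsByKey.putIfAbsent(key, values);
        }
    }

    // Returns the output columns for the key, or null if there is no match.
    Object[] lookup(Object key) {
        return rowsByKey.get(key);
    }

    // Number of distinct keys held in memory.
    int size() {
        return rowsByKey.size();
    }
}
```

After construction, every `lookup` call is a plain in-memory map fetch and the source connection is no longer needed.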
This is a great idea! We will need a limit on the amount of data to make sure we don't send the program tumbling. Could be a configurable option.
I'm not actually sure what would be sensible, maybe 100000/selected columns? It's not ideal, since a column is not just a column, but otherwise we need to look at the actual contents.
I actually wouldn't add a limit, unless it's because you want it to automatically pick a caching strategy based on it. But if you ask the user to make the choice then I think you shouldn't block it. After all, some people really believe in buying more and more and more memory to make stuff faster. This could be a feature that leveraged a lot of memory for huge lookup jobs (although not the primary use-case IMO).
I'd ALWAYS add a (configurable) limit that at least throws a warning first. We don't want users to kill an application by accident simply by enabling something. Those who need it will be capable of changing the default; those who don't will not realize that they were the ones who broke it, but will (rightly) blame DC.
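A rough sketch of how a configurable limit with a warning could work. Everything here is an assumption for illustration: `LimitedPreload`, `warnAt` and `maxRows` are hypothetical names, not existing DataCleaner options, and the warning is passed to a callback rather than to any particular logging framework.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; not part of DataCleaner's API.
final class LimitedPreload {

    // Loads rows into memory keyed by row[0].
    // Warns once when the number of rows read reaches warnAt,
    // and aborts as soon as more than maxRows rows have been read.
    static Map<Object, Object[]> load(Iterator<Object[]> rows, int warnAt, int maxRows,
            Consumer<String> warn) {
        Map<Object, Object[]> cache = new HashMap<>();
        int count = 0;
        while (rows.hasNext()) {
            Object[] row = rows.next();
            count++;
            if (count > maxRows) {
                throw new IllegalStateException(
                        "Lookup table exceeds the configured limit of " + maxRows + " rows");
            }
            if (count == warnAt) {
                warn.accept("Lookup table has reached " + warnAt + " rows; preloading may use a lot of memory");
            }
            cache.put(row[0], row);
        }
        return cache;
    }
}
```

Users who really want to throw memory at a huge lookup job could raise `maxRows`, while the default would keep an accidental click from taking the application down.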
Up to you and fine by me. I only meant to say that it wasn't important to me. I think there are plenty of ways you can make DC go out of memory already, and there will always be that in any ETL-like tool I think :)
Yeah, it's not exactly hard, but let's not make it easier :-)
Whoops, closed wrong issue.