Implement large literal/`replace` tables as temporary tables
In #8578 we implemented small "literal" tables for the database backend, as part of implementing `Table.replace` with a `Map` lookup table.
For larger tables, we should upload the lookup data to a temporary table rather than encoding it as an in-query literal.
To avoid creating too many long-lived temporary tables, @radeusgd suggests:
- I expect that when the operation is re-run, the in-memory lookup table will very often be identical between runs. We can exploit that by keeping a cache of already uploaded temporary tables, indexed by a hash of the table's contents (uploading the table requires us to scan its whole contents anyway, so the additional cost of computing the hash is negligible). This way we can avoid uploading a new temporary table on each run if we detect that a 'matching' table was already uploaded before; see the cache sketch after this list.
- We can exploit the Managed_Resource framework to clean up the tables once the references to them are GCed. We already implement a very similar feature, Hidden_Table_Registry: it lets us re-use temporary hidden tables for dry-run operations and clean them up once they are no longer needed. We could extend this registry to also support temporary tables that are not used for dry runs (and accessed by name) but are instead accessed by e.g. a hash of their contents; see the cleanup sketch below.
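A minimal sketch of the caching idea, assuming Java; `TemporaryTableCache`, `Uploader`, and `getOrUpload` are illustrative names, not existing Enso APIs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache uploaded temporary tables keyed by a hash of
// their contents, so re-running the same workflow reuses a previous upload.
final class TemporaryTableCache {
    // Maps a content hash of the in-memory lookup table to the name of the
    // temporary database table that already holds that data.
    private final Map<Long, String> uploadedTables = new ConcurrentHashMap<>();

    @FunctionalInterface
    interface Uploader {
        // Uploads the data and returns the new temporary table's name.
        String upload();
    }

    // Returns the name of a temporary table containing the given data,
    // uploading it only if no matching upload exists yet. The content hash
    // is assumed to be computed during the scan the upload performs anyway.
    // A real implementation would also verify the cached table's contents on
    // a hash hit, since equal hashes do not guarantee equal data.
    String getOrUpload(long contentHash, Uploader uploader) {
        return uploadedTables.computeIfAbsent(contentHash, h -> uploader.upload());
    }
}
```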
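And a hedged sketch of the GC-driven cleanup, using the JDK's `java.lang.ref.Cleaner` as a stand-in for the Managed_Resource framework; `ManagedTemporaryTable` and `DropAction` are again illustrative names:

```java
import java.lang.ref.Cleaner;
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical sketch: tie the lifetime of an uploaded temporary table to a
// handle object, dropping the table once the handle is garbage collected.
final class ManagedTemporaryTable implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not reference the handle itself, otherwise the
    // handle could never become unreachable. Assumes the connection outlives
    // the handle and the table name was generated internally (no injection).
    private record DropAction(String tableName, Connection connection) implements Runnable {
        @Override public void run() {
            try (var stmt = connection.createStatement()) {
                stmt.execute("DROP TABLE IF EXISTS " + tableName);
            } catch (SQLException e) {
                // Best-effort cleanup of a temporary table; ignore failures.
            }
        }
    }

    final String tableName;
    private final Cleaner.Cleanable cleanable;

    ManagedTemporaryTable(String tableName, Connection connection) {
        this.tableName = tableName;
        this.cleanable = CLEANER.register(this, new DropAction(tableName, connection));
    }

    // Eager cleanup for callers that know the table is no longer needed;
    // otherwise the Cleaner drops it some time after the handle is GCed.
    @Override public void close() {
        cleanable.clean();
    }
}
```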
On the other hand, for bigger tables the upload itself may be costly, so perhaps failing and telling the user to explicitly upload the table beforehand (i.e. what we currently do) is the better solution?
This way the user keeps more control over execution and is not surprised by costly uploads that may take a long time.