Remove `DataFiles` table from `TransformationDB`
Looking into the performance of the TransformationSystem, and of its DB in particular, the hottest spot is the DataFiles table.
The aim of this table is to deduplicate LFNs in the DB: if multiple transformations are applied to the same file, the LFN is stored only once in DataFiles, and TransformationFiles just refers to it via a foreign key.
When a lot of transformations are running, the DataFiles table can get big (currently 80M rows in LHCb). Queries we are running against it are of this type:
SELECT LFN,FileID FROM DataFiles WHERE LFN in ('a', 'b', 'c')
They can take up to half an hour in our case.
Effectively, the DataFiles table:
- is inefficient to query (which we do very often, even to insert new files)
- is subject to race conditions (the code tries to protect against them in various places, but they still occur)
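
To make the two points above concrete, here is a minimal sketch of the lookup-then-insert pattern that a deduplicating table forces on every insert. It is not the actual TransformationDB code: the function name `get_or_add_file_ids`, the simplified schema, and the use of SQLite in place of MySQL are all assumptions made for illustration.

```python
import sqlite3

# Simplified, assumed schema: one row per distinct LFN.
SCHEMA = """
CREATE TABLE DataFiles (
    FileID INTEGER PRIMARY KEY AUTOINCREMENT,
    LFN    TEXT UNIQUE
)
"""


def get_or_add_file_ids(conn, lfns):
    """Return {lfn: file_id}, inserting any LFN not yet in DataFiles (hypothetical helper)."""
    lfns = list(lfns)
    if not lfns:
        return {}

    def lookup():
        marks = ",".join("?" * len(lfns))
        rows = conn.execute(
            f"SELECT LFN, FileID FROM DataFiles WHERE LFN IN ({marks})", lfns
        ).fetchall()
        return dict(rows)

    # First lookup: find out which LFNs are already known.
    known = lookup()
    missing = [lfn for lfn in lfns if lfn not in known]

    # Race window: between the lookup above and the insert below, another
    # writer may add the same LFN, so this insert needs extra protection.
    conn.executemany("INSERT INTO DataFiles (LFN) VALUES (?)", [(m,) for m in missing])
    conn.commit()

    # Second lookup: fetch the FileIDs, including the freshly inserted ones.
    return lookup()
```

Even in this simplified form, adding files costs two `IN (...)` queries against the largest table plus a window in which concurrent writers can collide.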
I propose to remove the DataFiles table, and add an indexed LFN column to the TransformationFiles table. This may make the DB slightly bigger, but performance should improve dramatically.
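
As a rough sketch of what the proposal could look like, the snippet below stores the LFN directly in TransformationFiles with an index on it. The column set, the `(TransformationID, LFN)` primary key, the index name, and the helper functions are assumptions for illustration, and SQLite stands in for MySQL so the example runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed, simplified layout: the LFN lives in the row itself.
    CREATE TABLE TransformationFiles (
        TransformationID INTEGER NOT NULL,
        LFN              TEXT    NOT NULL,
        Status           TEXT    NOT NULL DEFAULT 'Unused',
        PRIMARY KEY (TransformationID, LFN)
    );
    -- Index so that lookups by LFN alone do not scan the table.
    CREATE INDEX idx_tf_lfn ON TransformationFiles (LFN);
    """
)


def add_files(conn, transformation_id, lfns):
    """Attach LFNs to a transformation in one statement, with no prior lookup."""
    conn.executemany(
        "INSERT OR IGNORE INTO TransformationFiles (TransformationID, LFN) "
        "VALUES (?, ?)",
        [(transformation_id, lfn) for lfn in lfns],
    )
    conn.commit()


def transformations_for(conn, lfn):
    """Return the IDs of all transformations that reference the given LFN."""
    rows = conn.execute(
        "SELECT TransformationID FROM TransformationFiles "
        "WHERE LFN = ? ORDER BY TransformationID",
        (lfn,),
    )
    return [row[0] for row in rows]
```

In this layout, uniqueness is enforced per transformation by the primary key, so adding files is a single insert and no shared table has to be consulted or protected first. The cost is that an LFN used by several transformations is stored once per transformation.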