environmental-footprint-data
Add a script to automatically merge multiple .csv files and deal with duplicates
We need a dedicated tool to merge multiple .csv files while detecting and merging duplicates.
I've started to implement it through a new static method of DeviceCarbonFootprint:
@staticmethod
def merge(device1: 'DeviceCarbonFootprint', device2: 'DeviceCarbonFootprint',
conflict: Literal['keep2nd','interactive'] = 'keep2nd', verbose: bool = False) -> 'DeviceCarbonFootprint':
and a standalone merge_csv.py file1 file2 script written on top of the above merge function.
By default, priority is given to device2/file2.
Conflicts are detected only for attributes that are provided for both devices and whose values are clearly different. If the values are close enough, merge only prints a warning in verbose mode.
Then, there are two modes to resolve the conflicts:
- Simply keep device2 (and print the differences in verbose mode)
- Ask the user which version should be kept.
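To make the rules above concrete, here is a minimal sketch of the conflict handling. It works on plain dicts rather than on DeviceCarbonFootprint, and the names merge_records, is_close, rel_tol and ask are hypothetical, as is the 5% tolerance used for "close enough"; the actual implementation may differ on all of these.

```python
import math
from typing import Any, Callable, Dict, Literal


def is_close(a: Any, b: Any, rel_tol: float = 0.05) -> bool:
    """Return True when two values are similar enough not to be a conflict.

    Assumption: numbers are compared with a relative tolerance, everything
    else as case-insensitive, whitespace-stripped strings.
    """
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return str(a).strip().lower() == str(b).strip().lower()


def merge_records(
    rec1: Dict[str, Any],
    rec2: Dict[str, Any],
    conflict: Literal["keep2nd", "interactive"] = "keep2nd",
    verbose: bool = False,
    ask: Callable[[str], str] = input,
) -> Dict[str, Any]:
    """Merge two records, giving priority to rec2 by default."""
    merged = dict(rec1)
    for key, new in rec2.items():
        old = rec1.get(key)
        if new is None:
            continue  # rec2 provides nothing: keep whatever rec1 has
        if old is None or old == new:
            merged[key] = new  # no conflict possible
            continue
        if is_close(old, new):
            # Close values: not a conflict, only a warning in verbose mode.
            if verbose:
                print(f"warning: {key}: {old!r} is close to {new!r}")
            merged[key] = new
            continue
        # Clearly different values provided by both records: a real conflict.
        if conflict == "interactive":
            answer = ask(f"{key}: keep [1] {old!r} or [2] {new!r}? ")
            merged[key] = old if answer.strip() == "1" else new
        else:
            if verbose:
                print(f"conflict: {key}: {old!r} replaced by {new!r}")
            merged[key] = new
    return merged
```

Passing ask as a parameter is only there so that the interactive mode can be exercised without a terminal, for instance with ask=lambda prompt: "1".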
TODO:
- Add a non-regression mode only testing that device2 is consistent with device1 and that device1 does not contain more information.
- Clean up and unify some entries prior to fusion to avoid false negatives (e.g., CN versus China, issue #64)
- Find a way to deal with PCF files that report the same model name although they describe different devices (in ecodiag I also extract the model name from the main HTML files)
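For the cleanup item in the TODO list, one possible approach is to map known spellings to a canonical value before comparing attributes. The sketch below is only an illustration: COUNTRY_ALIASES and normalize_country are hypothetical names, and the alias table is a made-up sample, not the list tracked in issue #64.

```python
from typing import Optional

# Hypothetical alias table: lower-cased spelling -> canonical name.
COUNTRY_ALIASES = {
    "cn": "China",
    "china": "China",
    "us": "United States",
    "usa": "United States",
}


def normalize_country(value: Optional[str]) -> Optional[str]:
    """Return the canonical country name, or the stripped input if unknown."""
    if value is None:
        return None
    cleaned = value.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)
```

Applying such a function to both records before the merge would let CN and China compare as equal instead of being reported as a conflict.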
Some updates: merge_csv.py now also prints a summary report like this:
PYTHONPATH=. python tools/merge_csv.py boavizta-data-us.csv dell.csv -o /dev/null
------------------------------------------------------------
| Summary report |
------------------------------------------------------------
Number of singletons: 1235, 26
Number of self duplicates: 174, 2
Number of clean fusions: 455
Number of mixed fusions: 42
Number of attributes gathered from the oldest data: 122
------------------------------------------------------------
which is handy to quickly see whether there are any issues. For instance, this report means that:
- 1235 items of boavizta-data-us.csv are not present in dell.csv;
- 26 items are present in dell.csv but not in the current db;
- the current db contains 174 items having one (or more) duplicates (*);
- among the items that are in both files, 455 are fully covered by dell.csv, but for 42 items we found attributes in boavizta-data-us.csv that are not present in dell.csv.
(*) So far duplicates are detected solely based on the model name. This implies some false positives.
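As an illustration of matching on the model name alone, the first two counters of the report could be computed roughly as follows. This is a sketch under assumptions: summarize is a hypothetical helper, names are compared verbatim, and the real script may count duplicates differently (for example per row rather than per distinct name).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple, Union


def summarize(
    names1: Iterable[str], names2: Iterable[str]
) -> Dict[str, Union[int, Tuple[int, int]]]:
    """Count singletons, self duplicates and shared models by name only."""
    c1, c2 = Counter(names1), Counter(names2)
    return {
        # Distinct model names found in one file but not in the other.
        "singletons": (len(c1.keys() - c2.keys()), len(c2.keys() - c1.keys())),
        # Distinct model names appearing more than once within a file.
        "self_duplicates": (
            sum(1 for n in c1.values() if n > 1),
            sum(1 for n in c2.values() if n > 1),
        ),
        # Distinct model names present in both files (candidates for fusion).
        "in_both": len(c1.keys() & c2.keys()),
    }
```

Because only the name is used as the key, two different devices sharing a model name end up in the same bucket, which is where the false positives mentioned above come from.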