open-grid-emissions
Identify outlier values in reported CEMS data
We should implement some sort of outlier detection and screening for the hourly values reported in CEMS. This outlier detection could use a combination of statistical methods and physics-based methods (e.g. gross generation should not exceed nameplate capacity).
This should probably be implemented after loading the CEMS data but before any missing data imputation steps.
Unscreened outliers may be the source of the spikiness in the output data (below), so this may be a priority for V1.
Blue is our result; red is the raw EIA-930 profile (after timestamp adjustment). Both show total generation in PJM. The spikes in the blue profile are due to spikes in the CEMS data.
Can you provide a little more context on what the above graph is showing?
Added above.
Two possible approaches:
- Use some multiple of the IQR to filter spikes (following http://www.nature.com/articles/s41597-020-0483-x)
- Filter values above plant capacity
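A rough sketch of how the two approaches above could be combined, in case it helps the discussion. This is not the project's implementation: the function names, the `tolerance` parameter, the IQR multiple of 3, and the sample values are all assumptions made for illustration, and the sketch assumes one unit's hourly gross generation with no missing values.

```python
from statistics import quantiles


def flag_above_capacity(gross_gen_mwh, nameplate_capacity_mw, tolerance=1.0):
    """Physics-based check: flag hours where generation exceeds capacity.

    One hour at nameplate capacity yields `nameplate_capacity_mw` MWh, so the
    two quantities are directly comparable. `tolerance` is a hypothetical
    knob for allowing a small margin above nameplate.
    """
    limit = nameplate_capacity_mw * tolerance
    return [value > limit for value in gross_gen_mwh]


def flag_iqr_outliers(values, multiple=3.0):
    """Statistical check: flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    The multiple k=3.0 is an assumed default, not a value taken from the
    linked paper.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiple * iqr
    upper = q3 + multiple * iqr
    return [value < lower or value > upper for value in values]


# Made-up hourly gross generation (MWh) for a unit with 120 MW nameplate.
hourly_gen = [100.0, 102.0, 98.0, 101.0, 650.0, 99.0, 103.0, 100.0]

capacity_flags = flag_above_capacity(hourly_gen, nameplate_capacity_mw=120.0)
iqr_flags = flag_iqr_outliers(hourly_gen)

# Treat an hour as an outlier if either check flags it.
is_outlier = [a or b for a, b in zip(capacity_flags, iqr_flags)]
```

Flagged hours could then be set to missing so that the existing imputation steps fill them in. Whether to apply the IQR filter per unit, per plant, or per season is an open design question.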
This paper includes some possible approaches to identifying outlier data: https://pubs.acs.org/doi/10.1021/acs.est.9b04522
https://github.com/NREL/NaTGenPD