open-grid-emissions

Identify outlier values in reported CEMS data

Open grgmiller opened this issue 2 years ago • 4 comments

We should implement some sort of outlier detection and screening for the hourly values reported in CEMS. This outlier detection could use a combination of statistical methods and physics-based methods (e.g. gross generation should not exceed nameplate capacity).
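As a minimal sketch of the physics-based check mentioned above, the function below flags hours where reported gross generation exceeds what the unit could produce at nameplate capacity. The function name, argument names, and the `tolerance` multiplier are hypothetical and not taken from the open-grid-emissions codebase; it assumes hourly generation in MWh and capacity in MW, so one hour at full output equals the capacity value.

```python
def flag_exceeds_capacity(gross_generation_mwh, nameplate_capacity_mw, tolerance=1.0):
    """Return one boolean per hour, True where generation exceeds capacity.

    gross_generation_mwh: list of hourly values (None marks a missing hour).
    nameplate_capacity_mw: unit capacity; over one hour this is also the MWh limit.
    tolerance: hypothetical multiplier allowing some headroom above nameplate.
    """
    limit = nameplate_capacity_mw * tolerance
    # Missing hours are never flagged here; they are left for imputation.
    return [g is not None and g > limit for g in gross_generation_mwh]


flags = flag_exceeds_capacity([100.0, 450.0, None, 520.0], nameplate_capacity_mw=500.0)
```

A tolerance slightly above 1.0 may be reasonable if units can briefly exceed nameplate, but the right value is an open question for this issue.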

This should probably be implemented after loading the CEMS data but before any missing data imputation steps.
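One way to get this ordering, sketched under the assumption that the imputation step already treats `None` as missing, is to mask flagged hours before imputation runs. The helper name `mask_flagged` and the list-based inputs are illustrative only.

```python
def mask_flagged(values, flags):
    """Replace flagged hourly values with None so later imputation fills them."""
    if len(values) != len(flags):
        raise ValueError("values and flags must have the same length")
    return [None if flagged else value for value, flagged in zip(values, flags)]


# Hypothetical hourly series with one spike already flagged by a screening step.
screened = mask_flagged([10.0, 999.0, 12.0], [False, True, False])
```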

grgmiller avatar Jun 07 '22 17:06 grgmiller

This may be the source of spikiness in the output data (below), so may be a priority for V1.

Blue is our result, red is the raw 930 profile (after timestamp adjustment). Both show total generation in PJM. The spikes in the blue profile are due to spikes in CEMS data.

[Image: pjm_930_comparison]

gailin-p avatar Jun 29 '22 01:06 gailin-p

Can you provide a little more context of what the above graph is showing?

grgmiller avatar Jun 29 '22 15:06 grgmiller

Added above.

Two possible approaches:

  • Use some multiple of the IQR to filter spikes (following http://www.nature.com/articles/s41597-020-0483-x)
  • Filter values above plant capacity
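The first approach could be sketched as follows, using only the Python standard library. The multiplier `k` and its default are assumptions for illustration, not the value used in the linked paper, and the function name is hypothetical. The sketch assumes a series with no missing values and at least two data points.

```python
import statistics


def iqr_outlier_flags(values, k=3.0):
    """Flag values lying more than k * IQR outside the first or third quartile."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical hourly generation series containing one spike.
hourly = [100, 102, 98, 101, 99, 103, 97, 1000]
spike_flags = iqr_outlier_flags(hourly)
```

Because the quartiles are computed over the whole series, a unit with strong seasonal or ramping behavior may need the filter applied over a rolling window or per operating mode instead; that choice is not settled in this thread.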

gailin-p avatar Jun 29 '22 19:06 gailin-p

This paper includes some possible approaches to identifying outlier data: https://pubs.acs.org/doi/10.1021/acs.est.9b04522

https://github.com/NREL/NaTGenPD

grgmiller avatar Sep 08 '22 21:09 grgmiller