
GetMap/GetFeatures/GetTiles give false errors

Open jochemthart1 opened this issue 4 years ago • 0 comments

Describe the bug: The probes GetMap, GetFeatures and GetTiles sometimes give false errors when many resources (about 1000) are running with a short run_frequency (e.g. 5 minutes).

To Reproduce: Steps to reproduce the behavior:

  1. Add 50-100 WMS resources with GetCapabilities and GetMap probes.
  2. Set run_frequency to 1 minute.
  3. Delete all records from the run table in the database.
  4. After a while, errors such as "No WMS layers found" appear, even though the resources work fine.

Expected Behavior: This problem appears when multiple checks run in parallel. That should not be a problem, because GHC runs in a multi-threaded environment. Currently, the only way to run GHC smoothly with 1000 resources is to set run_frequency to 240 minutes. However, I would like to have the run_frequency at a maximum of 10 minutes, and preferably at 5 minutes.

Screenshots or Logfiles: I cannot provide the SQLite database file, as it makes use of authentication through GitHub. Please contact me if you are trying to solve this issue and need more information.

Context (please complete the following information):

  • OS: Linux/Windows (both tested)
  • Browser: Chrome/Firefox (both tested)
  • Python Version: 3.8
  • GeoHealthCheck Version: 0.8.3
  • Docker: Tried with and without docker

Additional context: Something that might point in the right direction: changing the expression at scheduler.py line 247 from timedelta(minutes=random.randint(0, freq)) to timedelta(seconds=random.randint(0, freq*60)) makes false errors much less likely to happen.
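To make the difference between the two expressions concrete, here is a minimal sketch that wraps each one in a function and counts how many distinct start offsets it can produce. The function names and the sample size are hypothetical and only for illustration; only the two timedelta expressions come from scheduler.py as quoted above.

```python
import random
from datetime import timedelta


def first_run_offset_minutes(freq: int) -> timedelta:
    """Expression currently in scheduler.py: whole-minute offset in [0, freq]."""
    return timedelta(minutes=random.randint(0, freq))


def first_run_offset_seconds(freq: int) -> timedelta:
    """Proposed expression: whole-second offset in [0, freq * 60]."""
    return timedelta(seconds=random.randint(0, freq * 60))


freq = 5  # run_frequency in minutes

# Draw many offsets and collect the distinct values each variant produces.
minute_slots = {first_run_offset_minutes(freq) for _ in range(10_000)}
second_slots = {first_run_offset_seconds(freq) for _ in range(10_000)}

print(len(minute_slots))  # never more than freq + 1 distinct offsets
print(len(second_slots))  # never more than freq * 60 + 1 distinct offsets
```

Note that random.randint includes both endpoints, so the minute variant yields at most freq + 1 start times and the second variant at most freq * 60 + 1.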

My theory is that when the schedule is created with an empty run table, scheduler.py assigns every resource its first run time using this line. This means that when scheduling, for example, 1000 resources using random whole minutes with a 5-minute frequency, roughly 200 resources end up scheduled at almost exactly the same moment (within the same second). With random seconds there are 60*5 = 300 possible start times, which averages out to about 3 resources per second.
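The clustering described in this theory can be checked with a small simulation. This is only a sketch under stated assumptions: busiest_slot is a hypothetical helper, it models nothing but the random first-run offset, and it ignores everything else the scheduler does. Because random.randint includes both endpoints, the minute variant has 6 possible slots for a 5-minute frequency, so the expected load is closer to 1000/6 ≈ 167 per slot than 200; the conclusion is the same either way.

```python
import random
from collections import Counter


def busiest_slot(n_resources: int, freq: int, per_second: bool, seed: int = 0) -> int:
    """Return the largest number of resources sharing one start second.

    freq is the run_frequency in minutes. With per_second=False the offset
    is a random whole minute (current behaviour); with per_second=True it is
    a random whole second (proposed behaviour).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    if per_second:
        offsets = [rng.randint(0, freq * 60) for _ in range(n_resources)]
    else:
        # Whole minutes expressed in seconds, so both variants are comparable.
        offsets = [rng.randint(0, freq) * 60 for _ in range(n_resources)]
    return max(Counter(offsets).values())


by_minute = busiest_slot(1000, 5, per_second=False)
by_second = busiest_slot(1000, 5, per_second=True)
print(f"busiest second, minute granularity: {by_minute} resources")
print(f"busiest second, second granularity: {by_second} resources")
```

With minute granularity the busiest second holds well over a hundred resources, while with second granularity it holds only a handful.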

The point at which I no longer got any errors was a schedule interval of roughly 3.5 s, meaning that after the check for one resource started, about 3.5 s passed before the check for the next resource was started.
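For reference, the arithmetic behind that spacing can be sketched as follows. Both helpers are hypothetical and not part of GHC: one computes how many resources fit into one run_frequency window at a given minimum gap, the other spreads first runs evenly over the window instead of drawing them at random. The 3.5 s figure is the value observed on my setup and may differ elsewhere.

```python
from datetime import timedelta


def max_resources_for_gap(freq: int, min_gap_seconds: float) -> int:
    """How many checks fit in one window of freq minutes at the given gap."""
    return int((freq * 60) // min_gap_seconds)


def evenly_spaced_offsets(n_resources: int, freq: int) -> list[timedelta]:
    """Spread first-run offsets evenly over one window of freq minutes."""
    spacing = (freq * 60) / n_resources  # seconds between consecutive starts
    return [timedelta(seconds=i * spacing) for i in range(n_resources)]


print(max_resources_for_gap(5, 3.5))   # resources per 5-minute window
print(max_resources_for_gap(10, 3.5))  # resources per 10-minute window

offsets = evenly_spaced_offsets(1000, 5)
print(offsets[1] - offsets[0])  # gap between two consecutive starts
```

Under these assumptions, a 3.5 s gap leaves room for 85 resources in a 5-minute window, and 1000 resources spread evenly over 5 minutes are only 0.3 s apart, which is well below the gap at which the errors disappeared.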

Don't hesitate to contact me for additional information.

jochemthart1, May 26 '21 14:05