core icon indicating copy to clipboard operation
core copied to clipboard

Retry when no data is returned in update for nws

Open MatthewFlamm opened this issue 1 year ago • 10 comments
trafficstars

Proposed change

Bump pynws version and use new feature to also retry when no data is returned during an update.

Release notes: https://github.com/MatthewFlamm/pynws/releases/tag/v1.8.0

Unlike recent PRs for nws, this one should go in a new minor release.

Type of change

  • [x] Dependency upgrade
  • [ ] Bugfix (non-breaking change which fixes an issue)
  • [ ] New integration (thank you!)
  • [x] New feature (which adds functionality to an existing integration)
  • [ ] Deprecation (breaking change to happen in the future)
  • [ ] Breaking change (fix/feature causing existing functionality to break)
  • [ ] Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:

Checklist

  • [x] The code change is tested and works locally.
  • [x] Local tests pass. Your PR cannot be merged unless tests pass
  • [x] There is no commented out code in this PR.
  • [x] I have followed the development checklist
  • [x] I have followed the perfect PR recommendations
  • [x] The code has been formatted using Ruff (ruff format homeassistant tests)
  • [ ] Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • [x] The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • [x] New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • [ ] For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • [ ] Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

MatthewFlamm avatar May 08 '24 19:05 MatthewFlamm

Hey there @kamiyo, mind taking a look at this pull request as it has been labeled with an integration (nws) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of nws can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign nws Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

home-assistant[bot] avatar May 08 '24 19:05 home-assistant[bot]

Just to be certain, the intent is to NOT tag this for a patch release?

bdraco avatar May 08 '24 20:05 bdraco

I mean it could, but this one is quality of life upgrade only. Which is why I intended it to do in a new monthly release. If you feel it can go in a patch, Im okay with that too.

Edit: the others were definitely fixes to existing issues and were more critical. This one just adds some nice to have feature.

MatthewFlamm avatar May 08 '24 20:05 MatthewFlamm

I'm seeing unavailable after updating, but it may be because I've restarted so many times and I'm now rate limted

Screenshot 2024-05-08 at 3 08 49 PM Screenshot 2024-05-08 at 3 09 23 PM

Will check it again later tonight

bdraco avatar May 08 '24 20:05 bdraco

Getting these myself, which shows the usefulness of this addition. I think I need to make this draft for now anyway as the errors could be handled better in the coordinator:

[traceback shortened]
 return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/python/lib/python3.12/concurrent/futures/_base.py", li
ne 401, in __get_result
    raise self._exception
  File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
tenacity/_asyncio.py", line 50, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
pynws/simple_nws.py", line 262, in update_observation
    raise NwsNoDataError("Observation received with no data.")
pynws.nws.NwsNoDataError: Observation received with no data.

MatthewFlamm avatar May 08 '24 20:05 MatthewFlamm

It did recover after a few minutes. The missing data for PHOG is not a regression.

Screenshot 2024-05-08 at 3 38 14 PM Screenshot 2024-05-08 at 3 38 19 PM

bdraco avatar May 08 '24 20:05 bdraco

I see that is common to catch custom errors and wrap with UpdateFailed. Ironically, the most straightforward way to put this into nws integration is to get rid of the functools.partial usage.

MatthewFlamm avatar May 08 '24 20:05 MatthewFlamm

It did recover after a few minutes. The missing data for PHOG is not a regression.

Now that I have a potential addition for catching the error better, the server goes back up!

MatthewFlamm avatar May 08 '24 20:05 MatthewFlamm

Error in log is much better now:

(MainThread) [homeassistant.component
s.nws] Error fetching NWS observation station PAFM data: No data re
turned.

MatthewFlamm avatar May 08 '24 23:05 MatthewFlamm

After restart. Hourly is there, but all other data is unavailable

Will check it again in a bit Screenshot 2024-05-09 at 7 43 18 AM Screenshot 2024-05-09 at 7 43 24 AM

bdraco avatar May 09 '24 12:05 bdraco

Looks okay to me on my side...

MatthewFlamm avatar May 09 '24 19:05 MatthewFlamm

I've been running it for a few days now and getting drop outs. Checkout the logbook on the right of the below screenshots. Let me know if any logging I can enable would be helpful

Screenshot 2024-05-12 at 8 25 56 AM

bdraco avatar May 11 '24 23:05 bdraco

Hmmm. I did find a bug in pynws that would interact with this PR, but it is only for the forecasts and not the observations that you are showing. The bug only means that this PR would cause the caching of the forecast data to get overwritten when no data is returned by NWS. This would result in behavior similar to dev branch. I will fix this and bump here as we are already bumping it here. Will mark as draft.

This looks like NWS server flakiness to me, which tends to fluctuate over time. Some time periods it is solid, some time periods there are frequent drop outs. The only way to find out with the current logging is to monitor the API calls and data. Set pynws logger to debug and you will get all API calls and data returned. It is a lot of data you will log. I would look for 500 errors in the observation coordinator refresh, usually 502, and when you get [] returned (this PR catches this for restarts).

If these drop outs are for no data, then in the current dev branch, you would have had unknown instead of unavailable.

MatthewFlamm avatar May 12 '24 00:05 MatthewFlamm

pynws version bumped again to fix forecasts caching particularly with retries. Added PR release notes to description. This won't affect the lack of data in observations as noted above.

MatthewFlamm avatar May 13 '24 19:05 MatthewFlamm

Will push this to my production later today

bdraco avatar May 13 '24 20:05 bdraco

Before updating for reference. Updated now and will post results after a bit

Screenshot 2024-05-14 at 5 58 17 AM Screenshot 2024-05-14 at 5 58 23 AM

bdraco avatar May 13 '24 20:05 bdraco

Results after running for a bit

Screenshot 2024-05-15 at 1 24 10 PM

nws.log.txt

bdraco avatar May 15 '24 04:05 bdraco

Thank you for the detailed test, I don't have a system for long term testing like this. I will again mark as draft so I can do some detailed logging. While I still suspect this is an NWS issue, it won't be great for users here. It could be that NWS won't return data if absolutely nothing in the observation has changed. This is supposed to happen only if you pass a header value (we aren't). But I'm wondering if it still happens without the header periodically.

If this is true, then we need to catch this and set availability accordingly.

MatthewFlamm avatar May 15 '24 11:05 MatthewFlamm

In my testing, I'm often getting no data even in the first request and random no data returned for various time periods. I don't see much rhyme or reason other than some stations are more effected than others. My local station is much more robust having only one ~7 min outage over two days, albeit on the HA release version. I am more convinced this is an NWS server issue. I think the current dev branch/released core behavior is the wrong behavior, and that this PR is a more accurate representation of the data from NWS.

On the official API page , click "Updates", under "Known Issues" has this line, which is consistent with what we are seeing:

The observations endpoints always show missing (null) wind gust and may show null values for other weather properties (temperature, wind, precipitation, etc.) at stations when the MADIS source has data.

One option is to increase the total retry interval to 15 minutes or 20 minutes, but this would have only a minor effect based on your long term testing.

MatthewFlamm avatar May 15 '24 21:05 MatthewFlamm

I noticed the API worked in Chrome, but not via the lib.

The no data problem seems to have gone away if I change the headers https://github.com/MatthewFlamm/pynws/pull/201

Maybe I'm just getting lucky though...

bdraco avatar May 16 '24 23:05 bdraco

Will report back tomorrow and see if it drops out overnight as its going to be very obvious if there is a difference

Screenshot 2024-05-16 at 6 09 44 PM

bdraco avatar May 16 '24 23:05 bdraco

The no data problem seems to have gone away if I change the headers MatthewFlamm/pynws#201

This will likely be a temporary fix until the server starts messing up the cache for this new user agent again. This has happened in the past on another end point on this server. See https://github.com/weather-gov/api/discussions/492. They explicitly request developers not to use cache-busting techniques however (I'm not suggesting that you are proposing this).

My plan is to gather some key raw data and open a ticket.

I have confirmed in the past that our request interval would be nowhere near frequent enough to cause a problem (in regards to use terms, but practically I'm not sure?).

MatthewFlamm avatar May 17 '24 00:05 MatthewFlamm

One thought for feedback... We keep track of the last successful update times (including whether data is returned) in the entry.runtime_data. Then we only raise UpdateFailed in the coordinator call when this time is >X minutes ago (say 60 minutes for now). This would prevent the shorter drop outs and still be faithful to the data source. This essentially returns us to what NWS did prior to the changes in the coordinator logic that we discussed over discord. That is, treat data as valid for X minutes even over multiple failed updates.

MatthewFlamm avatar May 17 '24 20:05 MatthewFlamm

Most significant changes in https://github.com/home-assistant/core/pull/117109/commits/8b99d0e98464bd9fbd01399748b98c398749e160 include using a custom DataUpdateCoordinator for the observations. This enables us to internally raise an UpdateFailed only after a certain amount of time elapses from the last successful update. In my limited testing, it is much more robust to outages in NWS.

In #117533, it was observed that getting no data in HA correlated with getting 2+ hr old data from a simpler curl fetch. We are filtering out old data in NWS integration by using the start_time parameter, which is causing no data to be returned.

I'm opening this for review now. We could consider wrapping the forecast coordinators into this change too. Pending feedback, I opted to not include this since it would make this PR even larger and there is already caching done in weather.get_forecasts logic. We need to avoid double dipping. This might become important for #117254 and would be more appropriate to add to that PR IMO.

MatthewFlamm avatar May 20 '24 19:05 MatthewFlamm

If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!

bdraco avatar May 20 '24 19:05 bdraco

If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!

https://github.com/home-assistant/core/pull/117820

MatthewFlamm avatar May 20 '24 20:05 MatthewFlamm

One unresolved pain point in this implementation is that if the first update fails, it will wait a full 10 minutes to make another update request. To keep the PR less unwieldy as it is already getting very lengthy, I would propose this for a future PR.

MatthewFlamm avatar May 20 '24 20:05 MatthewFlamm

I'll put this on my production system in a bit

bdraco avatar May 20 '24 20:05 bdraco

Screenshot 2024-05-20 at 1 10 33 PM

Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)

bdraco avatar May 20 '24 23:05 bdraco

Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)

We can't fix the server not returning valid data for 90 minutes without cache busting, which is against the terms of use. This PR brings extra robustness in the HA integration while still marking the observations as invalid when applicable. The current state in the dev branch will continue to show data > 1 hr old, and users will not be able to know this.

MatthewFlamm avatar May 20 '24 23:05 MatthewFlamm