core
core copied to clipboard
Retry when no data is returned in update for nws
Proposed change
Bump pynws version and use new feature to also retry when no data is returned during an update.
Release notes: https://github.com/MatthewFlamm/pynws/releases/tag/v1.8.0
Unlike recent PRs for nws, this one should go in a new minor release.
Type of change
- [x] Dependency upgrade
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New integration (thank you!)
- [x] New feature (which adds functionality to an existing integration)
- [ ] Deprecation (breaking change to happen in the future)
- [ ] Breaking change (fix/feature causing existing functionality to break)
- [ ] Code quality improvements to existing code or addition of tests
Additional information
- This PR fixes or closes issue: fixes #
- This PR is related to issue:
- Link to documentation pull request:
Checklist
- [x] The code change is tested and works locally.
- [x] Local tests pass. Your PR cannot be merged unless tests pass
- [x] There is no commented out code in this PR.
- [x] I have followed the development checklist
- [x] I have followed the perfect PR recommendations
- [x] The code has been formatted using Ruff (
ruff format homeassistant tests) - [ ] Tests have been added to verify that the new code works.
If user exposed functionality or configuration variables are added/changed:
- [ ] Documentation added/updated for www.home-assistant.io
If the code communicates with devices, web services, or third-party tools:
- [x] The manifest file has all fields filled out correctly.
Updated and included derived files by running:python3 -m script.hassfest. - [x] New or updated dependencies have been added to
requirements_all.txt.
Updated by runningpython3 -m script.gen_requirements_all. - [ ] For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
- [ ] Untested files have been added to
.coveragerc.
To help with the load of incoming pull requests:
- [ ] I have reviewed two other open pull requests in this repository.
Hey there @kamiyo, mind taking a look at this pull request as it has been labeled with an integration (nws) you are listed as a code owner for? Thanks!
Code owner commands
Code owners of nws can trigger bot actions by commenting:
@home-assistant closeCloses the pull request.@home-assistant rename Awesome new titleRenames the pull request.@home-assistant reopenReopen the pull request.@home-assistant unassign nwsRemoves the current integration label and assignees on the pull request, add the integration domain after the command.@home-assistant add-label needs-more-informationAdd a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.@home-assistant remove-label needs-more-informationRemove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.
Just to be certain, the intent is to NOT tag this for a patch release?
I mean it could, but this one is quality of life upgrade only. Which is why I intended it to do in a new monthly release. If you feel it can go in a patch, Im okay with that too.
Edit: the others were definitely fixes to existing issues and were more critical. This one just adds some nice to have feature.
I'm seeing unavailable after updating, but it may be because I've restarted so many times and I'm now rate limted
Will check it again later tonight
Getting these myself, which shows the usefulness of this addition. I think I need to make this draft for now anyway as the errors could be handled better in the coordinator:
[traceback shortened]
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/python/lib/python3.12/concurrent/futures/_base.py", li
ne 401, in __get_result
raise self._exception
File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
tenacity/_asyncio.py", line 50, in __call__
result = await fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/data/home-assistant/venv/lib/python3.12/site-packages/
pynws/simple_nws.py", line 262, in update_observation
raise NwsNoDataError("Observation received with no data.")
pynws.nws.NwsNoDataError: Observation received with no data.
It did recover after a few minutes. The missing data for PHOG is not a regression.
I see that is common to catch custom errors and wrap with UpdateFailed. Ironically, the most straightforward way to put this into nws integration is to get rid of the functools.partial usage.
It did recover after a few minutes. The missing data for PHOG is not a regression.
Now that I have a potential addition for catching the error better, the server goes back up!
Error in log is much better now:
(MainThread) [homeassistant.component
s.nws] Error fetching NWS observation station PAFM data: No data re
turned.
After restart. Hourly is there, but all other data is unavailable
Will check it again in a bit
Looks okay to me on my side...
I've been running it for a few days now and getting drop outs. Checkout the logbook on the right of the below screenshots. Let me know if any logging I can enable would be helpful
Hmmm. I did find a bug in pynws that would interact with this PR, but it is only for the forecasts and not the observations that you are showing. The bug only means that this PR would cause the caching of the forecast data to get overwritten when no data is returned by NWS. This would result in behavior similar to dev branch. I will fix this and bump here as we are already bumping it here. Will mark as draft.
This looks like NWS server flakiness to me, which tends to fluctuate over time. Some time periods it is solid, some time periods there are frequent drop outs. The only way to find out with the current logging is to monitor the API calls and data. Set pynws logger to debug and you will get all API calls and data returned. It is a lot of data you will log. I would look for 500 errors in the observation coordinator refresh, usually 502, and when you get [] returned (this PR catches this for restarts).
If these drop outs are for no data, then in the current dev branch, you would have had unknown instead of unavailable.
pynws version bumped again to fix forecasts caching particularly with retries. Added PR release notes to description. This won't affect the lack of data in observations as noted above.
Will push this to my production later today
Before updating for reference. Updated now and will post results after a bit
Thank you for the detailed test, I don't have a system for long term testing like this. I will again mark as draft so I can do some detailed logging. While I still suspect this is an NWS issue, it won't be great for users here. It could be that NWS won't return data if absolutely nothing in the observation has changed. This is supposed to happen only if you pass a header value (we aren't). But I'm wondering if it still happens without the header periodically.
If this is true, then we need to catch this and set availability accordingly.
In my testing, I'm often getting no data even in the first request and random no data returned for various time periods. I don't see much rhyme or reason other than some stations are more effected than others. My local station is much more robust having only one ~7 min outage over two days, albeit on the HA release version. I am more convinced this is an NWS server issue. I think the current dev branch/released core behavior is the wrong behavior, and that this PR is a more accurate representation of the data from NWS.
On the official API page , click "Updates", under "Known Issues" has this line, which is consistent with what we are seeing:
The observations endpoints always show missing (null) wind gust and may show null values for other weather properties (temperature, wind, precipitation, etc.) at stations when the MADIS source has data.
One option is to increase the total retry interval to 15 minutes or 20 minutes, but this would have only a minor effect based on your long term testing.
I noticed the API worked in Chrome, but not via the lib.
The no data problem seems to have gone away if I change the headers https://github.com/MatthewFlamm/pynws/pull/201
Maybe I'm just getting lucky though...
Will report back tomorrow and see if it drops out overnight as its going to be very obvious if there is a difference
The no data problem seems to have gone away if I change the headers MatthewFlamm/pynws#201
This will likely be a temporary fix until the server starts messing up the cache for this new user agent again. This has happened in the past on another end point on this server. See https://github.com/weather-gov/api/discussions/492. They explicitly request developers not to use cache-busting techniques however (I'm not suggesting that you are proposing this).
My plan is to gather some key raw data and open a ticket.
I have confirmed in the past that our request interval would be nowhere near frequent enough to cause a problem (in regards to use terms, but practically I'm not sure?).
One thought for feedback... We keep track of the last successful update times (including whether data is returned) in the entry.runtime_data. Then we only raise UpdateFailed in the coordinator call when this time is >X minutes ago (say 60 minutes for now). This would prevent the shorter drop outs and still be faithful to the data source. This essentially returns us to what NWS did prior to the changes in the coordinator logic that we discussed over discord. That is, treat data as valid for X minutes even over multiple failed updates.
Most significant changes in https://github.com/home-assistant/core/pull/117109/commits/8b99d0e98464bd9fbd01399748b98c398749e160 include using a custom DataUpdateCoordinator for the observations. This enables us to internally raise an UpdateFailed only after a certain amount of time elapses from the last successful update. In my limited testing, it is much more robust to outages in NWS.
In #117533, it was observed that getting no data in HA correlated with getting 2+ hr old data from a simpler curl fetch. We are filtering out old data in NWS integration by using the start_time parameter, which is causing no data to be returned.
I'm opening this for review now. We could consider wrapping the forecast coordinators into this change too. Pending feedback, I opted to not include this since it would make this PR even larger and there is already caching done in weather.get_forecasts logic. We need to avoid double dipping. This might become important for #117254 and would be more appropriate to add to that PR IMO.
If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!
If there are no breaking changes in the library bump, please do that in a separate PR first. Thanks!
https://github.com/home-assistant/core/pull/117820
One unresolved pain point in this implementation is that if the first update fails, it will wait a full 10 minutes to make another update request. To keep the PR less unwieldy as it is already getting very lengthy, I would propose this for a future PR.
I'll put this on my production system in a bit
Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)
Looks like its better but still seems to be as long as 90 minutes before new data is returned (maybe longer)
We can't fix the server not returning valid data for 90 minutes without cache busting, which is against the terms of use. This PR brings extra robustness in the HA integration while still marking the observations as invalid when applicable. The current state in the dev branch will continue to show data > 1 hr old, and users will not be able to know this.