wetterdienst
Parameters start_date and end_date not working as expected.
Describe the bug Parameters start_date and end_date are not working as expected. They don't return station data even though data should be available.
To Reproduce
import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
r = DwdObservationRequest(
    parameter=[
        DwdObservationDataset.TEMPERATURE_AIR,
        DwdObservationDataset.WIND,
        DwdObservationDataset.PRECIPITATION
    ],
    resolution=DwdObservationResolution.MINUTE_10,
    period=DwdObservationPeriod.NOW,
    start_date=now - timedelta(minutes=360),  # <- comment out and it works
    end_date=now  # <- comment out and it works
)
# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))
assert not station.values.all().df.empty, 'empty dataframe found' # <- error
You can see that data is available for the specific interval if you comment out start_date and end_date.
Expected behavior Should return data for only the time frame specified.
Desktop (please complete the following information):
- OS: [Windows]
- Python-Version [3.9]
Dear @TheAnalystx ,
apparently when I run the code I get actual data:
station_id dataset ... value quality
0 05490 temperature_air ... 99880.0 2.0
1 05490 temperature_air ... 99850.0 2.0
2 05490 temperature_air ... 99820.0 2.0
3 05490 temperature_air ... 99810.0 2.0
4 05490 temperature_air ... 99800.0 2.0
.. ... ... ... ... ...
355 05490 precipitation ... 0.0 2.0
356 05490 precipitation ... 0.0 2.0
357 05490 precipitation ... 0.0 2.0
358 05490 precipitation ... 0.0 2.0
359 05490 precipitation ... NaN NaN
Could you give more details on your environment and the request? What value does now have when you get empty data? And at what time did you run the code?
Interesting, it works for me now too. It was late at night; I will try to reproduce it. Maybe I can find a pattern. Thanks for your fast feedback!
The issue re-appeared; the time of execution was 23:11 Berlin time. I added the timestamps to the example:
import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
# datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)
r = DwdObservationRequest(
    parameter=[
        DwdObservationDataset.TEMPERATURE_AIR,
        DwdObservationDataset.WIND,
        DwdObservationDataset.PRECIPITATION
    ],
    resolution=DwdObservationResolution.MINUTE_10,
    period=DwdObservationPeriod.NOW,
    start_date=now - timedelta(minutes=360),  # <- comment out and it works
    # start_date = datetime.datetime(2023, 1, 29, 16, 11, 39, 501404, tzinfo=datetime.timezone.utc)
    end_date=now  # <- comment out and it works
    # end_date = datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)
)
# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))
assert not station.values.all().df.empty, 'empty dataframe found' # <- error
df = station.values.all().df
df['date'].min() # Timestamp('2023-01-29 00:00:00+0000', tz='UTC')
df['date'].max() # Timestamp('2023-01-29 21:50:00+0000', tz='UTC')
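As a sanity check on the timestamps above, here is a small sketch using only the standard library. The function name stamps_in_window is made up for illustration and is not part of wetterdienst. It assumes values sit on a regular 10-minute grid, which is what the MINUTE_10 resolution suggests, and counts how many grid timestamps between the observed minimum and maximum dates fall inside the requested window.

```python
from datetime import datetime, timedelta, timezone


def stamps_in_window(data_min, data_max, start, end, step=timedelta(minutes=10)):
    """Return the grid timestamps in [data_min, data_max] that lie in [start, end]."""
    stamps = []
    current = data_min
    while current <= data_max:
        if start <= current <= end:
            stamps.append(current)
        current += step
    return stamps


utc = timezone.utc
# Values copied from the comments in the example above.
end = datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=utc)
start = end - timedelta(minutes=360)
data_min = datetime(2023, 1, 29, 0, 0, tzinfo=utc)
data_max = datetime(2023, 1, 29, 21, 50, tzinfo=utc)

expected = stamps_in_window(data_min, data_max, start, end)
print(len(expected), expected[0].isoformat(), expected[-1].isoformat())
```

With these values the window covers 34 grid timestamps per parameter (16:20 to 21:50 UTC), so the requested interval does overlap the available data.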
Thanks for the report! I also did a request just now and still got values:
station_id dataset ... value quality
0 05490 temperature_air ... 99330.0 2.0
1 05490 temperature_air ... 99310.0 2.0
2 05490 temperature_air ... 99310.0 2.0
3 05490 temperature_air ... 99290.0 2.0
4 05490 temperature_air ... 99280.0 2.0
.. ... ... ... ... ...
355 05490 precipitation ... 0.0 2.0
356 05490 precipitation ... NaN NaN
357 05490 precipitation ... NaN NaN
358 05490 precipitation ... NaN NaN
359 05490 precipitation ... NaN NaN
Did you switch off the cache for once and try the same request?
Yes, the issue persisted when I turned it off by using Settings.cache = False
I will look if I can identify the issue at some point.
Thanks for the feedback! I just ran it again and again got values:
station_id dataset ... value quality
0 05490 temperature_air ... 98630.0 2.0
1 05490 temperature_air ... 98620.0 2.0
2 05490 temperature_air ... 98620.0 2.0
3 05490 temperature_air ... 98620.0 2.0
4 05490 temperature_air ... 98590.0 2.0
.. ... ... ... ... ...
355 05490 precipitation ... 0.0 2.0
356 05490 precipitation ... 0.0 2.0
357 05490 precipitation ... 0.0 2.0
358 05490 precipitation ... NaN NaN
359 05490 precipitation ... NaN NaN
@gutzbenj
Hi, a short question: does this code work for you?
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
r = DwdObservationRequest(
    parameter=[
        DwdObservationDataset.TEMPERATURE_AIR,
        DwdObservationDataset.WIND,
        DwdObservationDataset.PRECIPITATION
    ],
    resolution=DwdObservationResolution.MINUTE_10,
    period=DwdObservationPeriod.HISTORICAL,
)
# search weather for coordinates:
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
station = r.filter_by_rank(rank=1, latlon=(hannover['latitude'], hannover['longitude']))
df = station.values.all().df
assert not df.empty, 'empty dataframe found' # <- error
print(df)
Because I receive:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
coro = func()
File "<input>", line 1, in <module>
File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 755, in all
for result in tqdm(self.query(), total=len(self.sr.station_id), file=tqdm_out):
File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\tqdm\std.py", line 1195, in __iter__
for obj in iterable:
File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 449, in query
parameter_df = self._collect_station_parameter(station_id, parameter, dataset)
File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 132, in _collect_station_parameter
date_ranges = self._get_historical_date_ranges(station_id, dataset)
File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 340, in _get_historical_date_ranges
interval = pd.Interval(from_date_min, to_date_max, closed="both")
File "pandas\_libs\interval.pyx", line 325, in pandas._libs.interval.Interval.__init__
File "pandas\_libs\interval.pyx", line 345, in pandas._libs.interval.Interval._validate_endpoint
ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Inte
I tried creating a completely new conda env because of the other error.
As for the other error:
The error still persists (I even manually deleted the cache files and set up a new environment with Python 3.10 instead of 3.11). Could you maybe also try a different location?
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
I am not sure whether this is a different buggy location or whether I just made a mistake when copy-pasting the buggy location.
The given code also throws an error for me. It happens because this method merges the metadata of all requested datasets, and some stations do not exist in all of them. I'll try to figure out a solution as soon as possible. Until then, try to request the datasets separately!
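A minimal sketch of what requesting the datasets separately could look like. Both function names are hypothetical helpers, not part of wetterdienst; the wetterdienst calls inside fetch_nearest_station are only the ones already used in this thread. Note that the nearest station may differ per dataset when each request is ranked on its own.

```python
def collect_per_dataset(datasets, fetch_one):
    """Fetch each dataset on its own so one failure does not abort the rest."""
    frames, errors = {}, {}
    for dataset in datasets:
        try:
            frames[dataset] = fetch_one(dataset)
        except ValueError as exc:  # the error type seen in the traceback above
            errors[dataset] = exc
    return frames, errors


def fetch_nearest_station(dataset, latlon=(52.39197954397832, 9.80360833506706)):
    """Hypothetical helper: one request per dataset for the station nearest to latlon."""
    # Imported lazily so collect_per_dataset works without wetterdienst installed.
    from wetterdienst.provider.dwd.observation import (
        DwdObservationPeriod,
        DwdObservationRequest,
        DwdObservationResolution,
    )

    request = DwdObservationRequest(
        parameter=[dataset],
        resolution=DwdObservationResolution.MINUTE_10,
        period=DwdObservationPeriod.HISTORICAL,
    )
    station = request.filter_by_rank(rank=1, latlon=latlon)
    return station.values.all().df
```

It would be called as collect_per_dataset([DwdObservationDataset.TEMPERATURE_AIR, DwdObservationDataset.WIND, DwdObservationDataset.PRECIPITATION], fetch_nearest_station), which returns one dataframe per dataset plus any errors that were caught.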
@gutzbenj Thank you for your answer, should I create a different ticket for the issue?
@TheAnalystx I think it falls into the same category as the skip_empty issue, so I guess no separate ticket is required.
Dear @TheAnalystx ,
I tried working on some improvements regarding the problem of having empty data at https://github.com/earthobservations/wetterdienst/pull/889 . The main idea there is that you set skip_empty in settings. We then iterate over all stations and, during collection, increase a counter whenever a station has enough data; if it does not, we drop that station and continue with the next one.
There are currently some problems:
1.) If you request multiple datasets, we calculate the availability of data per parameter across all collected parameters of all datasets (e.g. temperature_air, precipitation and other datasets). If a station does not have all datasets available, we normally return a completely empty dataset for the missing ones; however, we don't do that if start_date and end_date are not given in the request.
As a result, those parameters are not counted as empty when data availability is calculated, simply because they are not present in the resulting dataframe. The rate of available data therefore tends to be higher than it should be, because empty datasets are not taken into account. This is not the case for parameters in the available datasets, because there the date range of that dataset is used even for completely empty parameters.
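To illustrate the effect described above, here is a toy calculation in plain Python. It is not the library's implementation: the data layout (a dict of lists with None for missing values) and both function names are assumptions made for this sketch.

```python
from statistics import mean


def availability(values):
    """Share of non-missing entries in a list (None marks a missing value)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def mean_availability(frame, requested=None):
    """Mean availability over parameters.

    frame maps parameter name -> list of values. If requested is given,
    parameters missing from frame count as fully empty (availability 0.0).
    """
    names = requested if requested is not None else list(frame)
    return mean([availability(frame.get(name, [])) for name in names])


# The station only delivered one of the two requested parameters.
frame = {"temperature_air_mean_200": [1.2, 1.3, None, 1.1]}
requested = ["temperature_air_mean_200", "precipitation_height"]

print(mean_availability(frame))             # 0.75: absent parameter ignored
print(mean_availability(frame, requested))  # 0.375: absent parameter counted as empty
```

The first figure is the inflated rate that results when a missing dataset simply does not appear in the dataframe; the second is what the rate would be if it were counted as empty.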
The whole topic is quite complex to explain, should we have another chat?
Dear @gutzbenj, yes, let's have another chat. Is there such a functionality on GitHub? It sounds complicated, but it would be a huge boost in accessibility in my opinion.
I found a solution, but if you like we can still have a chat and I can introduce you to the whole library. There's no such functionality, but we could just do a Skype (or whatever) call.
I just pushed some changes to main, so if you now install the live version from GitHub, the code should work for you. Just be careful: if you request too many parameters, it will probably run endlessly collecting station data, because few stations are likely to have all of it.
You can select the criteria for missing values with settings such as skip_threshold=0.8 and skip_criteria="min" (or "mean" or "max"), where "min" is the lowest availability of all parameters, "mean" is the average of the availabilities of all parameters and "max" is the highest availability.
E.g. if you request "precipitation_height" and "temperature_air_mean_200" and have the following availabilities
parameter | perc |
---|---|
precipitation_height | 0.7 |
temperature_air_mean_200 | 0.9 |
the station would fail for the above setting with skip_criteria="min" (because precipitation_height is below the threshold) and the search would continue with the next station, but it would pass with skip_criteria="mean" and skip_criteria="max".
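The decision in this example can be sketched in a few lines of Python. This is an illustration only, not the library's code: the function name station_passes is made up, and it assumes a station passes when the aggregated availability is at or above skip_threshold.

```python
import math
from statistics import mean

# Map each skip_criteria value to the aggregation it stands for.
CRITERIA = {"min": min, "mean": mean, "max": max}


def station_passes(availabilities, skip_threshold=0.8, skip_criteria="min"):
    """Return True if the aggregated availability reaches the threshold."""
    value = CRITERIA[skip_criteria](list(availabilities.values()))
    # isclose tolerates floating-point rounding exactly at the boundary.
    return value >= skip_threshold or math.isclose(value, skip_threshold)


availabilities = {"precipitation_height": 0.7, "temperature_air_mean_200": 0.9}

for criteria in ("min", "mean", "max"):
    print(criteria, station_passes(availabilities, 0.8, criteria))
```

For the table above, "min" evaluates 0.7 and fails, while "mean" (0.8) and "max" (0.9) pass.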