wetterdienst Parameters start_date and end_date not working as expected.

Describe the bug: The parameters start_date and end_date do not work as expected. Requests that set them return no station data even though data should be available.

To Reproduce

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            end_date=now  # <- comment out and it works
        )
# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))
assert not station.values.all().df.empty, 'empty dataframe found'  # <- error

You can see that data is available for the specific interval if you comment out start_date and end_date.

Expected behavior: The request should return data only for the specified time frame.

Desktop (please complete the following information):

OS: [Windows]
Python-Version [3.9]

Jan 29 '23 02:01 TheAnalystx

Dear @TheAnalystx ,

apparently when I run the code I get actual data:

station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99880.0     2.0
1        05490  temperature_air  ...  99850.0     2.0
2        05490  temperature_air  ...  99820.0     2.0
3        05490  temperature_air  ...  99810.0     2.0
4        05490  temperature_air  ...  99800.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      0.0     2.0
359      05490    precipitation  ...      NaN     NaN

Could you give more details on your environment and the request? What is the value of now when you get empty data? And at what time did you run the code?

Jan 29 '23 10:01 gutzbenj

Interesting, it works now for me too. It was late in the night, I will try to reproduce it. Maybe I can find a pattern. Thanks for your fast feedback!

Jan 29 '23 14:01 TheAnalystx

The issue reappeared; the time of execution was 23:11 Berlin time. I added the timestamps to the example:

import pytz
from datetime import datetime, timezone, timedelta
from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod

now = datetime.now(tz=pytz.timezone('Europe/Berlin')).astimezone(tz=timezone.utc)
# datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)

r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.NOW,
            start_date=now - timedelta(minutes=360),  # <- comment out and it works
            # start_date = datetime.datetime(2023, 1, 29, 16, 11, 39, 501404, tzinfo=datetime.timezone.utc)
            end_date=now  # <- comment out and it works
            # end_date = datetime.datetime(2023, 1, 29, 22, 11, 39, 501404, tzinfo=datetime.timezone.utc)
        )

# search weather for coordinates:
wernigerode = {'latitude': 51.8395648304923, 'longitude': 10.780834955884814}
station = r.filter_by_rank(rank=1, latlon=(wernigerode['latitude'], wernigerode['longitude']))

assert not station.values.all().df.empty, 'empty dataframe found'  # <- error
df = station.values.all().df
df['date'].min()  # Timestamp('2023-01-29 00:00:00+0000', tz='UTC')
df['date'].max()  # Timestamp('2023-01-29 21:50:00+0000', tz='UTC')

Jan 29 '23 22:01 TheAnalystx

Thanks for the report! I also did a request just now and still got values:

station_id          dataset  ...    value quality
0        05490  temperature_air  ...  99330.0     2.0
1        05490  temperature_air  ...  99310.0     2.0
2        05490  temperature_air  ...  99310.0     2.0
3        05490  temperature_air  ...  99290.0     2.0
4        05490  temperature_air  ...  99280.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      NaN     NaN
357      05490    precipitation  ...      NaN     NaN
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

Did you switch off the cache for once and try the same request?

Jan 29 '23 22:01 gutzbenj

Yes, the issue persisted when I turned it off by using Settings.cache = False. I will see if I can identify the cause at some point.

Jan 31 '23 12:01 TheAnalystx

Thanks for the feedback! I just ran it again and again got values:

 station_id          dataset  ...    value quality
0        05490  temperature_air  ...  98630.0     2.0
1        05490  temperature_air  ...  98620.0     2.0
2        05490  temperature_air  ...  98620.0     2.0
3        05490  temperature_air  ...  98620.0     2.0
4        05490  temperature_air  ...  98590.0     2.0
..         ...              ...  ...      ...     ...
355      05490    precipitation  ...      0.0     2.0
356      05490    precipitation  ...      0.0     2.0
357      05490    precipitation  ...      0.0     2.0
358      05490    precipitation  ...      NaN     NaN
359      05490    precipitation  ...      NaN     NaN

Jan 31 '23 22:01 gutzbenj

@gutzbenj

Hi, a short question: does this code work for you?

from wetterdienst.provider.dwd.observation import DwdObservationRequest, DwdObservationDataset, \
    DwdObservationResolution, DwdObservationPeriod
r = DwdObservationRequest(
            parameter=[
                DwdObservationDataset.TEMPERATURE_AIR,
                DwdObservationDataset.WIND,
                DwdObservationDataset.PRECIPITATION
            ],
            resolution=DwdObservationResolution.MINUTE_10,
            period=DwdObservationPeriod.HISTORICAL,
        )
# search weather for coordinates:
hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706}
station = r.filter_by_rank(rank=1, latlon=(hannover['latitude'], hannover['longitude']))
df = station.values.all().df
assert not df.empty, 'empty dataframe found'  # <- error
print(df)

Because I receive:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 755, in all
    for result in tqdm(self.query(), total=len(self.sr.station_id), file=tqdm_out):
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\core\scalar\values.py", line 449, in query
    parameter_df = self._collect_station_parameter(station_id, parameter, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 132, in _collect_station_parameter
    date_ranges = self._get_historical_date_ranges(station_id, dataset)
  File "D:\Anaconda3\envs\dwd_weather_3_10\lib\site-packages\wetterdienst\provider\dwd\observation\api.py", line 340, in _get_historical_date_ranges
    interval = pd.Interval(from_date_min, to_date_max, closed="both")
  File "pandas\_libs\interval.pyx", line 325, in pandas._libs.interval.Interval.__init__
  File "pandas\_libs\interval.pyx", line 345, in pandas._libs.interval.Interval._validate_endpoint
ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Inte

I tried creating a completely new conda env because of the other error.

As for the other error: it still persists (I even manually deleted the cache files and set up a new environment with Python 3.10 instead of 3.11). Could you maybe also try a different location? hannover = {'latitude': 52.39197954397832, 'longitude': 9.80360833506706} I am not sure whether this is a different buggy location or whether I just made a mistake when copy-pasting the buggy location.

Feb 06 '23 13:02 TheAnalystx

The given code also throws an error for me. The cause is that this method merges the metadata of all requested datasets, and some stations do not exist in every dataset. I'll try to find a solution as soon as possible. Until then, try requesting the datasets separately!
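A minimal sketch of that workaround, assuming a hypothetical helper collect_per_dataset and a caller-supplied fetch function (neither is part of wetterdienst). With wetterdienst, fetch would presumably build one DwdObservationRequest per dataset, passing parameter=[dataset] as in the examples in this thread, and return station.values.all().df; that wiring is an assumption and is replaced by a stub here so the sketch runs without network access.

```python
def collect_per_dataset(datasets, fetch):
    """Call fetch(dataset) once per dataset and keep successes and failures apart.

    fetch is supplied by the caller (hypothetical); it should return the data
    for one dataset or raise ValueError, as seen in the traceback above.
    """
    results, failures = {}, {}
    for dataset in datasets:
        try:
            results[dataset] = fetch(dataset)
        except ValueError as exc:
            # One failing dataset no longer aborts the others.
            failures[dataset] = str(exc)
    return results, failures


def fake_fetch(dataset):
    """Stub standing in for a real per-dataset request (illustration only)."""
    if dataset == "wind":
        raise ValueError("station not available in this dataset")
    return [f"{dataset} rows"]


results, failures = collect_per_dataset(
    ["temperature_air", "wind", "precipitation"], fake_fetch
)
print(sorted(results))
print(failures)
```

Requesting each dataset on its own means a station missing from one dataset only affects that dataset's result.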

Feb 12 '23 21:02 gutzbenj

@gutzbenj Thank you for your answer, should I create a different ticket for the issue?

Feb 14 '23 13:02 TheAnalystx

@TheAnalystx I think it falls into the same category as the skip_empty issue, so I guess no separate ticket is required.

Feb 19 '23 14:02 gutzbenj

Dear @TheAnalystx ,

I tried working on some improvements regarding the problem of empty data at https://github.com/earthobservations/wetterdienst/pull/889 . The main idea is that when you set skip_empty in the settings, we iterate over all stations and, during collection, increase a counter whenever a station has enough data; if it does not, we simply drop that station and continue with the next one.

There are currently some problems: 1.) If you request multiple datasets, we calculate the availability of data per parameter for all collected parameters of all datasets (e.g. temperature_air, precipitation and other datasets). If a station does not have all datasets available, we normally try to provide a complete, empty dataset for it; however, we don't do that if start_date and end_date are not given in the request.

As a result, those parameters are not counted as empty when the data availability is calculated, simply because they are not present in the resulting dataframe. The rate of available data therefore tends to be higher than it should be, because empty datasets are not taken into account. This is not the case for parameters in the available datasets, because there we use the given date range of that dataset, even for completely empty parameters.
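The effect can be illustrated with a small sketch. The function names, the parameter wind_speed and the availability numbers are all made up for illustration, and this is not the library's actual calculation; it only shows how leaving absent parameters out of an average inflates the rate.

```python
def mean_ignoring_missing(availability, requested):
    """Average only over parameters that appear in the result (inflated rate)."""
    present = [availability[p] for p in requested if p in availability]
    return sum(present) / len(present)


def mean_counting_missing(availability, requested):
    """Average over all requested parameters, counting absent ones as 0.0."""
    return sum(availability.get(p, 0.0) for p in requested) / len(requested)


# Hypothetical numbers: the station has no wind dataset at all, so
# "wind_speed" is simply absent from the collected availabilities.
requested = ["temperature_air_mean_200", "precipitation_height", "wind_speed"]
availability = {"temperature_air_mean_200": 0.9, "precipitation_height": 0.7}

inflated = mean_ignoring_missing(availability, requested)
corrected = mean_counting_missing(availability, requested)
print(round(inflated, 3), round(corrected, 3))
```

With these numbers the first average only sees two parameters, while the second one also accounts for the parameter that is missing entirely.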

The whole topic is quite complex to explain, should we have another chat?

Feb 19 '23 21:02 gutzbenj

Dear @gutzbenj, yes, let's have another chat. Is there such a functionality on GitHub? It sounds complicated, but it would be a huge boost in accessibility in my opinion.

Feb 24 '23 01:02 TheAnalystx

I found a solution, but if you like we can still have a chat and I can introduce you to the whole library. There's no such functionality, but we could just do a Skype (or similar) call.

Feb 24 '23 22:02 gutzbenj

I just pushed some changes to main, so if you now install the live version from GitHub, the code should work for you. Just be careful: if you request too many parameters, it will probably run for a very long time fetching all station data, because only a few stations are likely to have the data.

You can select the criteria for missing values like skip_threshold=0.8 and skip_criteria="min" (or "mean" or "max"), where "min" would be the lowest availability of all parameters, "mean" would be the average of availabilities of all parameters and "max" the highest availability.

E.g. if you request "precipitation_height" and "temperature_air_mean_200" and have the following availabilities

parameter	perc
precipitation_height	0.7
temperature_air_mean_200	0.9

the station would fail for the above settings with skip_criteria="min" (because precipitation_height is below the threshold), and the search would continue with the next station; it would pass with skip_criteria="mean" and skip_criteria="max".
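The example can be reproduced with a short sketch. The function station_passes is hypothetical and only mirrors the description above; it is not the library's implementation. Treating a value equal to the threshold as passing is an assumption, chosen so that the "mean" case (exactly 0.8 here) matches the outcome described.

```python
def station_passes(availabilities, skip_threshold=0.8, skip_criteria="min"):
    """Return True if the station would be kept (hypothetical helper).

    availabilities maps parameter name -> share of available values (0..1).
    """
    reducers = {
        "min": min,                            # lowest availability of all parameters
        "mean": lambda v: sum(v) / len(v),     # average availability
        "max": max,                            # highest availability
    }
    values = list(availabilities.values())
    # Assumption: a value equal to the threshold counts as passing.
    return reducers[skip_criteria](values) >= skip_threshold


availabilities = {
    "precipitation_height": 0.7,
    "temperature_air_mean_200": 0.9,
}

for criteria in ("min", "mean", "max"):
    print(criteria, station_passes(availabilities, 0.8, criteria))
```

With "min" the station is rejected because 0.7 is below 0.8, while "mean" (0.8) and "max" (0.9) both reach the threshold.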

Feb 26 '23 20:02 gutzbenj
