Scrape action should not be performed during startup when scan interval set to 0
Title says it all. My scrape sensor relies on `params` for some parts of the resource. Because these params have not (yet) been initialized on startup, the (unwanted and unneeded!) scrape results in errors.
What should the value of the sensors be on startup in this case?
To be honest: I don't care, because I don't need nor use them at that moment; that's why their scan interval is set to 0 in the first place. Or maybe a more reasonable and acceptable answer would be: restored from their previous value, as with most other sensors? The running of the scrape on startup has nothing to do with any need of these sensors to be refreshed/updated...
Would also like this. I use the `resource_template:` option in Multiscrape to form some URLs using an attribute from another integration. But because Multiscrape loads faster (and attempts to scrape) before that other integration loads and has a sensor value, the template renders a broken URL that results in a bunch of 404 and 500 errors on every startup.
IMO, keep the default behavior as-is but introduce a new optional boolean like `scrape_on_startup: false`; that way it works regardless of the user's scan interval. The sensor state could be `unknown`, so the user knows that the Multiscrape integration is loaded but just hasn't performed a scrape yet.
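As a sketch of what that proposal could look like in configuration (note: the `scrape_on_startup` key is hypothetical here, it does not exist in Multiscrape at this point; the rest mirrors existing options):

```yaml
multiscrape:
  - name: My scraper
    resource: "https://example.com/data"
    scan_interval: 30
    # Hypothetical option: skip the initial scrape at Home Assistant startup.
    # Sensors would report 'unknown' until the first scheduled scrape runs.
    scrape_on_startup: false
    sensor:
      - unique_id: my_sensor
        name: My Sensor
        value_template: "{{ value }}"
```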
Glad somebody agrees with my point. Although I beg to differ with the suggestions and stand behind my own, because:
- Setting `scan_interval` to 0 clearly indicates that one wants to perform the scraping at one's own tempo, if and when needed, under the sole control of the user and their automations. It should therefore NOT be 'externally' forced at startup. Any other interpretation does not make sense, so I see no need for an additional setting.
- The same reasoning goes for the sensor values on startup. Restarting Hass is not in any way an objective reason to change the values of these sensors from their previous state, which should therefore simply be retained. Why would they have to be treated differently than, e.g., the state of a light or of a temperature sensor? I fail to see what influence Hass's restart could or would have on the content of the site we're scraping from, and therefore on the values we're scraping it for. And in the rather unlikely case of an extremely volatile site, one can always self-initiate a scrape on startup...
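For what it's worth, self-initiating a scrape on startup could be done with an automation along these lines. This is only a sketch: the exact service name is an assumption on my part (Multiscrape registers a per-scraper trigger service; check the integration docs for the precise name generated from your scraper's name):

```yaml
# Hedged sketch: fire one manual scrape once Home Assistant has started.
# 'multiscrape.trigger_scrape_flights' is an assumed service name derived
# from the scraper name 'Scrape Flights' -- verify against the docs.
automation:
  - alias: "Scrape flights once after startup"
    trigger:
      - platform: homeassistant
        event: start
    action:
      - service: multiscrape.trigger_scrape_flights
```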
I agree with @Paul-Vdp and I will work on implementing this. It's not a small feature request though, so it will take some time.
Much obliged @danieldotnl! I realize it is not a simple change, but I am confident you will manage ;-)
@danieldotnl Any progress made on this? Just asking ;-)
scan_interval: 0 will be a really useful feature when implemented - I wholly support it.
Please try pre-release v8.0.0!
Really appreciate that you took care of my (and others') two requests. But I am truly sorry to say that at least the problem with the unwanted scrape on startup does not seem to be resolved, as it still happens to me...
No worries, that's why it is a pre-release 😊 Can you share your config and debug logging?
Sure, here you go:

```yaml
multiscrape:
  - name: Scrape Flights
    resource: "https://www.tuifly.be/flight/nl/search?adults=2&nearByAirports=false&isOneWay=true"
    params:
      flyingFrom%5B%5D: "{{ states('var.flyfrom')|default('BRU') }}"
      flyingTo%5B%5D: "{{ states('var.flyto')|default('HRG') }}"
      depDate: "{{ states('input_datetime.scrape_date')|default(now().date()) }}"
    headers:
      User-Agent: "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
    scan_interval: 0
    sensor:
      - unique_id: scrape_flights
        name: Scrape Flights
        value_template: "Flights"
        attributes:
          - name: flights
            value_template: >
              {% set ns = namespace(flights=[]) %}
              {% set split = value.split('"flightViewData":')[1].split(',"seasonEndDate"')[0]|from_json %}
              {% for flight in split %}
              {% set weekDay = flight.departureDate|as_datetime|as_timestamp|timestamp_custom('%a') %}
              {% set detail = '{"weekDay":"' + weekDay + '", "departureDate":"' + flight.departureDate + '", "adultPrice":"' + flight.adultPrice + '", "depTime":"' + flight.journeySummary.depTime + '", "arrivalTime":"' + flight.journeySummary.arrivalTime + '", "journeyDuration":"' + flight.journeySummary.journeyDuration + '", "journeyType":"' + flight.journeySummary.journeyType + '", "arrivalAirport":"' + flight.flightsectors[0].arrivalAirport.name + '"}' %}
              {% set ns.flights = ns.flights + [detail] %}
              {% endfor %}
              {% set newflight = ns.flights|replace(''', '') %}
              {% set oldflights = state_attr('sensor.scrape_flights', 'flights') %}
              {% if states('var.scrape_counter')|int == 1 or oldflights == [] or oldflights == null %}
              {{ newflight }}
              {% else %}
              {% set flights = oldflights + newflight|from_json %}
              {{ flights }}
              {% endif %}
            on_error:
              value: last
```
After startup, I get in the logs:

```
This error originated from a custom integration.
Logger: custom_components.multiscrape.entity
Source: custom_components/multiscrape/entity.py:158
Integration: Multiscrape (documentation, issues)
First occurred: 11:29:46 (1 occurrences)
Last logged: 11:29:46

Scrape Flights # Scrape Flights # flights # Unable to extract data from HTML
```
which to me shows that it is still running the scrape on startup, no?
Fixed in v8.0.1!
Tested and approved! :-) Thanks @danieldotnl ...