ha-multiscrape

Scrape action should not be performed during startup when scan interval set to 0

Open Paul-Vdp opened this issue 1 year ago • 8 comments

Title says it all. My scrape sensor relies on 'params' for some parts of the resource. Because these params have not (yet) been initialized on startup, the (unwanted and unneeded!) scrape results in errors.

Paul-Vdp avatar Apr 18 '24 12:04 Paul-Vdp

What should the value of the sensors be on startup in this case?

danieldotnl avatar Apr 19 '24 13:04 danieldotnl

To be honest: I don't care, because I neither need nor use them at that moment. That's why their scan interval is set to 0 in the first place. Or maybe a more reasonable and acceptable answer would be: restored from their previous value, as with most other sensors? After all, running the scrape on startup has nothing to do with any need for these sensors to be refreshed or updated ...

Paul-Vdp avatar Apr 19 '24 14:04 Paul-Vdp

Would also like this. I use the resource_template: option in Multiscrape to form some URLs using an attribute from another integration. But because Multiscrape loads (and attempts to scrape) before that other integration has loaded and has a sensor value, the template renders a broken URL that results in a bunch of 404 and 500 errors on every startup.

IMO keep the default behavior as-is but introduce a new optional boolean like scrape_on_startup: false, so that it works regardless of the user's scan interval. Sensor state could be unknown, so the user knows that the Multiscrape integration is loaded but just hasn't performed a scrape yet.
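
As a sketch of what this suggestion might look like in configuration: `scrape_on_startup` below is the hypothetical option proposed in this comment, not a confirmed Multiscrape setting, and the entity, attribute, URL, and selector are placeholders.

```yaml
multiscrape:
  - name: Example scraper
    # URL depends on an attribute of another integration's sensor (placeholder names)
    resource_template: "https://example.com/items/{{ state_attr('sensor.other_integration', 'item_id') }}"
    scan_interval: 3600
    # HYPOTHETICAL option from this proposal; not a confirmed Multiscrape setting
    scrape_on_startup: false
    sensor:
      - unique_id: example_item
        name: Example item
        select: "h1"
```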

SeanPM5 avatar Apr 29 '24 08:04 SeanPM5

Glad somebody agrees with my point. Although I beg to differ with the suggestions, and stand behind my own, because :

  1. Setting scan_interval to 0 is clearly meant to indicate that one wants to perform the scraping at one's own tempo, if and when needed, under the sole control of the user and their automations. It should therefore NOT be 'externally' forced at startup. Any other interpretation does not make sense, so I see no need for an additional setting.
  2. The same reasoning goes for the sensor values on startup. Restarting Hass is not in any way an objective reason to change the values of these sensors from their previous state, which should therefore just be retained. Why would they have to be treated differently from e.g. the state of a light, or the state of a temperature sensor, etc.? I fail to see what influence a Hass restart could have on the content of the site we're scraping, and therefore on the values we're scraping it for. And in the rather unlikely case of an extremely volatile site, one can always self-initiate a scrape on startup ...
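
A minimal sketch of such a self-initiated scrape on startup, written as a Home Assistant automation. It assumes the multiscrape block has a name (here "Scrape Flights") so that a trigger service called multiscrape.trigger_scrape_flights is available; the alias and the one-minute delay are illustrative choices.

```yaml
automation:
  - alias: "Scrape flights after startup"
    trigger:
      # Fires once when Home Assistant has started
      - platform: homeassistant
        event: start
    action:
      # Illustrative delay so entities used in templates can initialize first
      - delay: "00:01:00"
      # Assumed service name, derived from the multiscrape block's name
      - service: multiscrape.trigger_scrape_flights
```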

Paul-Vdp avatar Apr 29 '24 09:04 Paul-Vdp

I agree with @Paul-Vdp and I will work on implementing this. It's not a small feature request though, so it will take some time.

danieldotnl avatar Apr 30 '24 13:04 danieldotnl

Much obliged, @danieldotnl. I realize it is not a simple change, but I am confident you will manage ;-)

Paul-Vdp avatar May 02 '24 10:05 Paul-Vdp

@danieldotnl Any progress made on this ? Just asking ;-)

Paul-Vdp avatar Jun 11 '24 15:06 Paul-Vdp

scan_interval: 0 will be a really useful feature when implemented - I wholly support it.

saulleighton23 avatar Jun 30 '24 04:06 saulleighton23

Please try pre-release v8.0.0!

danieldotnl avatar Sep 20 '24 20:09 danieldotnl

I really appreciate that you took care of my (and others') two requests. But I am truly sorry to say that at least the problem with the unwanted scrape on startup does not seem to be resolved, as it still happens to me ...?

Paul-Vdp avatar Sep 22 '24 09:09 Paul-Vdp

No worries, that's why it is a pre-release 😊 Can you share your config and debug logging?
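
For reference, a sketch of how debug logging for a custom integration is typically enabled in Home Assistant's configuration.yaml. The logger name assumes the integration lives under custom_components/multiscrape; the default level shown is just an example.

```yaml
logger:
  # Example default level for everything else
  default: warning
  logs:
    # Verbose logging for the Multiscrape custom integration only
    custom_components.multiscrape: debug
```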

danieldotnl avatar Sep 22 '24 12:09 danieldotnl

Sure, here you go:

`
multiscrape:
  - name: Scrape Flights
    resource: "https://www.tuifly.be/flight/nl/search?adults=2&nearByAirports=false&isOneWay=true"
    params:
      flyingFrom%5B%5D: "{{ states('var.flyfrom')|default('BRU') }}"
      flyingTo%5B%5D: "{{ states('var.flyto')|default('HRG') }}"
      depDate: "{{ states('input_datetime.scrape_date')|default(now().date()) }}"
    headers:
      User-Agent: "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
    scan_interval: 0
    sensor:
      - unique_id: scrape_flights
        name: Scrape Flights
        value_template: "Flights"
        attributes:
          - name: flights
            value_template: >
              {% set ns = namespace(flights=[]) %}
              {% set split = value.split('"flightViewData":')[1].split(',"seasonEndDate"')[0]|from_json %}
              {% for flight in split %}
              {% set weekDay = flight.departureDate|as_datetime|as_timestamp|timestamp_custom('%a') %}
              {% set detail = '{"weekDay":"' + weekDay
                + '", "departureDate":"' + flight.departureDate
                + '", "adultPrice":"' + flight.adultPrice
                + '", "depTime":"' + flight.journeySummary.depTime
                + '", "arrivalTime":"' + flight.journeySummary.arrivalTime
                + '", "journeyDuration":"' + flight.journeySummary.journeyDuration
                + '", "journeyType":"' + flight.journeySummary.journeyType
                + '", "arrivalAirport":"' + flight.flightsectors[0].arrivalAirport.name + '"}' %}
              {% set ns.flights = ns.flights + [detail] %}
              {% endfor %}
              {% set newflight = ns.flights|replace('\'', '') %}
              {% set oldflights = state_attr('sensor.scrape_flights', 'flights') %}
              {% if states('var.scrape_counter')|int == 1 or oldflights == [] or oldflights == null %}
              {{ newflight }}
              {% else %}
              {% set flights = oldflights + newflight|from_json %}
              {{ flights }}
              {% endif %}
            on_error:
              value: last
`

After startup, I get in the logs:

`
This error originated from a custom integration.

Logger: custom_components.multiscrape.entity
Source: custom_components/multiscrape/entity.py:158
integration: Multiscrape (documentation, issues)
First occurred: 11:29:46 (1 occurrences)
Last logged: 11:29:46

Scrape Flights # Scrape Flights # flights # Unable to extract data from HTML
`

which to me shows that it is still running the scrape on startup, no?

Paul-Vdp avatar Sep 22 '24 13:09 Paul-Vdp

Fixed in v8.0.1!

danieldotnl avatar Sep 23 '24 19:09 danieldotnl

Tested and approved! :-) Thanks @danieldotnl ...

Paul-Vdp avatar Sep 24 '24 09:09 Paul-Vdp