
Ideas for developer experience improvements

Open • jayaddison opened this issue 2 years ago • 28 comments

There are a few places where we might be able to spruce up and improve the developer experience at the moment. Faster, more consistent, fewer configuration points, more straightforward workflows - that kind of stuff.

Some examples from recent experience:

  • [x] Using tox to provide a consistent unit test and linting approach (motivation: #615) -- (status: done in #650)
  • [x] Writing some scraper developer guidance (motivation: https://github.com/hhursev/recipe-scrapers/issues/366#issuecomment-1263871073) (see #861)
  • [x] pytest: do we still need it? (status: removed in #659)
  • [x] run_tests.py: do we still need it? (status: removed in #660)
  • [x] Test settings modules - anything that we could simplify there?
  • [x] Make it easier to add multiple expected inputs & outputs for a scraper (#817 would resolve this I think)
  • [x] Simplify the issue templates?
  • [x] ~~Use the same method order in template scraper and template unit tests (motivation: see https://github.com/hhursev/recipe-scrapers/pull/673#issue-1423993837)~~ (this was a misunderstanding - Python's unittest framework sorts test case methods by name before running them)
  • [x] Enable sphinx-lint for the README file?
  • [ ] mypy: type-check method signatures on subclasses of AbstractScraper (this is most of them!) - depends on python/mypy#3903
  • [ ] Automatically detect cases where a scraper unnecessarily overrides what the abstract schema.org scraper would return -- so that we can clean up and keep code minimal.
  • [ ] Usability improvements for string filtering? (making normalize_string easier to use, and reducing repetition of it throughout the codebase, or making those code paths more Pythonic)

jayaddison avatar Oct 01 '22 18:10 jayaddison

Adding one more item: debugging plugin-related issues is tricky at the moment, and I think the reason is that it's not clear/transparent what plugins are enabled for each method call.

They're important because they catch and handle various forms of noisy web-based input for a number of method calls (title, ingredients, and instructions in particular), but they're challenging to reason about and may produce unexpected behaviour at runtime.

Labelling every method with multiple decorators would look spammy, so I don't think that'd be a great approach. We should also bear in mind that plugins are developer-customizable -- callers may wish to opt-in or opt-out of various post-processing steps. And also we shouldn't introduce unnecessary overhead -- but nor should we allow bad/unsafe data on the basis of technical minimalism.

Taking HTML unescaping as an example: at the moment I think we HTML unescape twice for fields that we expect may contain HTML content. As far as I can tell, there's one particular site which incorrectly double-escapes HTML entities in their schema.org content, and that's why we do this. Perhaps the default should be to HTML unescape once, but allow decoration with html_unescape(rounds=2) or similar to override that per-site. And if a performance-sensitive caller felt very confident that a particular scraper required no unescaping, they could somehow override method calls to state html_unescape(rounds=0).
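For illustration, a minimal sketch of what that per-site override could look like -- the html_unescape decorator here is hypothetical, not an existing API, and only AbstractScraper is borrowed from the library:

import html
from functools import wraps

from recipe_scrapers._abstract import AbstractScraper


def html_unescape(rounds=1):
    # Hypothetical decorator: unescape HTML entities in a method's
    # return value `rounds` times before handing it to the caller.
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            value = method(self, *args, **kwargs)
            for _ in range(rounds):
                value = html.unescape(value)
            return value
        return wrapper
    return decorator


class DoubleEscapedSite(AbstractScraper):
    # Imaginary example: a site that double-escapes entities in its schema.org content.
    @html_unescape(rounds=2)
    def title(self):
        return self.schema.title()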

Basically I think that the default implementation should be "safe and correct" but with the opportunity to skip processing steps for power-users (by which I mean sophisticated power-data-users really -- sites with enough data to know where and when it makes sense to omit a step for performance reasons).

What this implies, I think, is a pipeline for each individual method. Perhaps taken to logical extremes it means that there is no "recipe scrapers", but instead a "recipe title scraper", "recipe instructions scraper", ... - and each is a defined pipeline for a {website, timerange, conditions} set. With the goal being to achieve near-100% (and accurate) coverage of known web content across all websites and timeranges.

I don't think we should attempt that yet - Python classes that wrap entire sites make much more sense in terms of user convenience and developer attention (context-switching). But basically each method here is codifying "here's where we think that the remote data provider has put this particular item of information for a given URL", followed by "and here are the steps we wish to apply to convert that into something that we believe is safe and accurate for our users".

If we become good enough at that, we can "upstream" any errors and oddities back to the original source and basically create a positive feedback loop that reduces the number of exceptions required (while still ideally detecting them when they occur).

jayaddison avatar Oct 01 '22 23:10 jayaddison

A few more notes for a developer guide:

  • It'd be worth including recommendations about what the output of methods -- particularly instructions -- should look like
  • FAQ section?
  • Guidance about exception handling and how to debug scrapers during development

And another idea: should we have a similar code review / maintainer guide?

jayaddison avatar Oct 06 '22 15:10 jayaddison

I'm going to add some random, slightly-thought-out things:

  • Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.
  • Networking code moved to scrape_me: If scrape_me is the officially recommended way of using this package then why do we put so much code into the scraper constructors? It might help our async friends if they can write a single "networking" adapter they insert into the factory.
  • Replace requests with urllib: Not really thought out, but we don't use many features other than basic GET.
  • Use poetry: Personal bias and I know there was pushback before. I just don't like pip's version management.
  • Switch to pyproject.toml: Gotta keep up with how the language is evolving and this will clean up our codebase a touch.

@jayaddison What's the scraper that incorrectly double-escapes HTML? That might be a good individual issue.

bfcarpio avatar Oct 12 '22 06:10 bfcarpio

Remove schema based scrapers that have no overloads

I mostly agree with this, with a few small notes:

  • This could make incremental mypy progress easier, because we'd have fewer methods to check & update
  • Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers
  • We do still need a hostname-to-scraper mapping somewhere

Networking code moved to scrape_me

Thinking about scraping as a pipeline could help here. There's content retrieval (HTML/WARC/etc), then parsing, then extraction of metadata, and then mapping the metadata to output format(s) (python method calls, JSON, ...).

Scraper developers should only really have to care about the extraction part (and only when schema / microdata / foo doesn't already extract what they need). I'm not sure how we most easily represent that in Python classes. At the moment the retrieval and extraction steps are slightly bundled together.
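To make the stages concrete, here's a rough sketch of that separation -- every function here is hypothetical, intended only to show where the boundaries would sit:

import requests
from bs4 import BeautifulSoup


def retrieve(url):
    # Stage 1: content retrieval (HTML/WARC/...) - the part that callers
    # may want to swap out for async clients, proxies, message queues, etc.
    return requests.get(url, timeout=10).text


def parse(html):
    # Stage 2: parsing into a queryable document.
    return BeautifulSoup(html, "html.parser")


def extract(soup):
    # Stage 3: metadata extraction - the only stage scraper developers
    # should usually need to touch.
    return {"title": soup.find("title").get_text()}


def to_output(metadata):
    # Stage 4: mapping the metadata to output format(s) (JSON, method calls, ...).
    return metadata


recipe = to_output(extract(parse(retrieve("https://example.com/recipe"))))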

Replace requests with urllib

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Use poetry

I'm 50/50 on that. I agree the version management and tracking could be useful. At the moment (after some lxml support issues in #629) I'm more in the mindset of removing dependencies where possible (maybe I should focus on removing the ones that I added, first..)

Switch to pyproject.toml

Agreed, makes sense. Note: we do have a MANIFEST.in file that I think is still used, let's try not to forget about that during migration.

jayaddison avatar Oct 12 '22 16:10 jayaddison

What's the scraper that incorrectly double-escapes HTML?

Unfortunately I'm not sure I kept a note of that. From a quick search, I think a few sites are affected (seriouseats, homechef, food52, ...). It's typically in recipe-related element contents, not the entire page (a developer adding an extra htmlencode(...) for safety, for example).

jayaddison avatar Oct 12 '22 16:10 jayaddison

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, add comments suggesting to test with wild mode first? Or do you mean putting in a trail of code that would help someone parse HTML better?

We do still need a hostname-to-scraper mapping somewhere

What for? If a scraper can be handled entirely via wild_mode, then wouldn't the client use AbstractScraper? I've forgotten whether that's the correct type that's returned.

At the moment the retrieval and extraction steps are slightly bundled together.

I would think of scrape_me as the default, easy-to-use, batteries-included retrieval implementation. If someone wants to use async code, or have their input be rabbitmq, then they can write their own handler and pass it to a dedicated scraper class - something like the sketch below. I'm assuming here that the data is the entire HTML content of the page, so extraction can proceed as normal.
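Sketching that idea with an async retrieval handler -- the ExampleScraper class and its constructor signature are placeholders, not an existing API:

import asyncio

import aiohttp


async def fetch(url):
    # Any retrieval mechanism would work here: async HTTP, a rabbitmq
    # consumer, a WARC reader - anything that yields the page's HTML.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


url = "https://example.com/recipe"
html = asyncio.run(fetch(url))
scraper = ExampleScraper(html=html, url=url)  # hypothetical scraper accepting pre-fetched HTML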

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Oh, then never mind. It's not that big of a deal.

bfcarpio avatar Oct 13 '22 01:10 bfcarpio

Remove schema based scrapers that have no overloads

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in put comments to first test with wild mode or you mean even first put a trail of code that would help someone parse HTML better?

A hypothetical scenario: let's say that a schema-supported scraper becomes nothing more than a class with a host method.

In that case, I don't think we want new pull requests with template-generated scrapers that have a complete set of self.schema.* method calls.

(at the risk of overcomplicating things: perhaps ideally we'd want the generator to attempt to call each schema method, and only include methods for fields that were missing entirely from the input HTML schema and microdata)
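Under that scenario, a whole schema-supported scraper could shrink to something like this sketch (the site name is a placeholder; every field method would be inherited from the abstract schema.org implementation):

from recipe_scrapers._abstract import AbstractScraper


class ExampleSite(AbstractScraper):
    @classmethod
    def host(cls):
        return "examplesite.com"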

We do still need a hostname-to-scraper mapping somewhere

What for?

A couple of reasons I can think of - potentially resolvable:

  • Unit tests assert on the host() value. We could determine it from the canonical URL (if the HTML contains one), although not in a @classmethod - we'd need the page data for that approach (see the sketch after this list)
  • Backwards compatibility: the fact that SCRAPERS is a top-level public export from the module
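For the first point, a sketch of the canonical-URL approach (the helper name is hypothetical; note that it needs the page data, which is why it couldn't be a @classmethod):

from urllib.parse import urlparse

from bs4 import BeautifulSoup


def host_from_html(html):
    # Derive the host from the page's canonical URL, when one is present.
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    return urlparse(link["href"]).netloc if link else None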

jayaddison avatar Oct 13 '22 10:10 jayaddison

And after a bit more thought, two more reasons for keeping a hostname-to-scraper mapping:

  • For sites that don't include structured recipe metadata on each page, we need a way to discover the relevant scraper to use from an input URL
  • For sites where we want to apply augmentations/corrections to metadata retrieved from the page, then we need a way to override the default wild_mode results - and that also means we need to identify the relevant scraper to use

(those are both fairly important use-cases, I think)

jayaddison avatar Oct 13 '22 11:10 jayaddison

How about we do the following in '22:

  • [ ] remove the "plugins" idea altogether
  • [x] remove v12_settings
  • [ ] in case we end up with ~5 configs total, remove the settings module altogether. Switch to environment flags (as opposed to the current approach, where you set an environment variable that points to a "settings" file)
  • [x] remove requests. (And always lean towards fewer dependencies going forward)
  • [x] switch to pyproject.toml
  • [x] simplify the issue templates (halve them in terms of characters, just to have a clear aim)

In a separate issue:

  • [x] Move the "networking" code out and allow easy usage of the package with async/proxies/timeout (we should leave example code too). No demo with javascript/browser-check bypass. Just make a [new entity] that does the HTTP requests needed and gets the HTML, then pass that down to our scrapers.

hhursev avatar Oct 16 '22 18:10 hhursev

Sounds ambitious, and good :)

About the plugins: before removing them, would it make sense to move them onto the AbstractScraper as a default list? People could still override them as necessary per-scraper, or in third-party libraries by subclassing, and I think it'd result in clearer code in this library (as in, fewer lines to read to understand what they're doing).
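Roughly like this sketch -- the PLUGINS class attribute is hypothetical, and the plugin names are as I recall them from the plugins module:

from recipe_scrapers.plugins import (
    ExceptionHandlingPlugin,
    HTMLTagStripperPlugin,
    NormalizeStringPlugin,
)


class AbstractScraper:
    # Hypothetical: defaults declared on the base class itself,
    # rather than in a separate settings module.
    PLUGINS = [
        ExceptionHandlingPlugin,
        HTMLTagStripperPlugin,
        NormalizeStringPlugin,
    ]


class QuirkySite(AbstractScraper):
    # Per-scraper override: opt out of string normalization for this site.
    PLUGINS = [
        ExceptionHandlingPlugin,
        HTMLTagStripperPlugin,
    ]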

About settings: we only have a small number of them, right?

  • TEST_MODE - maybe we could remove this by refactoring some code
  • META_HTTP_EQUIV - does that need to be a setting? (has it caused problems?)
  • SUPPRESS_EXCEPTIONS - probably makes the library much more usable in production -- but makes dev/debugging harder. environment variable seems good there
  • PLUGINS - handled by the plugins plan

jayaddison avatar Oct 16 '22 20:10 jayaddison

  • Agree on TEST_MODE
  • If removing META_HTTP_EQUIV doesn't break any tests it shouldn't be a setting
  • Just an environment variable for SUPPRESS_EXCEPTIONS, yes

Seems like settings module should be removed 😄

Implementing plugins as a default list in AbstractScraper sounds nice to begin with - yep. Let's go down this path.

hhursev avatar Oct 16 '22 21:10 hhursev

I'd like to introduce #650 (a migration to use tox), although I'm not totally sure what everyone's opinions might be about it. Feedback appreciated!

jayaddison avatar Oct 17 '22 18:10 jayaddison

I'd like to take a stab at the networking code. I've got a couple of ideas ranging from lazily adding a callback to writing some terrible XFactoryFactory OOP code.

bfcarpio avatar Oct 19 '22 05:10 bfcarpio

in case we end up with ~5 configs total, remove the settings module altogether. Switch to environment flags

From attempting some modifications, I'm going to suggest a small change to this part of the plan:

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

The benefit there is that we'll have a single place where settings are defined and read -- avoiding the multiple-config-files situation, and also avoiding multiple os imports appearing around the codebase.
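A sketch of what that single module could look like (the environment variable names here are hypothetical):

# recipe_scrapers/settings.py
import os

# Every setting is defined and read in this one place, sourced from the environment.
SUPPRESS_EXCEPTIONS = os.environ.get("RECIPE_SCRAPERS_SUPPRESS_EXCEPTIONS") == "1"
TEST_MODE = os.environ.get("RECIPE_SCRAPERS_TEST_MODE") == "1"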

jayaddison avatar Oct 19 '22 17:10 jayaddison

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

class Template(AbstractScraper):

    # The self.soup object provides a way to write queries for contents on the page
    # For example: the following retrieves the contents of a 'title' HTML tag
    def title(self):
        return self.soup.find("title").get_text()

    # When a scraper field is accessed, the code in your method always runs first - it's a precise set of instructions provided by you.
    # You can also add generic extractors -- like 'schemaorg' here -- that are used if your code doesn't return a value during scraping.
    @field.schemaorg
    def ingredients(self):
        return self.soup.find(...)

    # Example: since this method is empty ('pass'), the opengraph metadata extractor will always be called
    # If opengraph extraction fails, then schemaorg extraction is called next
    @field.schemaorg
    @field.opengraph
    def image(self):
        pass

jayaddison avatar Oct 21 '22 15:10 jayaddison

A pyproject.toml migration is ready for review in #655. After comparing the poetry and setuptools approaches, and working through various backwards-compatibility and workflow questions related to them, I think that setuptools is the simpler route for us to take at the moment.

jayaddison avatar Oct 22 '22 10:10 jayaddison

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

I'd vote to make it just a settings.py/config.py file then, and remove the RecipeScraperSettings class (whose idea was to allow users to write more complex settings -- dict/list structures etc. -- through a file, and also to let settings changes take immediate effect without needing to restart the program/script using recipe_scrapers). We don't need that. FWIW I'd pick config.py for the naming.

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

I don't see it as an improvement. In my opinion:

class Template(AbstractScraper):
    # .....
   
    def ingredients(self):
        return self.schema.ingredients()

    def image(self):
        return super().opengraph_image()

is good enough. If the schema/opengraph image is there, it will work. If not, devs will adjust. I see no real downsides.

After comparing the poetry and setuptools approaches

let's not switch to poetry for now (ever?). I personally don't see much/any gain in our case. Open for discussion (I use poetry in another project I'm part of and see no real gain in using it here).

hhursev avatar Oct 22 '22 23:10 hhursev

I'm going to take some time out from maintenance/dev work here for a week or two. I feel like I've been commenting a lot and thinking/attempting a lot of stuff, but it's felt kinda disjointed or rushed at times.

Moving some of the larger/breaking changes here into the v15 branch makes sense, I think - it makes it easier to develop all that without worrying so much about maintaining compatibility. Hopefully in a few weeks I'll feel refreshed enough to code up some more stuff.

jayaddison avatar Oct 28 '22 17:10 jayaddison

Hey @jayaddison, take as much time as you need. v15 won't be merged in without your approval :wink:

You've done a great job in the last month! Thank you very much. The pyproject.toml introduction, the speed-up of the CI :star:, the schema.org parser updates, merging incoming PRs in a speedy manner, and addressing so many of the points in this issue. Prototyping different stuff and being prompt with communication on top of all that.

I really respect the work and time you spend on this project! Kudos to eager people like you supporting OSS.

hhursev avatar Oct 29 '22 12:10 hhursev

remove the "plugins" idea altogether

Maybe there'd be an alternative implementation, but after getting back into the code again recently, I do think that the plugins are useful. #705 looks like a quick fix made possible by the SchemaOrg plugin, for example.

jayaddison avatar Dec 17 '22 13:12 jayaddison

Typing: Considering most developers are using IDEs that make use of type-hinting, it would be helpful if the base API were implemented with type signatures throughout.

jayhale avatar Jan 02 '23 22:01 jayhale

Maybe controversial opinion: I think that the v15 branch should include a breaking change where instructions() returns type list[str] (instead of a newline-joined str), mirroring ingredients().

Making those two methods more equivalent should, I think, make the developer experience easier, and also I think it's a slightly better caller/user experience.

I won't lie: sometimes it's difficult to extract multi-item instructions from websites. But from experience in #852 recently, I think it's possible. In rare cases where it's not, or is too difficult, we can return a list containing a single item.
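For comparison, a sketch of the two call patterns, where scraper is any instantiated scraper:

# Today: instructions() returns a newline-joined string...
steps = scraper.instructions().split("\n")

# ...whereas under the proposal it returns list[str], mirroring ingredients():
for step in scraper.instructions():
    print(step)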

jayaddison avatar Sep 14 '23 11:09 jayaddison

I've pushed commit 0b87a5141a0d5497566425110d6d7947ad1bc76f to the v15 branch -- it removes the scrape_me interface, meaning that the library doesn't have any network dependencies.

Is that good/bad? Is it a developer experience improvement? I'm not really sure. It's a bit annoying to have to retrieve some HTML from a URL, and then pass both the HTML and the URL for scraping. Not terrible, but compared to one import and one method call, it's kinda involved.
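For reference, the resulting two-step workflow looks roughly like this -- using requests for retrieval, and assuming the scrape_html entry point with an org_url keyword as I understand the v15 branch:

import requests

from recipe_scrapers import scrape_html

url = "https://example.com/recipe"
html = requests.get(url, timeout=10).text  # retrieval is now the caller's job
scraper = scrape_html(html, org_url=url)   # both the HTML and the URL are passed in
print(scraper.title())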

jayaddison avatar Oct 03 '23 17:10 jayaddison

Move the "networking" code out and allow easy usage of the package with async/proxies/timeout (we should leave example code too). No demo with javascript/browser-check bypass. Just make a [new entity] that does the HTTP requests needed and gets the HTML, then pass that down to our scrapers.

I've published a release candidate, v15.0.0-rc1 that includes this change. I'll begin using that for RecipeRadar soon so that it gets some production testing.

jayaddison avatar Nov 30 '23 13:11 jayaddison

Ah, no I haven't - v15.0.0-rc1 is completely network-isolated, but the suggestion there was to include a utility method to do the HTTP retrieval. I'll do that for rc2 (it'll re-add requests as an optional dependency, and I think that's fine).

jayaddison avatar Nov 30 '23 13:11 jayaddison

Also: the scrape_html method can probably be simplified further in v15 to remove some of the proxy/timeout logic. That can move into the suggested utility method.

jayaddison avatar Nov 30 '23 13:11 jayaddison

* Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.

@bfcarpio this might be super straightforward now that we use data-driven tests (#944). Would you like to take a look at that? I think a large amount of code could be cleaned up.

jayaddison avatar Mar 13 '24 16:03 jayaddison

I'll admit, I haven't worked on this repo in a long time. I'll be quite busy the next couple months as well. Funny how life challenges your priorities.

Yes, I would like to have this work done and reduce the code, but I can't commit to anything.

bfcarpio avatar Mar 14 '24 16:03 bfcarpio

I think that the tasks from here are complete enough that this can be closed; the mypy type-hinting inference for subclass methods would be very nice, but may take a while to arrive and can be handled as a separate task in future. Changes to normalize_string likely require a more detailed specification and/or specific bug reports that they address.

Thank you all for your input :)

On a tangential note: I plan to release a final (non-release-candidate) version of v15 within the next few days.

jayaddison avatar Jul 15 '24 09:07 jayaddison