feat: introduce custom tags extractor
Issue Link: n/a
Changes Overview:
- Added a new TagsExtractor class that finds tags from or
Hi @queukat ! Thanks for your contribution!
What I wonder is - why create a new class and not extend the functionality here: https://github.com/AndyTheFactory/newspaper4k/blob/c5e4170918a6d1e99cb1bab6fd188ee8ed5a2afa/newspaper/extractors/metadata_extractor.py#L164-L174
hey @AndyTheFactory
Short Answer
We introduced a dedicated TagsExtractor to keep “custom tag” logic separate from the standard metadata extraction that already happens in MetadataExtractor. This ensures we don’t complicate the existing logic for recognized meta/OG fields (title, description, keywords, canonical links, etc.), while still allowing us to parse specialized structures (e.g., <div class="tags-links">, <a class="lnk">, or <div id="articleTag">) that aren’t strictly part of typical metadata.
Detailed Comparison
Purpose and Scope
MetadataExtractor:
- Focuses on collecting standard fields such as og:title, og:image, meta keywords, canonical links, and so on.
- It has built-in definitions for recognized tags and attributes, like <meta name="description">, <meta property="og:type">, etc.
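To make the "standard fields" side concrete, here is a minimal sketch of meta-based collection. It is not the code in MetadataExtractor: it uses only the standard library's html.parser, and the names MetaCollector and collect_meta are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: maps each <meta> name/property to its content."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph fields use "property"; classic fields use "name".
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content is not None:
            # Keep the first occurrence of each field.
            self.fields.setdefault(key, content)


def collect_meta(html):
    """Return a dict of meta fields found in the given HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.fields


sample = """
<head>
  <meta property="og:title" content="Example headline">
  <meta name="description" content="A short summary.">
  <meta name="keywords" content="python, scraping">
</head>
"""
fields = collect_meta(sample)
print(fields["og:title"])  # Example headline
```

Everything here keys off a fixed, well-known vocabulary, which is what distinguishes it from the site-specific patterns handled by TagsExtractor.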
TagsExtractor:
- Targets custom “tags” or site-specific containers (e.g., <div class="tags-links">), or links with rel="tag" or class "lnk".
- These patterns often vary from site to site and do not necessarily appear in typical <meta> tags.
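As a rough illustration of the custom-tag side, the sketch below collects link text from the patterns named in this thread. It is an assumption-laden stand-in, not the TagsExtractor from this PR: it uses the standard library's html.parser, and TagLinkCollector, extract_tags, and the three pattern sets are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed patterns, copied from the examples mentioned in this thread.
CONTAINER_CLASSES = {"tags-links"}
CONTAINER_IDS = {"articleTag"}
LINK_CLASSES = {"lnk"}


class TagLinkCollector(HTMLParser):
    """Hypothetical helper: gathers the text of links that look like tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._div_depth = 0  # > 0 while inside a tag container <div>
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "div":
            is_container = (
                classes & CONTAINER_CLASSES or attrs.get("id") in CONTAINER_IDS
            )
            # Track nesting so inner <div>s do not close the container early.
            if self._div_depth or is_container:
                self._div_depth += 1
        elif tag == "a":
            rel = set((attrs.get("rel") or "").split())
            if self._div_depth or "tag" in rel or classes & LINK_CLASSES:
                self._in_link = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse whitespace and skip duplicates, preserving order.
            text = " ".join("".join(self._buffer).split())
            if text and text not in self.tags:
                self.tags.append(text)
            self._in_link = False
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1


def extract_tags(html):
    """Return tag-like link texts in document order, without duplicates."""
    parser = TagLinkCollector()
    parser.feed(html)
    return parser.tags


sample = """
<div class="tags-links"><a href="/t/python">Python</a> <a href="/t/web">Web</a></div>
<p>Filed under <a rel="tag" href="/t/news">News</a> and
<a class="lnk" href="/t/python">Python</a>.</p>
<a href="/about">About</a>
"""
print(extract_tags(sample))  # ['Python', 'Web', 'News']
```

Matching any link with class "lnk" is deliberately loose here; on a real site that rule would likely need to be scoped to a container to avoid false positives.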
Reduced Complexity and Risk
Without a Separate Class:
- Extending MetadataExtractor for special tags would couple two different responsibilities in one place (standard metadata vs. custom tag parsing).
- That could introduce regressions or make future maintenance more confusing.
With TagsExtractor:
- We keep the original metadata logic untouched.
- If we want to add or remove patterns for custom tags later, we can do so in TagsExtractor alone, greatly reducing the chance of breaking any existing standard metadata extraction.
Single Responsibility Principle
- MetadataExtractor: Responsible for standard, widely recognized metadata fields.
- TagsExtractor: Focused solely on scanning containers and extracting text links that represent article “tags.”
This separation respects the Single Responsibility Principle, making each class easier to read, maintain, and test.
Flexibility and Future Proofing
- MetadataExtractor: If the underlying library or standards for meta tags evolve, changes can stay isolated here.
- TagsExtractor: If we need to accommodate new container styles (e.g., a new theme with <div class="post-tags">), or gather different “tag-like” items, we can evolve it independently without affecting broader metadata logic.
No Impact on Existing Usage
- All the standard extraction features remain exactly as they were.
- The new tags logic is optional and only runs if you call get_tags() (or however it’s integrated).
- This ensures no breaking changes for existing consumers of the library.
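The opt-in behaviour described above could look roughly like the sketch below. The Page class, its tag_parse_count counter, and the _RelTagParser stand-in are all hypothetical and exist only to show the shape of the idea; they are not the library's Article class or the integration used in this PR.

```python
from html.parser import HTMLParser


class _RelTagParser(HTMLParser):
    """Minimal stand-in extractor: collects the text of <a rel="tag"> links."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and "tag" in (dict(attrs).get("rel") or "").split():
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.tags.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._capturing = False


class Page:
    """Hypothetical wrapper showing that tag parsing is deferred."""

    def __init__(self, html):
        self.html = html
        self.title = self._parse_title()  # the standard path always runs
        self._tags = None  # custom tags are not computed up front
        self.tag_parse_count = 0  # illustration only: counts tag parses

    def _parse_title(self):
        start = self.html.find("<title>")
        end = self.html.find("</title>")
        if 0 <= start < end:
            return self.html[start + len("<title>"):end].strip()
        return ""

    def get_tags(self):
        # Parse on first request, then reuse the cached result.
        if self._tags is None:
            parser = _RelTagParser()
            parser.feed(self.html)
            self._tags = parser.tags
            self.tag_parse_count += 1
        return self._tags


page = Page('<title>Hi</title><a rel="tag" href="#">News</a>')
print(page.title, page.tag_parse_count)  # Hi 0
print(page.get_tags(), page.tag_parse_count)  # ['News'] 1
```

Callers that never touch get_tags() pay no extra cost and see no change in the fields they already rely on.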
Conclusion
Creating a standalone TagsExtractor is a good architectural choice when dealing with site- or project-specific “custom tags.” It cleanly separates concerns from MetadataExtractor, which focuses on recognized meta/OG fields.
This approach follows best practices (like the Single Responsibility Principle), keeps the code base more modular and maintainable, and avoids risk of regressions in the existing metadata extraction.
Hi @queukat
Thanks for your compelling arguments. You are right, it would make sense to have the tags in their own extractor.
Would it not make sense to move the whole functionality of def _get_tags into the TagsExtractor?
@AndyTheFactory
You're right — conceptually, it would make sense to move _get_tags entirely into TagsExtractor. That said, I’d like to clarify the intent behind the current setup.
The TagsExtractor wasn’t designed as a general-purpose or centralized solution for tag extraction. It was created specifically to handle a few niche cases where tag elements weren't being captured — in particular, two smaller sites I came across where the existing metadata logic didn’t work. So rather than bloating MetadataExtractor, this was a way to isolate site-specific logic without introducing risk or complexity to the core system.
If we start fully migrating everything tag-related into TagsExtractor, we risk repeating the same issue — just in a different place — and eventually end up with a monolithic TagsExtractor.py, defeating the purpose of separation and modularity.
So yes, moving _get_tags is structurally reasonable, but it assumes TagsExtractor is meant to be a general solution — which it currently isn’t. If the goal is to evolve it into something more universal, that could be considered, but probably needs a broader discussion on scope and ownership.