newspaper4k


AndyTheFactory


feat: introduce custom tags extractor

Open queukat opened this issue 10 months ago • 4 comments

Issue Link: n/a

Changes Overview:

Added a new TagsExtractor class that finds tags from <div class="tags-links"> or <div id="articleTag">
Injected TagsExtractor into ContentExtractor to unify custom tag extraction
Modified Article.parse() to merge extracted tags into article.tags
Updated docstrings and comments in English
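
A minimal sketch of the merge step, assuming article.tags is a set of strings and the extractor returns an iterable of strings. The helper name merge_tags is hypothetical and does not come from the PR; the whitespace cleanup is also an assumption.

```python
def merge_tags(existing, extracted):
    """Return a new set with the existing tags plus the cleaned extracted tags."""
    merged = set(existing)
    for tag in extracted:
        cleaned = tag.strip()
        # Skip entries that are empty after trimming whitespace.
        if cleaned:
            merged.add(cleaned)
    return merged
```

Returning a new set rather than mutating the input keeps the existing tags untouched when nothing is extracted.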

Limitations:

This approach currently only looks for <a class="lnk"> or <a rel="tag"> links. It might need to be extended for other patterns.
No localized or language-specific logic for tags yet.
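
To illustrate the two link patterns, here is a rough sketch that uses only Python's standard html.parser module. The PR itself presumably builds on the library's own parser, so the class TagLinkParser and the function extract_tag_links are hypothetical names for illustration only.

```python
from html.parser import HTMLParser


class TagLinkParser(HTMLParser):
    """Collect the text of <a> elements with class "lnk" or rel="tag"."""

    def __init__(self):
        super().__init__()
        self.tags = []        # extracted tag strings, in document order
        self._buffer = None   # text parts of the <a> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Both attributes may hold several space-separated tokens.
        classes = (attr.get("class") or "").split()
        rels = (attr.get("rel") or "").split()
        if "lnk" in classes or "tag" in rels:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.tags.append(text)
            self._buffer = None


def extract_tag_links(html):
    """Return the text of all matching links found in the HTML string."""
    parser = TagLinkParser()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Links that match neither pattern are ignored, so a page without such links yields an empty list.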

Breaking Changes:

None. This PR only adds new functionality; existing usage should be unaffected.

Testing Approach:

Manually tested with sample HTML containing <div class="tags-links"> and <div id="articleTag">
Verified it does not break existing extraction if these containers are not present.

Mar 06 '25 21:03 queukat

Hi @queukat! Thanks for your contribution!

What I wonder is: why create a new class instead of extending the functionality here: https://github.com/AndyTheFactory/newspaper4k/blob/c5e4170918a6d1e99cb1bab6fd188ee8ed5a2afa/newspaper/extractors/metadata_extractor.py#L164-L174

Mar 09 '25 20:03 AndyTheFactory

hey @AndyTheFactory

Short Answer

We introduced a dedicated TagsExtractor to keep “custom tag” logic separate from the standard metadata extraction that already happens in MetadataExtractor. This ensures we don’t complicate the existing logic for recognized meta/OG fields (title, description, keywords, canonical links, etc.), while still allowing us to parse specialized structures (e.g., <div class="tags-links">, <a class="lnk">, or <div id="articleTag">) that aren’t strictly part of typical metadata.

Detailed Comparison

Purpose and Scope

MetadataExtractor:

Focuses on collecting standard fields such as og:title, og:image, meta keywords, canonical links, and so on.
It has built-in definitions for recognized tags and attributes, like <meta name="description">, <meta property="og:type">, etc.

TagsExtractor:

Targets custom “tags” or site-specific containers (e.g., <div class="tags-links">), or links with rel="tag" or class "lnk".
These patterns often vary from site to site and do not necessarily appear in typical <meta> tags.

Reduced Complexity and Risk

Without a Separate Class:

Extending MetadataExtractor for special tags would couple two different responsibilities in one place (standard metadata vs. custom tag parsing).
That could introduce regressions or make future maintenance more confusing.

With TagsExtractor:

We keep the original metadata logic untouched.
If we want to add or remove patterns for custom tags later, we can do so in TagsExtractor alone, greatly reducing the chance of breaking any existing standard metadata extraction.

Single Responsibility Principle

MetadataExtractor: Responsible for standard, widely recognized metadata fields.
TagsExtractor: Focused solely on scanning containers and extracting text links that represent article “tags.”

This separation respects the Single Responsibility Principle, making each class easier to read, maintain, and test.

Flexibility and Future Proofing

MetadataExtractor: If the underlying library or standards for meta tags evolve, changes can stay isolated here.
TagsExtractor: If we need to accommodate new container styles (e.g., a new theme with <div class="post-tags">), or gather different “tag-like” items, we can evolve it independently without affecting broader metadata logic.
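
As a sketch of what "evolving independently" could look like, the container patterns can live in a plain list that callers extend. This is an assumption about the design, not the PR's actual code: DEFAULT_CONTAINERS, ContainerTagParser, and extract_container_tags are hypothetical names, and the standard html.parser module stands in for the library's own parser.

```python
from html.parser import HTMLParser

# Hypothetical defaults, taken from the containers named in this thread.
DEFAULT_CONTAINERS = [("class", "tags-links"), ("id", "articleTag")]


class ContainerTagParser(HTMLParser):
    """Collect link text found inside <div> containers matching given patterns."""

    def __init__(self, containers):
        super().__init__()
        self.containers = list(containers)
        self.tags = []
        self._div_depth = 0   # > 0 while inside a matching container
        self._buffer = None   # text parts of the current <a>, or None

    def _matches(self, attrs):
        attr = dict(attrs)
        return any(
            value in (attr.get(name) or "").split()
            for name, value in self.containers
        )

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Count nested divs so the matching container's close is found.
            if self._div_depth or self._matches(attrs):
                self._div_depth += 1
        elif tag == "a" and self._div_depth:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.tags.append(text)
            self._buffer = None


def extract_container_tags(html, containers=DEFAULT_CONTAINERS):
    """Return link text from every container matching one of the patterns."""
    parser = ContainerTagParser(containers)
    parser.feed(html)
    parser.close()
    return parser.tags
```

Supporting a new theme then means passing DEFAULT_CONTAINERS + [("class", "post-tags")] without touching any metadata code.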

No Impact on Existing Usage

All the standard extraction features remain exactly as they were.
The new tags logic is optional and only runs if you call get_tags() (or however it’s integrated).
This ensures no breaking changes for existing consumers of the library.

Conclusion

Creating a standalone TagsExtractor is a good architectural choice when dealing with site- or project-specific “custom tags.” It cleanly separates concerns from MetadataExtractor, which focuses on recognized meta/OG fields.

This approach follows best practices (like the Single Responsibility Principle), keeps the code base more modular and maintainable, and avoids risk of regressions in the existing metadata extraction.

Mar 21 '25 22:03 queukat

Hi @queukat

Thanks for your compelling arguments. You are right, it would make sense to have the tags in their own extractor.

Would it not make sense to move the whole functionality of _get_tags into the TagsExtractor?

Mar 23 '25 13:03 AndyTheFactory

@AndyTheFactory

You're right — conceptually, it would make sense to move _get_tags entirely into TagsExtractor. That said, I’d like to clarify the intent behind the current setup.

The TagsExtractor wasn’t designed as a general-purpose or centralized solution for tag extraction. It was created specifically to handle a few niche cases where tag elements weren't being captured — in particular, two smaller sites I came across where the existing metadata logic didn’t work. So rather than bloating MetadataExtractor, this was a way to isolate site-specific logic without introducing risk or complexity to the core system.

If we start fully migrating everything tag-related into TagsExtractor, we risk repeating the same issue — just in a different place — and eventually end up with a monolithic TagsExtractor.py, defeating the purpose of separation and modularity.

So yes, moving _get_tags is structurally reasonable, but it assumes TagsExtractor is meant to be a general solution — which it currently isn’t. If the goal is to evolve it into something more universal, that could be considered, but probably needs a broader discussion on scope and ownership.

Mar 24 '25 10:03 queukat