Crux replaces page title with site title.
I've been running crux over several sites and noticed the following bug.
Problem
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
Test based off the README example to verify the problem:
@Test
fun broken() {
  val crux = Crux()
  val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()
  val document = Jsoup.connect(httpUrl.toString()).get()
  val resource = runBlocking {
    crux.extractFrom(httpUrl, document)
  }
  assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
}
The sequence of events is:
- HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
- WebAppManifestParser extracts the title "BBC"
- The fold operation in Crux.extractFrom uses Resource.plus to merge the resources, overwriting the title with "BBC": https://github.com/chimbori/crux/blob/3b4586cdf983e1ae7c64b4c2b20298404253b397/src/main/kotlin/com/chimbori/crux/api/Resource.kt#L51
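The sequence above can be reproduced without Crux at all. The following is a minimal sketch, assuming a simplified model in which each plugin returns a bag of fields and the results are folded together with a right-biased merge. `SimpleResource`, `SimplePlugin`, `runPlugins`, `htmlMetadata`, and `webManifest` are hypothetical stand-ins for illustration, not the library's actual types.

```kotlin
// Hypothetical stand-in for a Crux resource: just a bag of named fields.
data class SimpleResource(val fields: Map<String, String> = emptyMap()) {
  // Map.plus keeps the right-hand value when both sides define the same key,
  // so whichever resource is merged last wins.
  operator fun plus(other: SimpleResource) = SimpleResource(fields + other.fields)
}

// Hypothetical stand-in for a plugin: a function that produces some fields.
typealias SimplePlugin = () -> SimpleResource

// Folds the plugin results together in list order, as described above.
fun runPlugins(plugins: List<SimplePlugin>): SimpleResource =
  plugins.fold(SimpleResource()) { acc, plugin -> acc + plugin() }

val htmlMetadata: SimplePlugin = {
  SimpleResource(mapOf("title" to "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"))
}
val webManifest: SimplePlugin = { SimpleResource(mapOf("title" to "BBC")) }

fun main() {
  // Manifest runs last, so the site title overwrites the page title.
  println(runPlugins(listOf(htmlMetadata, webManifest)).fields["title"])
  // Manifest runs first, so the page title survives.
  println(runPlugins(listOf(webManifest, htmlMetadata)).fields["title"])
}
```

In this model the result depends only on plugin order, which is why the observed title changes depending on where WebAppManifestParser sits in the list.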
Possible solutions
If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor, like this:
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)
It will produce the correct results.
This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor, or can we reorder it?
If that is not possible, then we might need to consider a new way to handle merging the fields.
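For reference, one shape such an alternative merge could take is a "first value wins" rule, where a later plugin may add new fields but cannot replace one that is already set. This is only a sketch of the idea: `mergeKeepingExisting` is a hypothetical helper, not something Crux provides, and the field names are made up for illustration.

```kotlin
// Hypothetical helper, not part of Crux: merge two field maps but keep any
// value that is already present in `existing`.
fun <K, V> mergeKeepingExisting(existing: Map<K, V>, incoming: Map<K, V>): Map<K, V> =
  // Map.plus lets the right-hand side win, so putting `existing` on the right
  // means its entries take precedence over `incoming`.
  incoming + existing

fun main() {
  val fromHtml = mapOf("title" to "Article headline")
  val fromManifest = mapOf("title" to "BBC", "themeColor" to "#000000")
  // The title from the first map is kept; themeColor is added.
  println(mergeKeepingExisting(fromHtml, fromManifest))
}
```

The trade-off is that a later plugin could then no longer deliberately replace an earlier value, which the manifest parser's comment suggests it sometimes does for the favicon URL.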
That sounds perfect: solving via reordering the plugins is the best solution.
I didn't envision this exact scenario when writing it up, so this is a good bug that you reported.