crux Crux replaces page title with site title.

Crux replaces page title with site title.

Open ciferkey opened this issue 1 year ago • 1 comments

I've been running crux over several sites and noticed the following bug.

Problem

Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816

Test based off the README example to verify the problem:

  @Test
  fun broken() {
    val crux = Crux()

    val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()

    val document = Jsoup.connect(httpUrl.toString()).get()

    val resource = runBlocking {
      crux.extractFrom(httpUrl, document)
    }

    assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
  }

The sequence of events is:

HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
WebAppManifestParser extracts the title "BBC"
The fold operation in Crux.extractFrom uses Resource.plus to merge the resources overwriting the title with "BBC" https://github.com/chimbori/crux/blob/3b4586cdf983e1ae7c64b4c2b20298404253b397/src/main/kotlin/com/chimbori/crux/api/Resource.kt#L51

Possible solutions

If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:

public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)

It will produce the correct results.

This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor or can we reorder it?

If that is not possible then we might need to consider a new way to handle merging the fields.

Aug 01 '22 21:08 ciferkey

That sounds perfect: solving via reordering the plugins is the best solution.

I didn't envision this exact scenario when writing it up, so this is a good bug that you reported.

Aug 01 '22 21:08 chimbori

crux crux copied to clipboard

Crux replaces page title with site title.

Problem

Possible solutions

crux
crux copied to clipboard