selecting an attribute doesn't seem to work

Open thoraxe opened this issue 5 years ago • 0 comments

  • Platform: Linux t490s-festive-local 5.3.18-300.fc31.x86_64 #1 SMP Wed Dec 18 20:13:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Mercury Parser Version: 2.2.0
  • Node Version (if a Node bug): v12.15.0
  • Browser Version (if a browser bug): ??

Expected Behavior

https://moneymaven.io/mishtalk/economics/lie-of-the-day-this-is-not-a-pandemic-CdOIoPAmbEyglh3Ls6RXKQ

export const MoneymavenIoExtractor = {
  domain: 'moneymaven.io',

  title: {
    selectors: [
      'article h1'
    ],
  },

  date_published: {
    selectors: [
      ['meta[name="build:date"]', 'content'],
    ],
  },

  content: {
    selectors: [
      'article'
    ],
  },
}

Using ['meta[name="build:date"]', 'content'] should extract the value:

<meta name="build:date" content="2020-02-22 00:49:13 +0000">

Current Behavior

In the test, the value is not extracted:

  ● MoneymavenIoExtractor › initial test case › returns the date_published

    AssertionError [ERR_ASSERTION] [ERR_ASSERTION]: null == '2020-02-22 00:49:13 +0000'

      47 |     // Update these values with the expected values from
      48 |     // the article.
    > 49 |     assert.equal(date_published, '2020-02-22 00:49:13 +0000')
         |            ^
      50 |   });
      51 |
      52 |     it('returns the content', async () => {

      at Object.equal (src/extractors/custom/moneymaven.io/index.test.js:49:12)
      at tryCatch (node_modules/regenerator-runtime/runtime.js:62:40)
      at Generator.invoke [as _invoke] (node_modules/regenerator-runtime/runtime.js:288:22)
      at Generator.prototype.<computed> [as next] (node_modules/regenerator-runtime/runtime.js:114:21)
      at asyncGeneratorStep (src/extractors/custom/moneymaven.io/index.test.js:17:103)
      at _next (src/extractors/custom/moneymaven.io/index.test.js:19:194)

Steps to Reproduce

See above

Detailed Description

Running $$('meta[name="build:date"]') in the browser console finds exactly one element, so the selector itself matches. It's not clear why the parser isn't picking it up (see the null in the test output above).
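
One way to narrow this down is to inspect the raw HTML string rather than the browser's live DOM, since the two can differ (that they differ here is an assumption, not something confirmed in this report). The sketch below uses a hypothetical helper, findMetaByName, which is not part of Mercury; the sample HTML is just the tag quoted above, and in practice the string would be the page source as fetched.

```javascript
// Hypothetical helper (not part of Mercury): find <meta> tags by name in a
// raw HTML string and return their attributes as plain objects.
// Limitation: only double-quoted attribute values are recognized.
function findMetaByName(html, name) {
  const results = [];
  const tagPattern = /<meta\b[^>]*>/gi;
  const attrPattern = /([\w:-]+)\s*=\s*"([^"]*)"/g;

  for (const tag of html.match(tagPattern) || []) {
    const attrs = {};
    let match;
    attrPattern.lastIndex = 0; // restart attribute scanning for each tag
    while ((match = attrPattern.exec(tag)) !== null) {
      attrs[match[1].toLowerCase()] = match[2];
    }
    if (attrs.name === name) {
      results.push(attrs);
    }
  }
  return results;
}

// Sample input: the tag quoted earlier in this report.
const sampleHtml =
  '<head><meta name="build:date" content="2020-02-22 00:49:13 +0000"></head>';

const found = findMetaByName(sampleHtml, 'build:date');
console.log(found);
```

If the tag shows up in the raw source with the expected attribute, the difference is more likely in how the parser processes the document than in the page itself.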

Is this user error?
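
A variant that may be worth trying, as a sketch only: Mercury's bundled custom extractors generally select meta tags with 'value' rather than 'content', which suggests the parser renames the content attribute to value before selectors run. That behavior is an assumption here and has not been verified against this page. The name MoneymavenIoExtractorVariant is hypothetical; only the date_published selectors differ from the extractor above.

```javascript
// Variant of the extractor above; only date_published differs.
// ASSUMPTION: the parser exposes a meta tag's `content` attribute as `value`
// by the time selectors are evaluated. The original 'content' selector is
// kept as a fallback in case that assumption is wrong.
const MoneymavenIoExtractorVariant = {
  domain: 'moneymaven.io',

  title: {
    selectors: ['article h1'],
  },

  date_published: {
    selectors: [
      ['meta[name="build:date"]', 'value'],
      ['meta[name="build:date"]', 'content'],
    ],
  },

  content: {
    selectors: ['article'],
  },
};
```

Even if this selector matches, the returned date_published may be a normalized date string rather than the raw attribute text, so the expected value in the test might need adjusting as well.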

Other

This site is pretty terrible and appears to intentionally leave things unlabeled. I'm not sure I'll ever be able to provide a valid parser that grabs everything for it. I'll probably carry something in a local fork.

I am using mercury via https://github.com/feedbin/extract

thoraxe, Feb 26 '20 12:02