Multiple selectors for direct descendants catch indirect descendants as well

Open Odepax opened this issue 3 years ago • 2 comments

Using org.jsoup:jsoup:1.14.3, a query such as .select("> .direct > .foo, > .direct > .bar") also selects .direct > .bar matches that are not direct descendants of the context element.

As a work-around: .selectFirst("> .direct")!!.select("> .foo, > .bar") seems to work fine.

package bug

import org.intellij.lang.annotations.*
import org.jsoup.*
import org.junit.*
import org.junit.Assert.*

class JsoupLearningTests {
   @Test
   fun direct_descendant_bug_1() { // Fails.
      @Language("HTML")
      val html = """
         <!DOCTYPE html>
         <html lang="en">
            <head>
               <meta charset="utf-8"/>
            </head>
            <body>
               <div class="entry">
                  <div class="entry__header">
                     <div class="interesting-container">
                        <span class="interesting-item">Y</span>
                        <span class="also-interesting-item">Y</span>
                     </div>
                  </div>
                  <div class="entry__body">
                     <p> ... </p>
                     <p> ... </p>
                     <div class="sub-entry entry">
                        <div class="entry__header">
                           <div class="interesting-container">
                              <span class="interesting-item">N</span>
                              <span class="also-interesting-item">N</span>
                           </div>
                        </div>
                        <div class="entry__body">
                           <p> ... </p>
                           <p> ... </p>
                        </div>
                     </div>
                  </div>
               </div>
            </body>
         </html>
      """

      val document = Jsoup.parse(html)
      val entry = document.selectFirst(".entry")!!
      val interestingItems = entry.select("> .entry__header > .interesting-container > .interesting-item, > .entry__header > .interesting-container > .also-interesting-item")
      val actual = interestingItems.joinToString("") { it.text() }

      assertEquals("YY", actual)
   }

   @Test
   fun direct_descendant_bug_2() { // Passes.
      @Language("HTML")
      val html = """
         <!DOCTYPE html>
         <html lang="en">
            <head>
               <meta charset="utf-8"/>
            </head>
            <body>
               <div class="entry">
                  <div class="entry__header">
                     <div class="interesting-container">
                        <span class="interesting-item">Y</span>
                        <span class="interesting-item">Y</span>
                     </div>
                  </div>
                  <div class="entry__body">
                     <p> ... </p>
                     <p> ... </p>
                     <div class="sub-entry entry">
                        <div class="entry__header">
                           <div class="interesting-container">
                              <span class="interesting-item">N</span>
                              <span class="interesting-item">N</span>
                           </div>
                        </div>
                        <div class="entry__body">
                           <p> ... </p>
                           <p> ... </p>
                        </div>
                     </div>
                  </div>
               </div>
            </body>
         </html>
      """

      val document = Jsoup.parse(html)
      val entry = document.selectFirst(".entry")!!
      val interestingItems = entry.select("> .entry__header > .interesting-container > .interesting-item")
      val actual = interestingItems.joinToString("") { it.text() }

      assertEquals("YY", actual)
   }

   @Test
   fun direct_descendant_bug_3() { // Passes.
      @Language("HTML")
      val html = """
         <!DOCTYPE html>
         <html lang="en">
            <head>
               <meta charset="utf-8"/>
            </head>
            <body>
               <div class="entry">
                  <div class="entry__header">
                     <div class="interesting-container">
                        <span class="interesting-item">Y</span>
                        <span class="also-interesting-item">Y</span>
                     </div>
                  </div>
                  <div class="entry__body">
                     <p> ... </p>
                     <p> ... </p>
                     <div class="sub-entry entry">
                        <div class="entry__header">
                           <div class="interesting-container">
                              <span class="also-interesting-item">N</span>
                           </div>
                        </div>
                        <div class="entry__body">
                           <p> ... </p>
                           <p> ... </p>
                        </div>
                     </div>
                  </div>
               </div>
            </body>
         </html>
      """

      val document = Jsoup.parse(html)
      val entry = document.selectFirst(".entry")!!
      val interestingItems = entry.select("> .entry__header > .interesting-container > .also-interesting-item, > .entry__header > .interesting-container > .interesting-item")
      val actual = interestingItems.joinToString("") { it.text() }

      assertEquals("YY", actual)
   }

   @Test
   fun direct_descendant_bug_4() { // Fails.
      @Language("HTML")
      val html = """
         <!DOCTYPE html>
         <html lang="en">
            <head>
               <meta charset="utf-8"/>
            </head>
            <body>
               <div class="entry">
                  <div class="entry__header">
                     <div class="interesting-container">
                        <span class="interesting-item">Y</span>
                        <span class="also-interesting-item">Y</span>
                     </div>
                  </div>
                  <div class="entry__body">
                     <p> ... </p>
                     <p> ... </p>
                     <div class="sub-entry entry">
                        <div class="entry__header">
                           <div class="interesting-container">
                              <span class="also-interesting-item">N</span>
                           </div>
                        </div>
                        <div class="entry__body">
                           <p> ... </p>
                           <p> ... </p>
                        </div>
                     </div>
                  </div>
               </div>
            </body>
         </html>
      """

      val document = Jsoup.parse(html)
      val entry = document.selectFirst(".entry")!!
      val interestingItems = entry.select("> .entry__header > .interesting-container > .interesting-item, > .entry__header > .interesting-container > .also-interesting-item")
      val actual = interestingItems.joinToString("") { it.text() }

      assertEquals("YY", actual)
   }
}

Not sure if it's a bug or a feature: in comparison, JS's .querySelectorAll("> .direct") throws an invalid-selector error.

Odepax, Jan 18 '22 16:01

My group and I will be fixing this issue this semester.

QAQGaeBolg, Mar 11 '22 13:03

Hi, I may have found the problem. When dealing with multiple subqueries, the method consumeSubQuery ignores the '>' at the start of the next subquery. The query is then evaluated as .select("> .direct > .foo") and .select(".direct > .bar") instead of the intended .select("> .direct > .foo") and .select("> .direct > .bar"). My fix is to check whether what follows is a subquery and, if so, add the '>' back to the query.
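
To make the difference concrete, here is a rough sketch in plain Kotlin. It is not jsoup's parser: both helper names are hypothetical, and the naive comma split ignores commas nested inside pseudo-selectors or attribute values. It only shows what the sub-queries look like when each one keeps its own leading combinator, versus when the combinator is dropped from every sub-query after the first.

```kotlin
// Hypothetical helper, not part of jsoup: splits a selector group on commas
// and keeps each sub-query's own leading combinator.
fun splitSelectorGroup(query: String): List<String> =
    query.split(',').map { it.trim() }.filter { it.isNotEmpty() }

// Hypothetical helper that mimics the reported behaviour: the leading '>'
// of every sub-query after the first one is lost.
fun splitDroppingCombinator(query: String): List<String> =
    splitSelectorGroup(query).mapIndexed { index, part ->
        if (index > 0) part.removePrefix(">").trim() else part
    }

fun main() {
    val query = "> .direct > .foo, > .direct > .bar"
    // Intended: two sub-queries, each anchored to the context element.
    println(splitSelectorGroup(query))
    // Reported: the second sub-query is no longer anchored by '>'.
    println(splitDroppingCombinator(query))
}
```

With the second form, the trailing sub-query is an ordinary descendant query, which would explain why nested matches show up in the failing tests.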

ShaokangXie, Apr 20 '22 08:04

Thanks, fixed!

> Not sure if it's a bug or a feature: in comparison, JS's .querySelectorAll(> .direct) throws about an invalid selector.

In jsoup, if the query starts with a combinator, we combine it against the root element. The root element is the Document or the context element.
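
A small sketch of that rule, assuming jsoup is on the classpath. The markup, the sampleHtml value, and the two function names are made up for illustration; only Jsoup.parse, selectFirst, select, and text are jsoup calls.

```kotlin
import org.jsoup.Jsoup

// Illustrative markup: an entry that contains a nested entry.
val sampleHtml = """
    <div class="entry">
      <div class="entry__header">outer</div>
      <div class="entry__body">
        <div class="entry"><div class="entry__header">inner</div></div>
      </div>
    </div>
"""

// Text of the headers that are direct children of the first .entry.
fun directHeaderTexts(html: String): List<String> {
    val entry = Jsoup.parse(html).selectFirst(".entry")!!
    // The leading '>' is combined against the context element (entry).
    return entry.select("> .entry__header").map { it.text() }
}

// Text of every header anywhere below the first .entry.
fun allHeaderTexts(html: String): List<String> {
    val entry = Jsoup.parse(html).selectFirst(".entry")!!
    return entry.select(".entry__header").map { it.text() }
}

fun main() {
    println(directHeaderTexts(sampleHtml)) // [outer]
    println(allHeaderTexts(sampleHtml))    // [outer, inner]
}
```

Calling select on the Document instead of an element makes the Document the root that the leading combinator is combined against.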

jhy, Oct 30 '23 01:10