How to properly extract unclassified parts?

Open TLINDEN opened this issue 11 months ago • 0 comments

I have the following HTML snippet:

<ul class="addetailslist">
  <li class="addetailslist--detail">
    Art<span class="addetailslist--detail--value" >
    Weitere Kinderzimmermöbel</span>
  </li>
  <li class="addetailslist--detail">
    Farbe<span class="addetailslist--detail--value" >
    Holz</span>
  </li>
  <li class="addetailslist--detail">
    Zustand<span class="addetailslist--detail--value" >
    In Ordnung</span>
  </li>
</ul>

These are 3 different attributes:

  • "Art" (en: Type) with value "Weitere Kinderzimmermöbel"
  • "Farbe" (en: Color) with value "Holz"
  • "Zustand" (en: Condition) with value "In Ordnung"

My current attempt to parse this looks like this:

type Ad struct {
  Details      []string `goquery:".addetailslist--detail--value,text"`
  [..]
}
var CONDITIONS = []string{"Neu", "Gut", "Sehr Gut", "In Ordnung"}
var COLORS = []string{"Beige", "Blau", "Braun", "Bunt", "Burgunderrot",
	"Creme", "Gelb", "Gold", "Grau", "Grün", "Holz", "Khaki", "Lavendel",
	"Lila", "Orange", "Pink", "Print", "Rot", "Schwarz", "Silber",
	"Transparent", "Türkis", "Weiß", "Sonstige"}

[..]
	for _, detail := range advertisement.Details {
		switch {
		case slices.Contains(CONDITIONS, detail):
			advertisement.Condition = detail
		case slices.Contains(COLORS, detail):
			advertisement.Color = detail
		default:
			advertisement.Type = detail
		}
	}

So, this works, kinda.

But the obvious problem is that it fails if the lists overlap (e.g. a Type that also occurs as a Color) or if the site adds or removes values. I'd have to constantly monitor these lists and update my code.

As far as I understand the DOM, the attribute names such as "Art" or "Zustand" are just text nodes inside the <li> elements. Of course, I could parse this with hand-written Go code (using a tokenizer or regexes). But look at the string I get when I extract the whole text of the list using goquery:".addetailslist,text":

Art
                                        Weitere Kinderzimmermöbel
                                    
                                
                                        Farbe
                                        Holz
                                    
                                
                                        Zustand
                                        In Ordnung

I could try to trim it and parse it line-wise. But how stable would that be? Any tiny change might break my code.
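
For illustration, a minimal sketch of that line-wise approach might look like the following. It uses only the standard library; parseDetails is a hypothetical helper name, and the sketch assumes that every label and every value sits on its own line, in strict alternation. An empty or multi-line value breaks that assumption, which is exactly the fragility in question.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDetails is a hypothetical helper, not part of goq or goquery.
// It trims every line of the raw list text, drops the empty ones, and
// pairs the remaining lines as label/value (assumption: strict alternation).
func parseDetails(raw string) (map[string]string, error) {
	var lines []string
	for _, line := range strings.Split(raw, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			lines = append(lines, line)
		}
	}
	// An odd number of lines means the label/value assumption no longer holds.
	if len(lines)%2 != 0 {
		return nil, fmt.Errorf("expected label/value pairs, got %d lines", len(lines))
	}
	details := make(map[string]string, len(lines)/2)
	for i := 0; i < len(lines); i += 2 {
		details[lines[i]] = lines[i+1]
	}
	return details, nil
}

func main() {
	// Shortened stand-in for the text extracted via ".addetailslist,text".
	raw := "Art\n      Weitere Kinderzimmermöbel\n   \n   \n     Farbe\n     Holz\n  \n  \n    Zustand\n    In Ordnung\n"
	details, err := parseDetails(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(details["Art"], "|", details["Farbe"], "|", details["Zustand"])
}
```

With this, the struct fields could be filled by label (details["Zustand"] and so on) instead of by matching values against fixed lists, but the result still depends entirely on the whitespace layout of the page.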

Maybe there's a better way. Do you have an idea?

Any help would be much appreciated! Tom

TLINDEN, Feb 06 '25 19:02