
[Norwegian] Some pages are not being scraped properly

Open C0rn3j opened this issue 7 years ago • 20 comments

Out of all the issues I opened here this one is the most important to me as I've used this project for creation of a Kindle-compatible dictionary, and incomplete/missing entries are the bane of every dictionary project ^^


https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": [], "audio": []}}]

https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": ["IPA: /h\u0251m/"], "audio": []}}]

https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l Missing the verb definition

[
	{
		"etymology": "From Old Norse býr (“place (to camp or settle), land, property, lot; and later settlement”).\n",
		"definitions": [
			{
				"partOfSpeech": "noun",
				"text": "by m (definite singular byen, indefinite plural byer, definite plural byene)\n\ntown, city (regardless of population size or land area)\n",
				"relatedWords": [
					{
						"relationshipType": "derived terms",
						"words": [
							"bydel",
							"byfornyelse, byfornying",
							"bygdeby",
							"bymessig",
							"bystat",
							"bystatus",
							"drabantby",
							"ferieby",
							"gamleby",
							"havneby",
							"hjemby",
							"landsby",
							"Mexico by",
							"naboby",
							"spøkelsesby",
							"storby"
						]
					}
				],
				"examples": []
			}
		],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	},
	{
		"etymology": "From byde, from Old Norse bjóða, from Proto-Germanic *beudaną (“to offer”), from Proto-Indo-European *bʰewdʰ- (“to wake, rise up”).\n",
		"definitions": [],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	}
]
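
A minimal sketch of how results like the three above could be flagged automatically, assuming the output keeps the shape shown in this thread (a list of entries, each with a `definitions` list). `empty_entries` is a hypothetical helper, not part of WiktionaryParser, and the `by` data is trimmed from the output above.

```python
import json


def empty_entries(entries):
    """Return the indexes of entries that came back without any definitions."""
    return [i for i, entry in enumerate(entries) if not entry.get("definitions")]


# Output reported above for "seg" (nothing parsed at all).
seg = json.loads(
    '[{"etymology": "", "definitions": [], '
    '"pronunciations": {"text": [], "audio": []}}]'
)

# Trimmed version of the output reported above for "by" (verb entry is empty).
by = [
    {"etymology": "From Old Norse býr", "definitions": [{"partOfSpeech": "noun"}]},
    {"etymology": "From byde", "definitions": []},
]

print(empty_entries(seg))  # [0]
print(empty_entries(by))   # [1]
```

An empty list means every entry has at least one definition; it does not prove the entry is complete, since a missing inflection line would still pass.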

Here's a list of errors from my project for words in Norwegian Bokmål. It is entirely possible that some of the errors are due to mistakes in my own scripts, but all the ones I checked were thrown because WiktionaryParser did not parse the entries properly, or at all.

https://haste.rys.pw/raw/vevafamiwo

Another half-broken entry -

https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l

C0rn3j avatar Jul 06 '18 21:07 C0rn3j

Seems fixed in 39ba27422ab33e104d0f034df1e62848a5229c48

suyashb95 avatar Jul 15 '18 13:07 suyashb95

Seems fixed indeed. Thank you a LOT.

Is there anywhere I can send you a few bucks to? Paypal?

C0rn3j avatar Jul 15 '18 13:07 C0rn3j

Appreciate it, but it's a hobby project so that's not necessary :D

suyashb95 avatar Jul 15 '18 13:07 suyashb95

And your hobby project is incredibly helpful to me, so if you change your mind and I ever see a donation page/button on the main page, I'll use it ^^


Actually found one more under løsrive, it's missing the inflection part - https://en.wiktionary.org/wiki/l%C3%B8srive#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From løs +‎ rive",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "(often reflexive, with seg / oneself)\nto break away\nto detach (oneself)\nto tear oneself away (fra / from)\nto secede (fra / from)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

EDIT: And one more in Bokmål - øl - it strips the first inflection line

https://en.wiktionary.org/wiki/%C3%B8l#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From Old Norse ǫl, from Proto-Germanic *alu, from Proto-Indo-European *h₂elut- (“beer”).\n",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "øl m (definite singular ølen, indefinite plural øl, definite plural ølene) (a glass, bottle or can of beer)\n\nbeer (alcoholic drink)\na beer (in a glass, bottle or can)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /œl/",
        "Rhymes: -œl"
      ],
      "audio": []
    }
  }
]

C0rn3j avatar Jul 17 '18 06:07 C0rn3j

Inflections seem to be turning up properly now, although they're a part of the definition text itself

suyashb95 avatar Aug 04 '18 14:08 suyashb95

Amazing, looking forward to a new release ^^

C0rn3j avatar Aug 04 '18 16:08 C0rn3j

That seems to have broken more than it fixed.

konkurs in Norwegian Bokmål in 0.0.8:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs (indeclinable)\n\nbankrupt\nkonkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

and after in 0.0.91:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

heis in 0.0.91 has a duped entry

[
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  },
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\nheis\nimperative of heise\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]
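
The konkurs (0.0.8) and heis (0.0.91) outputs above share one symptom: a definition's text starts with the complete text of an earlier definition. A rough sketch of a check for that, assuming `text` is still a single string as in these releases; `find_leaked_text` is a hypothetical helper, not part of WiktionaryParser.

```python
def find_leaked_text(entries):
    """Return (earlier_pos, later_pos) pairs where the later definition's
    text begins with the full text of an earlier definition."""
    seen = []   # (partOfSpeech, text) of definitions visited so far
    leaks = []
    for entry in entries:
        for definition in entry.get("definitions", []):
            pos = definition.get("partOfSpeech")
            text = definition.get("text", "")
            for earlier_pos, earlier_text in seen:
                if earlier_text and text != earlier_text and text.startswith(earlier_text):
                    leaks.append((earlier_pos, pos))
            seen.append((pos, text))
    return leaks


# Trimmed from the heis output reported above for 0.0.91.
noun_text = (
    "heis m (definite singular heisen, indefinite plural heiser, "
    "definite plural heisene)\n\nelevator (US), lift (UK)\n"
)
heis = [
    {"definitions": [{"partOfSpeech": "noun", "text": noun_text}]},
    {"definitions": [{"partOfSpeech": "verb",
                      "text": noun_text + "heis\nimperative of heise\n"}]},
]

print(find_leaked_text(heis))  # [('noun', 'verb')]
```

The check looks across entries as well as within one, because in the heis case the duplicated text sits in a separate entry.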

Here's more that broke for testing (first word of every line is what the entry is for, this is a diff)-

image

C0rn3j avatar Aug 04 '18 19:08 C0rn3j

Whoops, added a fix in another release

suyashb95 avatar Aug 05 '18 02:08 suyashb95

Okay, that looks much better, just a few things.

My scripts operate on the assumption that the inflections are before the first line break. Am unsure if that was true for every word in 0.0.8, but it certainly was for 99.9%+ of them.
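
For reference, that assumption amounts to roughly the following. This is only a sketch, not the actual script, and `split_inflection` is a hypothetical name; it assumes `text` is one string, as in the outputs earlier in this thread.

```python
def split_inflection(text):
    """Treat everything before the first line break as the inflection line
    and every later non-empty line as a definition."""
    head, _, rest = text.partition("\n")
    definitions = [line.strip() for line in rest.split("\n") if line.strip()]
    return head.strip(), definitions


# Noun text from the konkurs output reported for 0.0.91.
konkurs_noun = (
    "konkurs m (definite singular konkursen, indefinite plural konkurser, "
    "definite plural konkursene)\n\na bankruptcy\n"
)

inflection, definitions = split_inflection(konkurs_noun)
print(inflection)
print(definitions)  # ['a bankruptcy']
```

An inflection that spans more than one line breaks this: the second inflection line ends up in the definitions.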

In 0.0.92 this is no longer the case for bor and a handful of other entries, like faksimile. They otherwise seem to be scraped correctly, but there are now line breaks between the two inflection lines. Is this by design, and should I write a different kind of detection? It didn't use to be that way; I think it was just a space in the other words.

image

image

Other than that it seems to have broken a single word - pantergaupe, which is now missing the inflection part.

[
  {
    "etymology": "panter +‎ gaupe",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "Iberian lynx; Lynx pardinus\n",
        "relatedWords": [
          {
            "relationshipType": "synonyms",
            "words": [
              "iberisk gaupe",
              "spansk gaupe"
            ]
          }
        ],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /pan.ter.ɡæʉ.pe/, [ˈpɑn.təɾ.ˌɡæʉ̯ː.pə]"
      ],
      "audio": []
    }
  }
]

C0rn3j avatar Aug 05 '18 13:08 C0rn3j

Some of the inflections are in multiple lines so they'll be parsed that way. I'm going to fix inflection parsing for other words like pantergaupe in the dev branch for now. I'm experimenting with having definitions in a list of sentences instead of one long string, let's see if that works.

suyashb95 avatar Aug 05 '18 14:08 suyashb95

Ohhhh you're totally right! Never noticed nor realized this would be the problem.

image

I skimmed my definition list and apparently this was already an issue I was not handling. Your fix just made it more visible.

C0rn3j avatar Aug 05 '18 22:08 C0rn3j

Not sure if it's the same problem as pantergaupe, but maldivisk is missing the inflection line in the second definition (0.0.92).

https://en.wiktionary.org/wiki/maldivisk#Norwegian_Bokm%C3%A5l

image

BTW: I rewrote the detection part of my script, it seems to be working great, thanks for the fixes!

C0rn3j avatar Aug 13 '18 16:08 C0rn3j

Added some changes in 2ba2eea7d34d8e2ae57633210e648f6054d600ab to fix this. Also, the definition text is now a list so you may have to change your script
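
A minimal sketch of how a consuming script might cope with both shapes, assuming `text` is either one newline-separated string (older releases) or a list of strings (after this change). `text_lines` is a hypothetical helper, not part of WiktionaryParser.

```python
def text_lines(definition):
    """Return the definition text as a list of non-empty lines,
    whether the parser produced a string or a list."""
    text = definition.get("text", [])
    if isinstance(text, str):
        text = text.split("\n")
    return [line.strip() for line in text if line.strip()]


old_style = {"text": "konkurs (indeclinable)\n\nbankrupt\n"}
new_style = {"text": ["konkurs (indeclinable)", "bankrupt"]}

print(text_lines(old_style))  # ['konkurs (indeclinable)', 'bankrupt']
print(text_lines(new_style))  # ['konkurs (indeclinable)', 'bankrupt']
```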

suyashb95 avatar Aug 22 '18 16:08 suyashb95

Finally kicked myself to work on my script again, changes look awesome, thanks!

C0rn3j avatar Sep 08 '18 23:09 C0rn3j

Okay I only looked at my inflections output, premature celebration.

At some point your changes seem to have added garbage, in the form of the word itself, to the definitions of some words.

https://en.wiktionary.org/wiki/forrevet forrevet has a definition 'forrevet' which really shouldn't be there for example.

[
  {
    "etymology": "",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": [
          "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
          "alternative form of forreven",
          "forrevet",
          "neuter singular of forreven"
        ],
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

https://en.wiktionary.org/wiki/foreskrevet has the exact same issue and I'm sure there's a bunch of others

C0rn3j avatar Sep 08 '18 23:09 C0rn3j

I haven't encountered multiple subheadings under a definition yet. The subheadings usually contain inflections so the parser adds that to the list of definitions. I guess it should either not include them or separate them out from the definition list, probably in a field called word/inflections in the JSON

suyashb95 avatar Sep 09 '18 03:09 suyashb95

Yeah, it should separate it, or not do that, as I can't simply filter out if word X contains definition X because some words really are that way (best in bokmål means best).
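
To illustrate why a plain "drop any line equal to the word" filter is not safe, here is a rough position-based heuristic instead: drop a bare headword line only when another line follows it, on the assumption that such a line is a leaked subheading and that a bare headword in last position is a real definition. Both that assumption and the `best` data are hypothetical; `drop_bare_headword_lines` is not part of WiktionaryParser.

```python
def drop_bare_headword_lines(word, text):
    """Remove lines equal to the headword unless they are the last line.

    Heuristic only: a bare headword followed by more lines is assumed to be
    a subheading; a bare headword at the end is kept as a definition."""
    last = len(text) - 1
    return [line for i, line in enumerate(text) if line != word or i == last]


# Text list from the forrevet output reported above.
forrevet = [
    "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
    "alternative form of forreven",
    "forrevet",
    "neuter singular of forreven",
]

# Hypothetical shape for a word whose definition is the word itself.
best = ["best"]

print(drop_bare_headword_lines("forrevet", forrevet))
print(drop_bare_headword_lines("best", best))  # ['best']
```

This still guesses wrong whenever a genuine one-word definition is followed by further lines, which is why separating the headword lines in the parser output would be the more reliable fix.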

If you need more examples where this happens - støvete, uomskåret

C0rn3j avatar Sep 09 '18 09:09 C0rn3j

It looks like one of the updates also broke nested definitions

https://en.wiktionary.org/wiki/v%C3%A6re_glad_i

image

They weren't exactly scraped perfectly in the first place it seems, but now they're not scraped at all.

image

C0rn3j avatar Sep 20 '18 10:09 C0rn3j

Nested definitions and examples have ambiguous formatting so figuring that out is going to take some time

suyashb95 avatar Sep 23 '18 12:09 suyashb95

I've had luck with the Wiktionary contributors willing to redo old formatting and use a newer template for some snowflake definitions I ran into.

Not sure if that applies to these nested words. I could ask about them, but that'd require me to go through the diff and pick them out, and right now the diff has a lot of the "garbage" I mentioned above, so it'd be a pain to go through it in this state.

image

C0rn3j avatar Sep 23 '18 14:09 C0rn3j