OmegaWiki: Integrate additional POS tags / remove multibyte chars / region labels / sense index
Tried the conversion of a newer OmegaWiki dump.
1. There are UTF16-multibyte characters which cannot be handled by the XML components
we use. Remove them for now.
2. There are additional part of speech tags that are currently not part of the POS
mapping (e.g., for the lemma "but"). Add more mapping entries.
3. The converter creates semantic labels of type "regionofUsage". While this
often contains information on the diatopic variety, it is a free-text field that also
includes other label types and longer explanations that do not correspond to our
definition of a label. Examples are:
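* http://www.omegawiki.org/Expression:Mietze - label: verniedlichend (German, roughly
"diminutive")
* http://www.omegawiki.org/Expression:usw. - label: Vor usw. steht in Aufzählungen
kein Komma. Es heißt also nicht "Bananen, Äpfel, usw.", sondern "Bananen,
Äpfel usw." (German, roughly: no comma precedes "usw." in enumerations, so it is
"Bananen, Äpfel usw." and not "Bananen, Äpfel, usw.")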
** cf. http://uby:8080/uby-browser/entry/OW_deu_LexicalEntry_24644
4. Currently Sense.index contains internal OmegaWiki IDs which should be encoded in
MonolingualExtRef. The index should be a running number corresponding to the natural
sense order defined by the Lexicon.
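To illustrate item 4, here is a minimal sketch of the intended layout: the index becomes a running number in lexicon order, and the OmegaWiki-internal ID moves into an external reference. `SenseStub`, `reindex` and the `externalSystem` string are hypothetical names for this example only; they are not the UBY classes, and starting the numbering at 1 is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SenseReindexSketch {

    /** Hypothetical stand-in for a sense; this is not the UBY Sense class. */
    public static class SenseStub {
        public final int index;                // running number in lexicon order
        public final String externalSystem;    // names the kind of external ID
        public final String externalReference; // the OmegaWiki-internal ID

        public SenseStub(int index, String externalSystem, String externalReference) {
            this.index = index;
            this.externalSystem = externalSystem;
            this.externalReference = externalReference;
        }
    }

    /**
     * Builds senses from OmegaWiki IDs given in the lexicon's natural order.
     * The ID is kept as an external reference instead of being used as the index.
     */
    public static List<SenseStub> reindex(List<String> owIdsInLexiconOrder,
                                          String externalSystem) {
        List<SenseStub> senses = new ArrayList<>();
        int running = 1; // assumption: numbering starts at 1
        for (String owId : owIdsInLexiconOrder) {
            senses.add(new SenseStub(running++, externalSystem, owId));
        }
        return senses;
    }

    public static void main(String[] args) {
        // The IDs and the system name below are made-up placeholders.
        for (SenseStub s : reindex(Arrays.asList("900002", "900001"), "OW_ID_PLACEHOLDER")) {
            System.out.println(s.index + " -> " + s.externalSystem + ":" + s.externalReference);
        }
    }
}
```

Any code that previously read the OmegaWiki ID from the index would then have to read it from the external reference instead.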
Original issue reported on code.google.com by chmeyer.de on 2014-10-09 09:01:03
I can take care of some or all of these issues. Please let me know if you have already
started working on any of these.
Original issue reported on code.google.com by matu011235 on 2014-10-09 11:28:27
Committed my changes. Feel free to review. Regarding the labels, it might be interesting
to classify them manually (possibly restricting the selection by length and/or frequency)
- there are valuable, but untyped semantic labels hidden in the annotations. Since
this sounds like labor-intensive work, I'll leave it for future work ;-)
Original issue reported on code.google.com by chmeyer.de on 2014-10-09 13:12:38
Changes look good. I agree that handling the labels can be delayed for now. However, I'm
not quite happy with excluding UTF-16 characters - according to the XML specification,
any XML processor should be able to handle them: http://www.w3.org/TR/xml/#charsets
Maybe we can look into that again later on; it's not an urgent issue, I guess.
Original issue reported on code.google.com by matu011235 on 2014-10-10 05:41:06
Agreed. For clarification: not all UTF-16 characters are removed, only values that contain
a character outside the Basic Multilingual Plane (i.e., a character that UTF-16 encodes as a
surrogate pair of two 16-bit code units). I assume that some UTF-8/UTF-16 conversion step is
missing somewhere in the process (reading from the DB, processing in Java, writing to the XML
file). This should be looked into. So far, I have removed these values because the converter
otherwise fails with an exception.
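For reference, a check of this kind can be sketched in plain Java as follows. This is only an illustration and not the converter's actual code; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SupplementaryCharCheck {

    /**
     * Returns true if the value contains a code point outside the Basic
     * Multilingual Plane, which a Java String stores as a surrogate pair.
     */
    public static boolean containsSupplementary(String value) {
        return value.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Keeps only the values that contain no supplementary code point. */
    public static List<String> dropAffectedValues(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (!containsSupplementary(v)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is U+1F600, a code point outside the BMP.
        List<String> values = Arrays.asList("Äpfel", "smile \uD83D\uDE00", "usw.");
        System.out.println(dropAffectedValues(values));
    }
}
```

Dropping whole values mirrors the workaround described above; it loses data, so it only makes sense until the underlying encoding problem is found.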
Original issue reported on code.google.com by chmeyer.de on 2014-10-10 07:35:29
unfortunately, changing this:
4. Currently Sense.index contains internal OmegaWiki IDs which should be encoded in
MonolingualExtRef. The index should be a running number corresponding to the natural
sense order defined by the Lexicon.
broke the import classes where OW alignments are imported, e.g.
OmegaWikiCrossLingualAlignment and
@Michael:
Would it be much effort to rewrite the problematic line, and could you do that?
Otherwise the import cannot continue.
List<Sense> first = ubySource.getSensesByOWSynTransId(""+source.getSyntransid());
getSensesByOWSynTransId no longer works, because the index attribute no
longer contains the required value.
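One possible direction, sketched under assumptions: look senses up by their external reference rather than by the index. The class below is a self-contained in-memory illustration; `SenseStub`, `add` and `getSensesByExternalRef` are hypothetical names and not part of the UBY API, and the system name and IDs in `main` are placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalRefLookupSketch {

    /** Hypothetical stand-in for a sense; this is not the UBY Sense class. */
    public static class SenseStub {
        public final String id;
        public final String externalSystem;
        public final String externalReference;

        public SenseStub(String id, String externalSystem, String externalReference) {
            this.id = id;
            this.externalSystem = externalSystem;
            this.externalReference = externalReference;
        }
    }

    // Maps (externalSystem, externalReference) to the senses carrying that reference.
    private final Map<String, List<SenseStub>> byRef = new HashMap<>();

    // NUL separator so that system and reference cannot run into each other.
    private static String key(String system, String ref) {
        return system + "\u0000" + ref;
    }

    public void add(SenseStub sense) {
        byRef.computeIfAbsent(key(sense.externalSystem, sense.externalReference),
                              k -> new ArrayList<>()).add(sense);
    }

    /** Returns all senses with the given external reference, or an empty list. */
    public List<SenseStub> getSensesByExternalRef(String system, String ref) {
        return byRef.getOrDefault(key(system, ref), Collections.<SenseStub>emptyList());
    }

    public static void main(String[] args) {
        ExternalRefLookupSketch lookup = new ExternalRefLookupSketch();
        lookup.add(new SenseStub("sense-a", "OW_ID_PLACEHOLDER", "900001"));
        System.out.println(lookup.getSensesByExternalRef("OW_ID_PLACEHOLDER", "900001").size());
    }
}
```

In the real import, the lookup would go through the database rather than a map; the point is only that the key is the pair of external system and reference.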
Original issue reported on code.google.com by eckle.kohler on 2014-10-16 09:10:33
I don't know how much effort it is, but I will look into it.
Original issue reported on code.google.com by matu011235 on 2014-10-16 10:51:40
I committed the changes. Please check and close the bug if the issue is resolved.
Original issue reported on code.google.com by matu011235 on 2014-10-16 11:16:03
thanks! I will check it tomorrow (first item on the agenda ;)
Original issue reported on code.google.com by eckle.kohler on 2014-10-16 18:28:33
updated OmegaWikiCrossLingualAlignment to new externalSystem value
Original issue reported on code.google.com by eckle.kohler on 2014-10-17 07:49:08
is fixed for the "lite" import
changes might still be necessary for the import of Wikipedia - OW alignments
Original issue reported on code.google.com by eckle.kohler on 2014-10-17 11:41:37
I updated the "medium import" as well - new externalSystem Value in OmegaWikiWiktionaryAlignment
Original issue reported on code.google.com by eckle.kohler on 2014-10-20 13:30:49
(No text was entered with this change)
Original issue reported on code.google.com by eckle.kohler on 2014-11-07 09:29:49
- Labels added: Milestone-0.7.0
- Labels removed: Milestone-0.6.0
(No text was entered with this change)
Original issue reported on code.google.com by richard.eckart on 2015-02-18 21:11:45
- Labels added: Module-integration.omegawiki
(No text was entered with this change)
Original issue reported on code.google.com by chmeyer.de on 2015-04-10 08:57:50
- Labels added: Milestone-0.8.0
- Labels removed: Milestone-0.7.0