[dataset] WEKEX'11 wrapper

Open RicardoUsbeck opened this issue 11 years ago • 5 comments

Write a wrapper for the WEKEX'11 dataset. Annotate the license, experiment type and language. Give provenance. Update https://github.com/AKSW/gerbil/wiki/Licences-for-datasets

RicardoUsbeck avatar Nov 04 '14 15:11 RicardoUsbeck

Dear @RicardoUsbeck, I found the mentioned dataset at the following link: http://nerd.eurecom.fr/ui/evaluation/wekex2011-goldenset.tar.gz

The archive contains two CSV files with information about the entities annotated in the articles. We are facing the following problems:

  1. Since we only have the URLs of the articles and not the actual text, we might need to crawl them.
  2. The entity types are given, but they vary widely across documents. In some articles persons, locations, and organizations are marked, while in others numbers are marked.

Kindly let me know whether we should proceed with this dataset.

nikit-srivastava avatar Mar 30 '18 10:03 nikit-srivastava

Hi,

For 1) @TortugaAttack, any ideas? For 2) just collect the superset of types. That should not be a problem for GERBIL, even if the dataset quality is bad.
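
Collecting the superset of types could look roughly like the sketch below. It is only an illustration: the column name `type`, the helper names, and the sample rows are assumptions made here, not the actual layout of the WEKEX'11 CSV files, and it is written in Python rather than as a GERBIL wrapper.

```python
import csv
import io


def collect_entity_types(csv_text, type_column="type"):
    """Return the set of entity types found in one CSV document.

    `type_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    types = set()
    for row in reader:
        value = (row.get(type_column) or "").strip()
        if value:  # skip rows without a type
            types.add(value)
    return types


def superset_of_types(csv_texts, type_column="type"):
    """Union the entity types over all CSV documents, sorted for stable output."""
    merged = set()
    for text in csv_texts:
        merged |= collect_entity_types(text, type_column)
    return sorted(merged)


# Made-up sample rows, only to show the behaviour.
sample_a = "entity,type\nBarack Obama,Person\nLondon,Location\n"
sample_b = "entity,type\nBBC,Organization\n42,Number\nParis,Location\n"
all_types = superset_of_types([sample_a, sample_b])
```

Types that appear in only some articles (such as numbers) simply end up in the union, so no document has to be dropped.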

RicardoUsbeck avatar Mar 30 '18 10:03 RicardoUsbeck

Alternatively, since I'm hosting this dataset, you can just ask me :-) What exactly do you need?

rtroncy avatar Mar 30 '18 11:03 rtroncy

@rtroncy What is the licence of the dataset? Do you provide the data only to specific people, or why do I have to ask you? Wouldn't it be good to make it public, to support the community with as few barriers as possible?

cO68Iy avatar Apr 02 '18 15:04 cO68Iy

When we published the dataset in 2011, we didn't think about attaching a license (it was open in our minds). Later on, when we were asked, we replied that the license was CC BY 3.0, as for many datasets such as AIDA.

You don't have to ask me to get the data, you just have to open your eyes. As you have noticed, our ground truth annotations are available on the NERD web site. We cannot publish the original news articles "as is" because we do not own the copyright (the same problem arises with tweet datasets, where people provide only the tweet IDs and you have to retrieve the tweets yourself). However, we took care to pick articles from the BBC, which has a special status owing to its legal establishment. You will always be able to retrieve the original articles yourself, from the BBC archive or via the Internet Archive. Consequently, my offer to provide you with the articles is just to save you some time. We clearly do everything to enable reproducible science without limitations, if you take the time to check.
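
For anyone who prefers to retrieve the articles themselves, a minimal sketch of building a lookup against the Internet Archive's Wayback Machine availability endpoint follows. The article URL and the default timestamp are hypothetical placeholders, and the snippet only constructs the request URL; it performs no network access.

```python
from urllib.parse import urlencode

# Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_lookup_url(article_url, timestamp="20110101"):
    """Build a query asking for the snapshot closest to `timestamp` (YYYYMMDD).

    The default timestamp is an assumption, chosen because the dataset dates from 2011.
    """
    query = urlencode({"url": article_url, "timestamp": timestamp})
    return f"{WAYBACK_AVAILABILITY_API}?{query}"


# Hypothetical article URL, for illustration only.
lookup = wayback_lookup_url("http://www.bbc.co.uk/news/example")
```

Fetching `lookup` with any HTTP client should return a JSON document describing the closest archived snapshot, if one exists.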

rtroncy avatar Apr 02 '18 19:04 rtroncy