Part of Speech Tagging Dataset

Open zachgk opened this issue 5 years ago • 8 comments

Description

This task is to add at least one part-of-speech (POS) tagging dataset. Such datasets provide an example of an NLP token classification task and are also useful for training multi-purpose NLP models. A good candidate might be one from Universal Dependencies.

zachgk avatar Apr 08 '20 21:04 zachgk

Hi, I wonder if any other websites also host this dataset, since the Penn Treebank dataset from the Linguistic Data Consortium costs $1700.

AKAGIwyf avatar Apr 06 '22 03:04 AKAGIwyf

That's a good point @AKAGIwyf. I changed it to use a different POS dataset which should be freely available

zachgk avatar Apr 08 '22 16:04 zachgk

We've found a version of the Penn Treebank that is free on GitHub but, like the Torchtext version, comes without POS tags. It has already been pre-processed, and I've written the code for it. Would this kind of dataset still be useful to you?

AKAGIwyf avatar Apr 10 '22 14:04 AKAGIwyf

@AKAGIwyf If you want to add it, more datasets are always good. It gives users more options for which one they want to train on. The main goal for this issue was to add at least one POS dataset.

zachgk avatar Apr 12 '22 22:04 zachgk

Hi @zachgk, I'm interested in this issue and I want to work on it, so I wonder if you can assign it to me? Thanks!

Noah-Lan avatar Apr 19 '22 15:04 Noah-Lan

@AKAGIwyf, were you working on this besides the Penn Treebank? @LanAtGitHub is interested in working on this, but I don't want to give it to a second person if you have already started

zachgk avatar Apr 19 '22 21:04 zachgk

Hello @zachgk, I'm a teammate of @AKAGIwyf. We discussed it and decided that I will try to fix this issue.

Noah-Lan avatar Apr 20 '22 01:04 Noah-Lan

Just wanted to make sure. It's assigned to @LanAtGitHub.

zachgk avatar Apr 20 '22 03:04 zachgk

The above PR added a POS dataset.

siddvenk avatar Nov 10 '22 23:11 siddvenk