Create dataset cna_taiwan

Open albertvillanova opened this issue 3 years ago • 6 comments

  • uid: cna_taiwan
  • type: primary
  • description:
    • name: Central News Agency of Taiwan
    • description: The Central News Agency (CNA) is the national news agency of the Republic of China (ROC) and the most influential news organization in Taiwan.
    • homepage: https://focustaiwan.tw/aboutus
    • validated: True
  • languages:
    • language_names:
      • Chinese
    • language_comments: traditional chinese, taiwan (zh_TW)
    • language_locations:
      • Eastern Asia
      • Taiwan
    • validated: False
  • custodian:
    • name: Central News Agency of Taiwan
    • in_catalogue:
    • type: A government organization
    • location: Taiwan
    • contact_name: Central News Agency of Taiwan
    • contact_email: [email protected]
    • contact_submitter: True
    • additional: https://en.wikipedia.org/wiki/Central_News_Agency_(Taiwan)
    • validated: False
  • availability:
    • procurement:
      • for_download: No - we would need to spontaneously reach out to the current owners/custodians
      • download_url:
      • download_email: [email protected]
    • licensing:
      • has_licenses: No
      • license_text: Yes. There is previous work (2011) that used the source: https://catalog.ldc.upenn.edu/LDC2011T13
      • license_properties:
      • license_list:
    • pii:
      • has_pii: Unclear
      • generic_pii_likely:
      • generic_pii_list:
      • numeric_pii_likely:
      • numeric_pii_list:
      • sensitive_pii_likely:
      • sensitive_pii_list:
      • no_pii_justification_class: general knowledge not written by or referring to private persons
      • no_pii_justification_text:
    • validated: False
  • source_category:
    • category_type: collection
    • category_web:
    • category_media: news articles
    • validated: False
  • media:
    • category:
      • text
    • text_format:
    • audiovisual_format:
    • image_format:
    • database_format:
    • text_is_transcribed: No
    • instance_type:
    • instance_count:
    • instance_size:
    • validated: False
  • fname: cna_taiwan.json
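Most sections of the entry above are still marked `validated: False`. As a rough sketch of how such an entry could be checked, the snippet below parses a small subset of the record and lists the sections still awaiting validation. The JSON nesting, the boolean encoding of `validated`, and the helper name `unvalidated_sections` are assumptions made for illustration; the actual layout of `cna_taiwan.json` is not shown in this issue.

```python
import json

# Subset of the catalogue entry above, written as JSON.
# ASSUMPTION: the real cna_taiwan.json may nest or encode these fields differently.
ENTRY_JSON = """
{
  "uid": "cna_taiwan",
  "type": "primary",
  "description": {"name": "Central News Agency of Taiwan", "validated": true},
  "languages": {"language_names": ["Chinese"], "validated": false},
  "custodian": {"location": "Taiwan", "validated": false},
  "availability": {"licensing": {"has_licenses": "No"}, "validated": false},
  "source_category": {"category_type": "collection", "validated": false},
  "media": {"category": ["text"], "validated": false}
}
"""


def unvalidated_sections(entry):
    """Return the names of top-level sections whose 'validated' flag is false.

    Hypothetical helper, not part of any existing tooling.
    """
    return [
        name
        for name, section in entry.items()
        if isinstance(section, dict) and section.get("validated") is False
    ]


entry = json.loads(ENTRY_JSON)
pending = unvalidated_sections(entry)
print(pending)
```

Scalar fields such as `uid` and `type` are skipped because they carry no `validated` flag of their own.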

albertvillanova • Nov 23 '21 10:11

The superset dataset that contains this dataset (Chinese Gigaword Fifth Edition) also seems to be available from Uni Tübingen. Access is restricted, however, so it might help to reach out.

https://talar.sfb833.uni-tuebingen.de/erdora/cmdi/SFB833/INF/Corpus/Gigaword/Chinese%20Gigaword%20Fifth%20Edition

cakiki • Dec 05 '21 09:12

#self-assign

cakiki • Dec 05 '21 09:12

Chinese Gigaword Fifth Edition is from 2011; still better than nothing, though.

cccntu • Dec 07 '21 15:12

@cakiki

Tübingen was a member of the LDC in 2011. As a member institution, we were able to receive a free copy of this data set. The standard license does not permit member institutions to share the data with researchers who are not part of the respective organization. One solution to such issues may be to turn to derived data formats that do not touch upon copyright and privacy, but this is an ongoing discussion, also within the German national research data infrastructure. These derived formats would probably not be suitable for training in machine learning. Another option for legally using the data could be to take the algorithms to the data, i.e., to run, for example, training software at the data storage location under their license. However, this would require carefully monitoring the code for security reasons and evaluating the output to make sure that the licenses are not violated.

ttrippel • Dec 08 '21 08:12

@ttrippel Thank you for taking the time to answer!

cakiki • Dec 15 '21 14:12

I think we can keep this dataset out of the target datasets for the moment...

albertvillanova • Jan 31 '22 13:01