Create dataset s2orc_the_semantic_scholar_open_research_corpus

Open albertvillanova opened this issue 4 years ago • 10 comments

  • uid: s2orc_the_semantic_scholar_open_research_corpus
  • type: primary
  • description:
    • name: S2ORC: The Semantic Scholar Open Research Corpus
    • description: Largest collection of machine-readable English-language open-access scientific literature formatted to support NLP research. 136M papers with titles and abstracts, including 12.7M papers with full text. Unifies popular resources like PubMed Central (Biomedicine) and arXiv (Physics, Math, CS) with papers sourced across many different academic disciplines. Maintained by the Semantic Scholar Research team at AI2. https://aclanthology.org/2020.acl-main.447/
    • homepage: https://github.com/allenai/s2orc
    • validated: True
  • languages:
    • language_names:
      • English
    • language_comments:
    • language_locations:
      • World-Wide
    • validated: False
  • custodian:
    • name: Semantic Scholar / Allen Institute for AI
    • in_catalogue:
    • type: A nonprofit/NGO (other)
    • location: United States of America
    • contact_name: Kyle Lo
    • contact_email: [email protected]
    • contact_submitter: True
    • additional: http://allenai.org/
    • validated: False
  • availability:
    • procurement:
      • for_download: Yes - after signing a user agreement
      • download_url: https://docs.google.com/forms/d/1fUqUw68dDMnzFt58WgMi-FI33MPcVFpflN2G3Yjfn9c/edit
      • download_email:
    • licensing:
      • has_licenses: Yes
      • license_text:
      • license_properties:
        • non-commercial use
      • license_list:
        • cc-by-nc-2.0: Creative Commons Attribution Non Commercial 2.0 Generic
    • pii:
      • has_pii: Yes
      • generic_pii_likely: very likely
      • generic_pii_list:
        • names
        • email addresses
        • physical addresses
        • URLs
        • website account name or handle
      • numeric_pii_likely: somewhat likely
      • numeric_pii_list:
        • telephone numbers
      • sensitive_pii_likely: unlikely
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • source_category:
    • category_type: collection
    • category_web:
    • category_media: scientific articles/journal
    • validated: False
  • media:
    • category:
      • text
    • text_format:
      • .XHTML
      • .TXT
      • .CSV
      • .TEX
      • other
      • .JSON
    • audiovisual_format:
    • image_format:
      • other
      • .PDF
    • database_format:
      • .TAR
      • .JSON
      • .GZIP
      • .TGZ
    • text_is_transcribed: Yes - image
    • instance_type: article
    • instance_count: 1M<n<1B
    • instance_size: 100<n<10,000
    • validated: False
  • fname: s2orc_the_semantic_scholar_open_research_corpus.json

albertvillanova avatar Nov 23 '21 10:11 albertvillanova

cc @kyleclo

yjernite avatar Nov 26 '21 16:11 yjernite

#self-assign

j-chim avatar Dec 01 '21 15:12 j-chim

This dataset already exists on the Hub. However, based on the data card, the current version is likely v20190928 rather than the latest one, which means the dataset is missing the updates detailed here.

j-chim avatar Dec 04 '21 00:12 j-chim

Thanks @j-chim.

Link: https://huggingface.co/datasets/s2orc

The canonical dataset on the Hub uses the release dated 2020-12-01:

_ROOT_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/2020-12-01/"

albertvillanova avatar Jan 06 '22 16:01 albertvillanova

Sample:

{'id': '984774e366d3d4fcf0fce659f697478dccd4f93c',
 'title': 'Fully convolutional architecture vs sliding-window CNN for corneal endothelium cell segmentation',
 'paperAbstract': 'BackgroundCorneal endothelium (CE) images provide valuable clinical information regarding the health state of the cornea. Computation of the clinical morphometric parameters requires the segmentation of endothelial cell images. Current techniques to image the endothelium in vivo deliver low quality images, which makes automatic segmentation a complicated task. Here, we present two convolutional neural networks (CNN) to segment CE images: a global fully convolutional approach based on U-net, and a local sliding-window network (SW-net). We propose to use probabilistic labels instead of binary, we evaluate a preprocessing method to enhance the contrast of images, and we introduce a postprocessing method based on Fourier analysis and watershed to convert the CNN output images into the final cell segmentation. Both methods are applied to 50 images acquired with an SP-1P Topcon specular microscope. Estimates are compared against a manual delineation made by a trained observer.ResultsU-net (AUC=0.9938) yields slightly sharper, clearer images than SW-net (AUC=0.9921). After postprocessing, U-net obtains a DICE=0.981 and a MHD=0.22 (modified Hausdorff distance), whereas SW-net yields a DICE=0.978 and a MHD=0.30. U-net generates a wrong cell segmentation in only 0.48% of the cells, versus 0.92% for the SW-net. U-net achieves statistically significant better precision and accuracy than both, Topcon and SW-net, for the estimates of three clinical parameters: cell density (ECD), polymegethism (CV), and pleomorphism (HEX). The mean relative error in U-net for the parameters is 0.4% in ECD, 2.8% in CV, and 1.3% in HEX. The computation time to segment an image and estimate the parameters is barely a few seconds.ConclusionsBoth methods presented here provide a statistically significant improvement over the state of the art. U-net has reached the smallest error rate. 
We suggest a segmentation refinement based on our previous work to further improve the performance.',
 'entities': [],
 's2Url': 'https://semanticscholar.org/paper/984774e366d3d4fcf0fce659f697478dccd4f93c',
 'pdfUrls': ['https://bmcbiomedeng.biomedcentral.com/track/pdf/10.1186/s42490-019-0003-2'],
 's2PdfUrl': '',
 'authors': [{'name': 'Juan P. Vigueras-Guillén', 'ids': ['1404113169']},
  {'name': 'Busra  Sari', 'ids': ['113782037']},
  {'name': 'Stanley F. Goes', 'ids': ['1419382500']},
  {'name': 'Hans G. Lemij', 'ids': ['82955974']},
  {'name': 'Jeroen  van Rooij', 'ids': ['143638168']},
  {'name': 'Koenraad A. Vermeer', 'ids': ['1693033']},
  {'name': 'Lucas J. van Vliet', 'ids': ['134339884']}],
 'inCitations': ['c62f7bfce0aee11940925f4619d9f9be60c7b5ba',
  '0d1b8f40bfa29acd1ef6ced824bc89f74426422c',
  '8d4e0d120887cba389ad8897b660e8c663de6bb6',
  'ec3aad2e8c5ab6bdd3802cb7d147684541dc4631',
  'a38b350c48aa54b4905ce09877231949eb9ef96d',
  '61aa49a2cf81075bf7e2d081aa897a78f91b35c1',
  '178502d887d652758d33e453a85c85db3114cc1f',
  'e37385e5196dc2a884bea325bd939006f6eaf8ff',
  'f513c0e665f02c23b7cc04492ce161865c84785a',
  'bb8c06e6d41175417a951aaa9c59691b1a74cf59',
  'e999babf70ceb7382233fa17f5d838fc12080a25'],
 'outCitations': ['1366de5bb112746a555e9c0cd00de3ad8628aea8',
  '8ea4aac5f2995bbf8480018ecafa0b317d70177c',
  'b34f3795e3ef3167706e923024c40584d2aba0d9',
  'e225684e3172b2cbb41ed1b9d0791232043bf6eb',
  'f19284f6ab802c8a1fcde076fcb3fba195a71723',
  '168bb1336676ff7658425a8609b8ac5b8b10aaa2',
  'c3108c14216c0c80609743a20f5cea244f62620a',
  '4d376d6978dad0374edfa6709c9556b42d3594d3',
  '3c1e44605135ff2656cd122c8e158dfc7313cef8',
  '1cba60bd2d8bec7b0a02bbd6ddcbc2d59aa69da2',
  '6bc81f58f93b76b62042f510fb9f39ea5480802c',
  '2f754f83d9c92322924c4180dd549890d7e50352',
  '5fa91b23cef2dcc07cebe34a8b6e89e44a7e3e50',
  '0f84a81f431b18a78bd97f59ed4b9d8eda390970',
  'a6cb366736791bcccc5c8639de5a8f9636bf87e8',
  '29e03ddae4edf90d49defe4143151411e7822cd6',
  'fdd3d4c4c9c8a3866f860ff7c23393de39d098a1',
  '317aee7fc081f2b137a85c4f20129007fd8e717e',
  '5f8bc6319446f38d62fb154505312115e7f59baf',
  '6364fdaa0a0eccd823a779fcdd489173f938e91a',
  '3c09bb369efdcae72f4d2aee8c25efb473864162',
  'a5d36aa713a02851276aab629f2c4797aa0749a3',
  '876c4856b41efc07581ddc0a9c6c63d1598e43f2',
  '48378c5f5a472def51401bd14e2ee131141bc064',
  '44c481f3f17a8b1e9d1965ad37b1f88eca252779',
  'c02a962e75d570dde44ba3a2f0ea7616e8d1a78f',
  '49dbeef8655afd3a7abb0fcfb7a2e27da8c09d2d',
  'eafb457dde5c76a789071067a76864c283586e51',
  '5e83ab70d0cbc003471e87ec306d27d9c80ecb16',
  '5562a56da3a96dae82add7de705e2bd841eb00fc',
  '49b178278a6ade55938ce2d5398cd7a1f9b3e922',
  'fcf43325529c8b1cc26aeb52fd5d7e532abb0a40',
  '09193e19b59fc8f05bee9d6efbfb1607ca5b6501',
  '62766c143e0ca5974a3ddf6fdfa6472862af8970',
  '10c0f65c03f9d7fa120205d4da97977df988d0c8',
  '58ad20f90181c920b60d06fe80519f4551ad2030',
  '15227b55dd33c6a3285af680fb21194c039df3e8',
  '3dc1e5fbf7842c214554aac02343cfd1b44ea435',
  '23045299013e8738bc8eff73827ef8de256aef66',
  '34f25a8704614163c4095b3ee2fc969b60de4698',
  '5e1d7732f7e3fa229feb7248d0dc763b8abac0a5',
  '42cac845ecb43000186f1d8ad4aacc2e2d0ce3cc',
  'f51ac2bdcc2addfd2fe540ac483ae4d54016400e',
  '77d1b1faf44e7f0ce7b36fcc9e0c2d3296178dd9',
  '13db654a257b1f094a6eb07d1f6378ae34f54163',
  'e794ba3dca412f432a8e99f9a84b1f6514b42829',
  'd9dab7574d56ae81efe6c90c213c6509b36cf950',
  'abd1c342495432171beb7ca8fd9551ef13cbd0ff',
  '09ef0911c4579fe14ae1c1b7ba8156cc43bd440e',
  'a8e8f3c8d4418c8d62e306538c9c1292635e9d27',
  '4fb85c05adbaf101a781c0ccc78017be41d47d17',
  '6a22e8ddce1eee044f990e1ac3b7ec0fb77cca0e',
  '8f35c1b4c4cc06176350d827a555e06cc86d3f67',
  'b0d64e7135d043185aee8b89a89372fb79deed94',
  'f7237cc2f7fdcf462e8b0b4c2c6e645facdb0e0b',
  '9f265054bed707c49a11214126f975cc565d3279',
  '5c8d575d43f24c86698181e42e62065a37f021c4',
  'a4c0881ea9e08b674bdc81286c07d3fc8dc0cc78',
  'a53a49dde4bc6a7ee8b9cd4b8f60ac724a194fc0'],
 'fieldsOfStudy': ['Computer Science', 'Medicine'],
 'year': 2019,
 'venue': 'BMC biomedical engineering',
 'journalName': 'BMC Biomedical Engineering',
 'journalVolume': '1',
 'journalPages': '',
 'sources': ['Medline'],
 'doi': '10.1186/s42490-019-0003-2',
 'doiUrl': 'https://doi.org/10.1186/s42490-019-0003-2',
 'pmid': '32903308',
 'magId': '2943196448'}

albertvillanova avatar Jan 20 '22 13:01 albertvillanova

#self-assign

lvwerra avatar Jan 26 '22 09:01 lvwerra

Two questions about integrating this for language modeling:

  1. Should title and abstract be concatenated?
  2. The description on the Hub and here states that the dataset also includes full text; however, I cannot see where it would be. Do you know?

cc @albertvillanova

lvwerra avatar Jan 26 '22 09:01 lvwerra

Also note that the dataset contains other languages as well. Looking at a few examples in the dataset viewer on the Hub, one can see Chinese and Japanese examples too.

lvwerra avatar Jan 26 '22 10:01 lvwerra

Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_s2orc

Note: For now this only contains the title and abstract of each paper, separated by a newline. Documents without an abstract are filtered out (it seems this also helps a bit with the language contamination, as samples in other languages often only have a title). cc @albertvillanova @yjernite

Sample:

{'text': "Clinical or Industrial Pharmacy? Case Studies of Hospital Pharmacy Automation in Canada and France...",
 'meta': "{'id': '062e9c7579adc73129e1198671d05905f07d4ab5'}"}
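
For reference, the processing described in the note above could be sketched roughly as follows. This is a minimal sketch, not the script that was actually used: the function name `to_lm_example` and the toy records are hypothetical, the field names `title`, `paperAbstract` and `id` come from the raw sample earlier in this thread, and the stringified `meta` dict is only inferred from the sample output.

```python
from typing import Optional


def to_lm_example(record: dict) -> Optional[dict]:
    """Build a {'text', 'meta'} example, or return None if there is no abstract."""
    # Assumption: missing fields may be None or empty strings.
    title = (record.get("title") or "").strip()
    abstract = (record.get("paperAbstract") or "").strip()
    if not abstract:
        # Title-only documents are dropped.
        return None
    # Title and abstract are joined by a single newline; meta is a
    # stringified dict, mirroring the sample output format.
    return {"text": f"{title}\n{abstract}", "meta": str({"id": record["id"]})}


# Toy records standing in for rows of the raw dataset (hypothetical data).
records = [
    {"id": "a1", "title": "A title", "paperAbstract": "An abstract."},
    {"id": "b2", "title": "Title only", "paperAbstract": ""},
]
examples = [ex for ex in map(to_lm_example, records) if ex is not None]
```

Only the first toy record survives the filter, since the second one has an empty abstract.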

lvwerra avatar Jan 28 '22 09:01 lvwerra

Thanks @lvwerra.

Should I keep this issue open until we know if we will also have the full text?

albertvillanova avatar Jan 31 '22 11:01 albertvillanova