Proposal to add BioMRC

Open nomisto opened this issue 2 years ago • 5 comments

  • Name: BioMRC
  • Description: BioMRC: A Dataset for Biomedical Machine Reading Comprehension
  • Task: cloze-style MRC / multiple choice QA
  • Paper: https://arxiv.org/abs/2005.06376
  • Data: https://archive.org/details/biomrc_dataset
  • License: Unknown
  • Motivation: Interesting large-scale dataset with 6 different subsets (size and setting). Large variant contains over 700,000 examples.

nomisto avatar Apr 08 '22 14:04 nomisto

Huggingface dataset: https://huggingface.co/datasets/biomrc

nomisto avatar Apr 08 '22 14:04 nomisto

#self-assign

nomisto avatar Apr 08 '22 14:04 nomisto

@nomisto @hakunanatasha unit tests are failing for biomrc_large_A_source and biomrc_large_A_bigbio_qa. The log is provided below:

INFO:__main__:args: Namespace(dataloader_path='biodatasets/biomrc/biomrc.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 351M/351M [42:38<00:00, 137kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.5M/25.5M [02:39<00:00, 160kB/s]
Downloading:  95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍       | 29.9M/31.6M [05:48<00:18, 86.0kB/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [51:22<00:00, 1027.51s/it]███▎       | 29.9M/31.6M [05:48<00:10, 157kB/s]
 67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                       | 2/3 [00:07<00:03,  3.37s/it] 67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋                                                       | 2/3 [00:08<00:04,  4.33s/it]

======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 661, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 177, in _split_generators
    downloaded_files = dl_manager.download_and_extract(_URLS[version][setting])
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 307, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 283, in extract
    extracted_paths = map_nested(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 260, in map_nested
    mapped = [
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 261, in <listcomp>
    _single_map_nested((function, obj, types, None, True))
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 196, in _single_map_nested
    return function(data_struct)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 325, in cached_path
    output_path = ExtractManager(cache_dir=download_config.cache_dir).extract(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 40, in extract
    self.extractor.extract(input_path, output_path, extractor=extractor)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 179, in extract
    return extractor.extract(input_path, output_path)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 72, in extract
    shutil.copyfileobj(gzip_file, extracted_file)
  File "/opt/anaconda3/lib/python3.9/shutil.py", line 205, in copyfileobj
    buf = fsrc_read(length)
  File "/opt/anaconda3/lib/python3.9/gzip.py", line 300, in read
    return self._buffer.read(size)
  File "/opt/anaconda3/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/anaconda3/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

----------------------------------------------------------------------
Ran 1 test in 3092.383s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_bigbio_qa
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_bigbio_qa to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_bigbio_qa/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8338.58it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1571.49it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
    for key, record in utils.tqdm(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 207, in _generate_examples
    biomrc = json.load(fp)
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)

----------------------------------------------------------------------
Ran 1 test in 56.702s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_small_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_small_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_small_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_small_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
  0%|                                                                                                                                                                              | 0/3 [00:00<?, ?it/s^CTraceback (most recent call last):██▎                                                                                                                               | 7.93M/60.2M [02:05<03:47, 230kB/s]
  File "/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 721, in <module>
    unittest.TextTestRunner().run(TestDataLoader())
  File "/opt/anaconda3/lib/python3.9/unittest/runner.py", line 176, in run
    test(result)
  File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 651, in __call__
    return self.run(*args, **kwds)
  File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 592, in run
    self._callTestMethod(testMethod)
  File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
    method()
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 661, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 177, in _split_generators
    downloaded_files = dl_manager.download_and_extract(_URLS[version][setting])
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 307, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 195, in download
    downloaded_path_or_paths = map_nested(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 260, in map_nested
    mapped = [
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 261, in <listcomp>
    _single_map_nested((function, obj, types, None, True))
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 196, in _single_map_nested
    return function(data_struct)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 216, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
    output_path = get_from_cache(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 659, in get_from_cache
    http_get(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 464, in http_get
    for chunk in response.iter_content(chunk_size=1024):
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/requests/models.py", line 760, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/urllib3/response.py", line 579, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/urllib3/response.py", line 522, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "/opt/anaconda3/lib/python3.9/http/client.py", line 462, in read
    n = self.readinto(b)
  File "/opt/anaconda3/lib/python3.9/http/client.py", line 506, in readinto
    n = self.fp.readinto(b)
  File "/opt/anaconda3/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/opt/anaconda3/lib/python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/opt/anaconda3/lib/python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
Downloading:  13%|███████████████████▎                                                                                                                              | 7.95M/60.2M [02:05<13:44, 63.3kB/s]
  0%|                                                                                                                                                                              | 0/3 [02:09<?, ?it/s]

(bigbioenv) (base) -----------------------------------------------------------------------------------------------------------------------------------------------------
~/repo/bigscience/biomedical (master*) » python3 -m tests.test_bigbio biodatasets/biomrc/biomrc.py                                                                     130 ↵ skang@Myungsuns-MacBook-Pro

INFO:__main__:args: Namespace(dataloader_path='biodatasets/biomrc/biomrc.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 6605.20it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1451.99it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
    for key, record in utils.tqdm(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 197, in _generate_examples
    biomrc = json.load(fp)
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)

----------------------------------------------------------------------
Ran 1 test in 96.119s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_bigbio_qa
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_bigbio_qa to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_bigbio_qa/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 9931.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1606.19it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
    for key, record in utils.tqdm(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 207, in _generate_examples
    biomrc = json.load(fp)
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)

----------------------------------------------------------------------
Ran 1 test in 77.020s

FAILED (errors=1)

sunnnymskang avatar May 01 '22 21:05 sunnnymskang

Hi @sunnnymskang, I think what happened here is that an earlier download was terminated before it finished, so the file was only partially downloaded and you get the Unterminated string error. The test python -m tests.test_bigbio biodatasets/biomrc/biomrc.py --config_name biomrc_large_A_source gives OK for me.

Could you try wiping your cache and redownloading (I know this is a little painful because of the slow download rate)? If you don't want to wipe your whole cache, you can try adding download_mode=GenerateMode.FORCE_REDOWNLOAD to the load_dataset call on line 126 of test_bigbio.py:

```python
self.dataset = datasets.load_dataset(
    self.PATH,
    name=self.NAME,
    data_dir=self.DATA_DIR,
    download_mode=GenerateMode.FORCE_REDOWNLOAD,
)
```
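Before forcing a slow re-download, it may help to confirm that a cached archive really is truncated. The following is a minimal sketch using only the Python standard library; the helper name `is_complete_gzip` is hypothetical and not part of `datasets` or the test suite. It reads the file to the end and reports whether gzip reached its end-of-stream marker, which is the same condition that raised the EOFError in the log above.

```python
import gzip
import os
import zlib


def is_complete_gzip(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to its end.

    Hypothetical helper for illustration only. A truncated download makes
    gzip raise EOFError ("Compressed file ended before the end-of-stream
    marker was reached"); a corrupted one raises OSError or zlib.error.
    """
    # An empty file decompresses to nothing without raising, so reject it here.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as fh:
            # Stream through the file in chunks so large archives
            # are not loaded into memory at once.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

Pointing this at the downloaded archive in your Hugging Face cache (the exact file name depends on your setup) should return False for a partial download, in which case deleting that file or using the forced re-download above is the way to go.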

nomisto avatar May 02 '22 06:05 nomisto

Proof :grin:

(venv) C:\Users\ottsi\biomedical>python -m tests.test_bigbio biodatasets/biomrc/biomrc.py --config_name biomrc_large_A_source 
INFO:__main__:args: Namespace(config_name='biomrc_large_A_source', data_dir=None, dataloader_path='biodatasets/biomrc/biomrc.py')
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from 'C:\\Users\\ottsi\\biomedical\\biodatasets\\biomrc\\biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to C:\Users\ottsi\.cache\huggingface\datasets\biomrc_dataset\biomrc_large_A_source\1.0.0\8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 351M/351M [27:49<00:00, 210kB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.5M/25.5M [01:25<00:00, 299kB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31.6M/31.6M [02:24<00:00, 219kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [31:52<00:00, 637.39s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00,  2.90s/it]
Dataset biomrc_dataset downloaded and prepared to C:\Users\ottsi\.cache\huggingface\datasets\biomrc_dataset\biomrc_large_A_source\1.0.0\8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 11.77it/s] 
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 2025.048s

OK

nomisto avatar May 02 '22 07:05 nomisto