Proposal to add BioMRC
- Name: BioMRC
- Description: BioMRC: A Dataset for Biomedical Machine Reading Comprehension
- Task: cloze-style MRC / multiple choice QA
- Paper: https://arxiv.org/abs/2005.06376
- Data: https://archive.org/details/biomrc_dataset
- License: Unknown
- Motivation: Interesting large-scale dataset with 6 subsets (three sizes: large, small, tiny; two settings: A and B). The large variant contains over 700,000 examples.
- Hugging Face dataset: https://huggingface.co/datasets/biomrc
#self-assign
@nomisto @hakunanatasha Unit tests are failing for biomrc_large_A_source and biomrc_large_A_bigbio_qa. The log is provided below:
INFO:__main__:args: Namespace(dataloader_path='biodatasets/biomrc/biomrc.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 351M/351M [42:38<00:00, 137kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.5M/25.5M [02:39<00:00, 160kB/s]
Downloading: 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 29.9M/31.6M [05:48<00:18, 86.0kB/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [51:22<00:00, 1027.51s/it]███▎ | 29.9M/31.6M [05:48<00:10, 157kB/s]
67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 2/3 [00:07<00:03, 3.37s/it] 67%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 2/3 [00:08<00:04, 4.33s/it]
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 661, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 177, in _split_generators
downloaded_files = dl_manager.download_and_extract(_URLS[version][setting])
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 307, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 283, in extract
extracted_paths = map_nested(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 260, in map_nested
mapped = [
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 261, in <listcomp>
_single_map_nested((function, obj, types, None, True))
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 196, in _single_map_nested
return function(data_struct)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 325, in cached_path
output_path = ExtractManager(cache_dir=download_config.cache_dir).extract(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 40, in extract
self.extractor.extract(input_path, output_path, extractor=extractor)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 179, in extract
return extractor.extract(input_path, output_path)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/extract.py", line 72, in extract
shutil.copyfileobj(gzip_file, extracted_file)
File "/opt/anaconda3/lib/python3.9/shutil.py", line 205, in copyfileobj
buf = fsrc_read(length)
File "/opt/anaconda3/lib/python3.9/gzip.py", line 300, in read
return self._buffer.read(size)
File "/opt/anaconda3/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/opt/anaconda3/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
----------------------------------------------------------------------
Ran 1 test in 3092.383s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_bigbio_qa
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_bigbio_qa to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_bigbio_qa/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8338.58it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1571.49it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
for key, record in utils.tqdm(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 207, in _generate_examples
biomrc = json.load(fp)
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)
----------------------------------------------------------------------
Ran 1 test in 56.702s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_small_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_small_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_small_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_small_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
0%| | 0/3 [00:00<?, ?it/s]
^CTraceback (most recent call last):
File "/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 721, in <module>
unittest.TextTestRunner().run(TestDataLoader())
File "/opt/anaconda3/lib/python3.9/unittest/runner.py", line 176, in run
test(result)
File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 651, in __call__
return self.run(*args, **kwds)
File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 592, in run
self._callTestMethod(testMethod)
File "/opt/anaconda3/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
method()
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 661, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 177, in _split_generators
downloaded_files = dl_manager.download_and_extract(_URLS[version][setting])
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 307, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 195, in download
downloaded_path_or_paths = map_nested(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 260, in map_nested
mapped = [
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 261, in <listcomp>
_single_map_nested((function, obj, types, None, True))
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 196, in _single_map_nested
return function(data_struct)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 216, in _download
return cached_path(url_or_filename, download_config=download_config)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 298, in cached_path
output_path = get_from_cache(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 659, in get_from_cache
http_get(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 464, in http_get
for chunk in response.iter_content(chunk_size=1024):
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/requests/models.py", line 760, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/urllib3/response.py", line 579, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/urllib3/response.py", line 522, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/opt/anaconda3/lib/python3.9/http/client.py", line 462, in read
n = self.readinto(b)
File "/opt/anaconda3/lib/python3.9/http/client.py", line 506, in readinto
n = self.fp.readinto(b)
File "/opt/anaconda3/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/opt/anaconda3/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/opt/anaconda3/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
KeyboardInterrupt
Downloading: 13%|███████████████████▎ | 7.95M/60.2M [02:05<13:44, 63.3kB/s]
0%| | 0/3 [02:09<?, ?it/s]
(bigbioenv) (base) ~/repo/bigscience/biomedical (master*) » python3 -m tests.test_bigbio biodatasets/biomrc/biomrc.py
INFO:__main__:args: Namespace(dataloader_path='biodatasets/biomrc/biomrc.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_source/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 6605.20it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1451.99it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
for key, record in utils.tqdm(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 197, in _generate_examples
biomrc = json.load(fp)
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)
----------------------------------------------------------------------
Ran 1 test in 96.119s
FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from '/Users/skang/repo/bigscience/biomedical/biodatasets/biomrc/biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_bigbio_qa
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_bigbio_qa to /Users/skang/.cache/huggingface/datasets/biomrc_dataset/biomrc_large_A_bigbio_qa/1.0.0/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 9931.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1606.19it/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
self.dataset = datasets.load_dataset(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
builder_instance.download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
self._download_and_prepare(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 683, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 1073, in _prepare_split
for key, record in utils.tqdm(
File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/Users/skang/.cache/huggingface/modules/datasets_modules/datasets/biomrc/8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb/biomrc.py", line 207, in _generate_examples
biomrc = json.load(fp)
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/anaconda3/lib/python3.9/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 137232381 (char 137232380)
----------------------------------------------------------------------
Ran 1 test in 77.020s
FAILED (errors=1)
Hi @sunnnymskang, I think what happened here is that an earlier download was terminated before it finished, so the cached file is only partially downloaded and you get the `Unterminated string` error. The test `python -m tests.test_bigbio biodatasets/biomrc/biomrc.py --config_name biomrc_large_A_source` gives OK for me.
Could you try wiping your cache and redownloading (I know this is a little painful because of the slow download rate)?
If you don't want to wipe your whole cache, you can try adding `download_mode=GenerateMode.FORCE_REDOWNLOAD` to the `load_dataset` call at line 126 of `test_bigbio.py`:
```python
self.dataset = datasets.load_dataset(
    self.PATH,
    name=self.NAME,
    data_dir=self.DATA_DIR,
    download_mode=GenerateMode.FORCE_REDOWNLOAD,
)
```
Proof :grin:
(venv) C:\Users\ottsi\biomedical>python -m tests.test_bigbio biodatasets/biomrc/biomrc.py --config_name biomrc_large_A_source
INFO:__main__:args: Namespace(config_name='biomrc_large_A_source', data_dir=None, dataloader_path='biodatasets/biomrc/biomrc.py')
INFO:__main__:all_config_names: ['biomrc_large_A_source', 'biomrc_large_A_bigbio_qa', 'biomrc_small_A_source', 'biomrc_small_A_bigbio_qa', 'biomrc_tiny_A_source', 'biomrc_tiny_A_bigbio_qa', 'biomrc_large_B_source', 'biomrc_large_B_bigbio_qa', 'biomrc_small_B_source', 'biomrc_small_B_bigbio_qa', 'biomrc_tiny_B_source', 'biomrc_tiny_B_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/biomrc/biomrc.py
INFO:__main__:self.NAME: biomrc_large_A_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.biomrc.biomrc' from 'C:\\Users\\ottsi\\biomedical\\biodatasets\\biomrc\\biomrc.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name biomrc_large_A_source
Downloading and preparing dataset biomrc_dataset/biomrc_large_A_source to C:\Users\ottsi\.cache\huggingface\datasets\biomrc_dataset\biomrc_large_A_source\1.0.0\8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb...
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 351M/351M [27:49<00:00, 210kB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25.5M/25.5M [01:25<00:00, 299kB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31.6M/31.6M [02:24<00:00, 219kB/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [31:52<00:00, 637.39s/it]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00, 2.90s/it]
Dataset biomrc_dataset downloaded and prepared to C:\Users\ottsi\.cache\huggingface\datasets\biomrc_dataset\biomrc_large_A_source\1.0.0\8102ce67a68faf198693600a49d5fe703fd24437abded1560b11730edab9cddb. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 11.77it/s]
INFO:__main__:schema = source
.
----------------------------------------------------------------------
Ran 1 test in 2025.048s
OK
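Following up on the partial-download diagnosis above: a minimal sketch of how one might look for truncated gzip archives before deciding what to delete, rather than wiping the whole cache. It uses only the standard library. The helper names (`is_gzip`, `is_complete_gzip`, `find_truncated_gzips`) are hypothetical and not part of `datasets`, and the `downloads` directory below is an assumption based on the cache paths in the logs, so adjust it to your setup.

```python
import gzip
import zlib
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def is_complete_gzip(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file in chunks and report whether it succeeded.

    A truncated archive raises EOFError ("Compressed file ended before the
    end-of-stream marker was reached"), the same error as in the traceback.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


def find_truncated_gzips(directory: Path) -> list[Path]:
    """List gzip files directly inside `directory` that fail to decompress."""
    return [
        path
        for path in sorted(directory.iterdir())
        if path.is_file() and is_gzip(path) and not is_complete_gzip(path)
    ]


if __name__ == "__main__":
    # Assumed location of the raw downloads; change it if your cache is elsewhere.
    cache_dir = Path.home() / ".cache" / "huggingface" / "datasets" / "downloads"
    if cache_dir.is_dir():
        for broken in find_truncated_gzips(cache_dir):
            print(f"truncated: {broken}")
```

The sketch only reports candidates and deletes nothing. A partially extracted copy of the JSON may also be cached, so removing a flagged archive might not be enough on its own; `FORCE_REDOWNLOAD` as suggested above remains the simpler route.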