biomedical
Closes #169
Checkbox
- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [x] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
- [x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
This PR closes #169
Thank you for checking out my comments @giyaseddin ! I am trying to inspect the dataset with:
from datasets import load_dataset
ds = load_dataset("biodatasets/medquad/medquad.py", "medquad_source")
But I get this error:
142 raise NotImplementedError("Only `source` and `bigbio_qa` schemas are implemented.")
144 return datasets.DatasetInfo(
145 description=_DESCRIPTION,
146 features=features,
(...)
149 citation=_CITATION,
150 )
--> 152 def _load_qa_from_xml(self, file_paths) -> List[dict[str, str | None]]:
153 """
154 This method traverses the whole list of the downloaded XML files and extracts Q&A pairs.
155 Returns the extracted Q&As and the base directory of the dumped json file that contains them all.
156 """
157 assert len(file_paths)
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
Could you please make sure we can load both `source` and `bigbio` w/o errors? Thank you!
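For reference, this `TypeError` is what Python versions older than 3.10 raise when they evaluate a `str | None` annotation, since the `X | Y` union syntax for types was only added in 3.10. A minimal sketch of a backward-compatible signature follows, assuming the loader has to run on such an interpreter. The function here is a hypothetical stand-in for `_load_qa_from_xml`: its body and the `source_file` and `answer` keys are placeholders, not the PR's actual code.

```python
from typing import Dict, List, Optional


def load_qa_from_xml(file_paths: List[str]) -> List[Dict[str, Optional[str]]]:
    """Hypothetical stand-in for the loader's `_load_qa_from_xml` method.

    Only the annotations matter here: `Optional[str]` replaces `str | None`
    and `Dict[...]` replaces `dict[...]`, so the signature can be evaluated
    on Python versions older than 3.10.
    """
    assert len(file_paths)
    # Placeholder records (assumption): one dict per file, with a value
    # that may be missing (None). The real XML parsing is omitted.
    return [{"source_file": path, "answer": None} for path in file_paths]
```

Another option is to put `from __future__ import annotations` at the top of the module, which defers evaluation of annotations so the original `str | None` spelling is not evaluated when the method is defined.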
Hey @giyaseddin! Do you plan to keep working on this?
Hey @regel-corpus, I will push my last modifications ASAP.
Could you please check whether the current version downloads correctly, @sg-wbi?
Hi @giyaseddin, I pulled the latest code, and it seems like this error still occurs upon loading. Could you check again if you have fixed it in your updates?
Hi @giyaseddin, thanks for putting in the effort to continue working on this dataset. Would it be possible to pull the up-to-date master into your branch? There are some inconsistencies between your branch and master, which block running the unit tests. Thanks!