
Dataset Help

Open · KeyLKey opened this issue 2 years ago · 7 comments

It's a great honor to see your masterpiece, but now I'm facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

KeyLKey avatar Nov 01 '23 13:11 KeyLKey

+1 on seeing the dataset and better instructions. I got error messages with everything I tried.

igorcouto avatar Nov 02 '23 23:11 igorcouto

> It's a great honor to see your masterpiece, but now I'm facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

Dear @KeyLKey, thanks for the comment, I will add more information on how to create the data! Unfortunately, due to the LICENSE of the UMLS datasets, we might not be able to share it; however, we can provide details on how to create it yourself.

HamedBabaei avatar Nov 06 '23 08:11 HamedBabaei

> +1 on seeing the dataset and better instructions. I got error messages with everything I tried.

Dear @igorcouto, thanks for the comment. Could you share the error message with me so I can check what the issue might be and fix it? Thanks.

HamedBabaei avatar Nov 06 '23 08:11 HamedBabaei

> It's a great honor to see your masterpiece, but now I'm facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

> Dear @KeyLKey, thanks for the comment, I will add more information on how to create the data! Unfortunately, due to the LICENSE of the UMLS datasets, we might not be able to share it; however, we can provide details on how to create it yourself.

Dear author, could you tell me which data file to download? Is it the UMLS Metathesaurus Full Subset or the UMLS Semantic Network files? The former decompresses to 27.1 GB. I am very eager to build a dataset like yours. Thank you very much!

KeyLKey avatar Nov 14 '23 14:11 KeyLKey

> It's a great honor to see your masterpiece, but now I'm facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

> Dear @KeyLKey, thanks for the comment, I will add more information on how to create the data! Unfortunately, due to the LICENSE of the UMLS datasets, we might not be able to share it; however, we can provide details on how to create it yourself.

> Dear author, could you tell me which data file to download? Is it the UMLS Metathesaurus Full Subset or the UMLS Semantic Network files? The former decompresses to 27.1 GB. I am very eager to build a dataset like yours. Thank you very much!

Hi @KeyLKey, for UMLS you need to download the umls-2022AB-metathesaurus-full.zip file and follow the instructions for creating the datasets for Tasks A, B, and C using the notebook TaskA/notebooks/umls-dataset-preprations_for_TaskABC.ipynb (available in the repository).

You will build datasets for MEDCIN, NCI, and SNOMEDCT_US.

In more detail, for Task A you only need to run the last part of TaskA/build_entity_dataset.py, which is as follows:

    # final section of TaskA/build_entity_dataset.py (imports are at the top of that script)
    config = BaseConfig(version=3).get_args(kb_name="umls")
    umls_builder = dataset_builder(config=config)
    dataset_json, dataset_stats = umls_builder.build()
    for kb in list(dataset_json.keys()):
        DataWriter.write_json(data=dataset_json[kb],
                              path=BaseConfig(version=3).get_args(kb_name=kb.lower()).entity_path)
        DataWriter.write_json(data=dataset_stats[kb],
                              path=BaseConfig(version=3).get_args(kb_name=kb.lower()).dataset_stats)

You need to look at TaskA/configuration/config.py to make sure the right paths are passed in when creating the dataset.
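If you want to verify the configured paths before a long build, a small stand-alone check can help. This is only a sketch: `check_dataset_path` is a hypothetical helper, and the example path is illustrative — substitute the actual values from TaskA/configuration/config.py.

```python
import os

def check_dataset_path(path: str) -> str:
    """Classify a configured dataset path before running the builder."""
    if os.path.isfile(path):
        return "file exists"
    if os.path.isdir(os.path.dirname(path) or "."):
        return "parent dir exists (file will be created)"
    return "parent dir missing -- fix the config or create the directory"

# Example path only -- use the values from TaskA/configuration/config.py
for p in ["../datasets/TaskA/NCI/nci_entities.json"]:
    print(p, "->", check_dataset_path(p))
```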

For Task B you need to run the following scripts (please also check those scripts and adapt them to use only UMLS):

1. build_hierarchy.py
2. build_datasets.py
3. train_test_split.py

And for Task C please run only the following scripts:

1. build_datasets.py
2. train_test_split.py
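The run order for Tasks B and C above can be sketched as a small driver. The `run_pipeline` helper below is hypothetical (not part of the repository), and it assumes each script is runnable from its task directory:

```python
import subprocess
import sys

def run_pipeline(scripts, python=sys.executable):
    """Run each preparation script in order, stopping on the first failure."""
    for script in scripts:
        subprocess.run([python, script], check=True)

# Task B (run from the Task B directory):
#   run_pipeline(["build_hierarchy.py", "build_datasets.py", "train_test_split.py"])
# Task C (run from the Task C directory):
#   run_pipeline(["build_datasets.py", "train_test_split.py"])
```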

I hope this helps. Good luck,

HamedBabaei avatar Nov 15 '23 13:11 HamedBabaei

Dear author, I'm trying to build nci_entities.json following your method, but found that UMLS_entity_types_with_levels.tsv is missing. May I ask what went wrong? Thank you very much!

tage384 avatar Nov 23 '23 02:11 tage384

Could you help check whether I missed something when trying your package on WordNet?

    $ python build_entity_datasets.py --kb_name wn18rr
    Traceback (most recent call last):
      File "/home/vagrant/rbox/LLMs4OL/TaskA/build_entity_datasets.py", line 11, in <module>
        dataset_json, dataset_stats = wn_builder.build()
      File "/home/vagrant/rbox/LLMs4OL/TaskA/src/entity_dataset_builder.py", line 16, in build
        self.load_artifcats()
      File "/home/vagrant/rbox/LLMs4OL/TaskA/src/entity_dataset_builder.py", line 42, in load_artifcats
        train, valid, test = self.loader.load_df(self.config.processed_entity_train), \
      File "/home/vagrant/rbox/LLMs4OL/TaskA/datahandler/datareader.py", line 51, in load_df
        data_frame = pd.read_csv(path)
      File "/home/vagrant/rbox/LLMs4OL/llm4ol_py39_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
        return _read(filepath_or_buffer, kwds)
      File "/home/vagrant/rbox/LLMs4OL/llm4ol_py39_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 620, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/home/vagrant/rbox/LLMs4OL/llm4ol_py39_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
        self._engine = self._make_engine(f, self.engine)
      File "/home/vagrant/rbox/LLMs4OL/llm4ol_py39_env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
        self.handles = get_handle(
      File "/home/vagrant/rbox/LLMs4OL/llm4ol_py39_env/lib/python3.9/site-packages/pandas/io/common.py", line 873, in get_handle
        handle = open(
    FileNotFoundError: [Errno 2] No such file or directory: '../datasets/TaskA/WN18RR/processed-3/entity_train.csv'
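(For reference, the trace boils down to pandas raising FileNotFoundError because the processed split files have not been generated yet. A minimal pre-check sketch, with the base path taken from the traceback above; `find_missing_splits` and the valid/test filenames are assumptions, not part of the repository:)

```python
import os

def find_missing_splits(base_dir, filenames):
    """Return the split files that do not exist under base_dir."""
    return [f for f in filenames if not os.path.isfile(os.path.join(base_dir, f))]

# Base path from the traceback; valid/test filenames are guesses
base = "../datasets/TaskA/WN18RR/processed-3"
splits = ["entity_train.csv", "entity_valid.csv", "entity_test.csv"]
print("missing:", find_missing_splits(base, splits))
```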

qiongcheng5 avatar Dec 28 '24 04:12 qiongcheng5