Open-CyKG
Datasets needed for OIE and NER
Hi Sarhan, for our NLP course project at Berkeley we are following your Open-CyKG paper. As another user, Malcolm, explained in one of the posts, we also need the datasets you used for the OIE Python notebook. I downloaded the MalwareTextDB dataset directly from your paper's reference, but it doesn't contain any of the fields required by the downstream code, such as: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.
Can you please give me access to the datasets needed to successfully run the OIE notebook? My email is: [email protected].
We are in a time crunch with course deadlines approaching, so we would be grateful if you could give us access to the datasets you used for the OIE and NER notebooks.
Thanks, Nitin
Thanks for giving me access to All_MDB.csv. This file contains the fields: word_id, words, sent_id, label. To run the OIE notebook, I still need the following fields in the dataset: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.
For example, the code you have from the Stanovsky paper for getting sentences from the df needs run_id:

    def get_sents_from_df(df):
        # Split a data frame by rows according to the sentences
        return [df[df.run_id == run_id] for run_id in sorted(set(df.run_id.values))]

And later on, when you call load_dataset_encodeinputs, it needs the following fields:

    df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
    df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
    df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
    df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')
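A quick way to see which of these fields a CSV is missing, before running the notebook, is to compare its header against the required list. This is only a sketch: the column names are the ones quoted in this thread, `missing_columns` is a hypothetical helper, and the sample header is made up to mimic what All_MDB.csv is described as containing.

```python
import csv
import io

# Columns the OIE notebook reads, as listed in this thread.
REQUIRED_COLUMNS = [
    "word_id", "word", "pred", "pred_id",
    "head_pred_id", "sent_id", "run_id", "label",
]

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical sample mimicking the header described for All_MDB.csv.
sample = io.StringIO("word_id,words,sent_id,label\n0,The,0,O\n")
print(missing_columns(sample))
```

For a real file, the same function can be called on `open("All_MDB.csv", newline="")`, assuming the file sits in the working directory.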
Is it possible for you to upload the MalwareDB dataset that you used for the OIE notebook that contains all the fields needed to successfully run the notebook?
I just tried executing the OIE notebook using the new ALL_MDB.csv file you uploaded and, as expected, I get an error because it doesn't contain the run_id field. Can you please upload the MalwareDB dataset that contains the run_id field? Also, can you please clarify what the run_id field is?
I copied the entire sent_id column as run_id and got past the run_id issue, but the dataset still doesn't contain all the fields required for the OIE notebook to run correctly. It needs the pred column and complains about that.
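The stopgap described here can be sketched in pandas as below. Treating run_id as a copy of sent_id is only this thread's workaround, not the author's definition of the field, and the toy frame is hypothetical.

```python
import pandas as pd

def get_sents_from_df(df):
    # Split a data frame by rows according to the sentences (as in the notebook).
    return [df[df.run_id == run_id] for run_id in sorted(set(df.run_id.values))]

# Hypothetical two-sentence frame with the columns All_MDB.csv is said to have.
df = pd.DataFrame({
    "word_id": [0, 1, 0, 1, 2],
    "words": ["Malware", "spreads", "It", "encrypts", "files"],
    "sent_id": [0, 0, 1, 1, 1],
    "label": ["O", "O", "O", "O", "O"],
})

# Stopgap: reuse sent_id as run_id so the grouping code has a key to split on.
df["run_id"] = df["sent_id"]

sents = get_sents_from_df(df)
print([len(s) for s in sents])  # one frame per distinct run_id

# The other fields (pred, pred_id, head_pred_id) are still absent.
print("pred" in df.columns)
```

This only unblocks the grouping step; anything that reads pred, pred_id, or head_pred_id will still fail.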
Hi @nitinpi0210,
I am able to run up to about block 7; my code is here: https://colab.research.google.com/drive/1Kh9gsdG2rcySVo-GV5Xc9mW7rx6-pVuW?usp=sharing
But I'm facing an error with TensorFlow. I've put it here for you guys, and I would really appreciate it if you are able to solve the TensorFlow issue!
Hi @malcolm1232, can you share your notebook with [email protected] or [email protected]? I can't access it to help you debug.
Also @malcolm1232, how were you able to run until Block 7 with the MalwareDB dataset? It doesn't contain all the fields that are needed. Can you please upload the dataset with which you ran OIE until Block 7 and give me access?
Sarhan said she will reply later this week, as she is busy with her deadlines. As soon as I get her dataset, I will try again too. In the meantime, if you modified the dataset to get it to run to Block 7, can you share that dataset? ([email protected])
Hi @nitinpi0210, I have given access. The dataset used was from the author, under _MSB_all_csv.csv. I was able to run it by manipulating the dataset provided by the author (assuming I did it correctly). Have a good day! Do let me know if you run into any trouble.
Hi @malcolm1232, I am also facing the same issue below. Can you please give access to my email: [email protected].
Regards, Harsh Vardhan Jaiswal
Hi @malcolm1232, can you share your notebook with [email protected] or [email protected]? I can't access it to help you debug.
@hvjrocks-ds, I have done so already. @IS5882, I was wondering if you recall which TensorFlow version you were using! Do feel free to let me know the TF version when you are free!
@malcolm1232 thanks for sharing. Btw, for the OIE notebook, we were supposed to use the malwaredb dataset as per the author. Why did you use the MSB dataset? That was supposed to be used for the NER notebook as per the paper.
Also is this the right move to do? Can you clarify why you are setting the head pred id to 0 throughout the dataframe?
I updated the public shared folder with the OIE dataset that includes all fields.
@hvjrocks-ds, I have done so already. @IS5882, I was wondering if you recall which TensorFlow version you were using! Do feel free to let me know the TF version when you are free!
For the NER ?
I am using the following TF and Keras version :
But running into the following issue in Block 7
This is for the OIE Notebook. @IS5882 what version of TF and Keras is needed for the OIE?
Yes, I've got the same problem as well; I need to try to obtain the right TensorFlow/Keras version.
Update: Drive Folder here: https://drive.google.com/drive/folders/1zbf2bLLknxEHLJkcVKKmGHnwB9LseCID
Also, @nitinpi0210, do note that spacy_wrapper is a custom spaCy wrapper I created.
Actual code (the library is not available anymore):

    from spacy_wrapper import spacy_whitespace_parser as spacy_ws

Custom code: I wrote the custom spaCy code from what I could understand of the objective of the initial spacy_ws, which is to "split on whitespace characters":

    def spacy_ws(input_):
        # input_ = str(input_)
        returns_ = input_.split()
        return returns_

Also, @IS5882, so sorry for the trouble, but the spacy_wrapper.py file is empty U.U Sorry for the inconvenience!
Also is this the right move to do? Can you clarify why you are setting the head pred id to 0 throughout the dataframe?
I did this because of the code:

    assert(len(set(full_sent.head_pred_id.values)) == 1)  # Sanity check

Since the sanity check only requires that the number of distinct values equals 1, I assumed it can be any integer.
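The check quoted above only verifies that each sentence frame carries a single head_pred_id; it says nothing about which value that is. The sketch below illustrates the difference, assuming head_pred_id is meant to index the predicate's head word within the sentence (an assumption about the Stanovsky-style format, not something confirmed in this thread). The frame and the helper `has_single_head_pred` are hypothetical.

```python
import pandas as pd

def has_single_head_pred(full_sent):
    """Mirror of the notebook's sanity check, returned as a bool."""
    return len(set(full_sent.head_pred_id.values)) == 1

# Hypothetical sentence in which every row points at word index 1.
real = pd.DataFrame({
    "word": ["Trojan", "steals", "passwords"],
    "head_pred_id": [1, 1, 1],
})

# The same sentence with head_pred_id forced to 0, as in the workaround.
zeroed = real.assign(head_pred_id=0)

# Both frames pass the check.
print(has_single_head_pred(real), has_single_head_pred(zeroed))

# But they point at different words as the predicate head.
real_head = real.word.iloc[int(real.head_pred_id.iloc[0])]
zeroed_head = zeroed.word.iloc[int(zeroed.head_pred_id.iloc[0])]
print(real_head, zeroed_head)
```

So a constant 0 gets past the assertion, but under the stated assumption it would mark the first word of every sentence as the predicate, which is probably not what the model should train on.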
@malcolm1232, the author provided the new, correct malware dataset that has the relevant fields, so you don't need to do all those DF modifications anymore. I just used the new dataset and can get to Block 7 with no issues. Now dealing with TensorFlow issues.
oh thanks a ton @IS5882 @nitinpi0210 ❤️ ❤️!!!
@IS5882, I am trying to run the OIE notebook but have encountered the same TensorFlow/Keras error as @nitinpi0210, so I was just wondering what TensorFlow/Keras version you were using! Oh yes, also: the spacy_wrapper.py file is empty U.U
@malcolm1232 @IS5882 The OIE notebook finally works. I didn't need to modify the Google Colab TF or Keras versions; both are running with their default 2.8.0 versions. What did the trick was changing the following 2 lines in Block 7. The original code imported from tensorflow.python.keras; remove the python from there:

    from tensorflow.keras.layers import Layer
    from tensorflow.keras import backend as K
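If the notebook source contains several such imports, the same change can be applied mechanically. This is a sketch only: `fix_keras_imports` is a hypothetical helper, the two "original" lines are an assumed reconstruction of what Block 7 contained, and rewriting every tensorflow.python.keras reference may not be right for modules that exist only under the private path.

```python
import re

# Matches the private module path at an identifier boundary.
_PRIVATE_KERAS = re.compile(r"\btensorflow\.python\.keras\b")

def fix_keras_imports(source):
    """Rewrite tensorflow.python.keras references to the public tensorflow.keras path."""
    return _PRIVATE_KERAS.sub("tensorflow.keras", source)

# Assumed form of the original Block 7 imports.
original = (
    "from tensorflow.python.keras.layers import Layer\n"
    "from tensorflow.python.keras import backend as K\n"
)
print(fix_keras_imports(original))
```

Lines that already use tensorflow.keras are left unchanged, so the function can be run over a cell more than once.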
The OIE notebook now runs fine to completion. Thanks a lot @IS5882 for giving us the modified dataset.
Ran to completion finally!
OMMMGGGG!!! Okkays I'll give it a try and let u know!!
Hi @nitinpi0210, I was able to run the notebook as well, but is it possible to share yours so I could take a look at it? Sorry for the inconvenience!
Oh yes, I am wondering if you will be working on the knowledge graph as well? @nitinpi0210 @IS5882, I was wondering if you'd have the data for Knowledge_Graph_Canonicalization.ipynb as well!
Hey @malcolm1232, what's your email so that I can share? Btw, with just those 2 lines you should be able to get things running. I am only doing the OIE piece for now and will do NER and KG later.
Hi @nitinpi0210! My email is [email protected]. Yes, I got them running already! But I would like to see your train/test split etc. I'm still working on the KG, which is much tougher to get working without the corresponding datasets xDD
@IS5882 @nitinpi0210 @hvjrocks-ds
Help me, please!
@qlM0ri4rty, the Google verification code is based on your Google credentials. When you run the cell, you should get a popup asking you to enter your Google username and password. Make sure you enable popups in your browser so that it doesn't get blocked.
@nitinpi0210 Thanks, I just ran the notebook successfully, but I don't know why the output looks like this. I mean, shouldn't this be the NER result?