
Datasets needed for OIE and NER

Open nitinpi0210 opened this issue 2 years ago • 52 comments

Hi Sarhan, for our NLP course project at Berkeley, we are following your paper on Open-CyKG. Just as another user, Malcolm, explained in one of the posts, we also need the datasets you used for the OIE Python notebook. I downloaded the MalwareTextDB database directly from your paper's reference, but it doesn't contain any of the fields required by the downstream code, such as: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.

Can you please give me access to the datasets that are needed to successfully run the OIE notebook? My email is: [email protected].

We are in a time crunch here with course deadlines approaching. So would be grateful if you could give us access to the datasets that you used for the OIE and NER notebooks.

Thanks, Nitin

nitinpi0210 avatar Mar 20 '22 14:03 nitinpi0210

Thanks for giving me access to All_MDB.csv. This file contains the fields: word_id, words, sent_id, label. To run the OIE notebook, I still need the following fields in the dataset: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.

For example, the code you have from the Stanovsky paper for getting sentences from the df needs run_id:

    def get_sents_from_df(df):
        # Split a data frame by rows according to the sentences
        return [df[df.run_id == run_id] for run_id in sorted(set(df.run_id.values))]

And then later on, when you call load_dataset_encodeinputs, it needs the following fields:

    df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
    df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
    df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
    df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')
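A quick way to see which of these fields a CSV is missing before running the notebook is sketched below. This is only an illustration: REQUIRED_COLUMNS and missing_columns are hypothetical names, not part of the Open-CyKG code, and the toy frame just mimics the four fields reported for All_MDB.csv.

```python
import pandas as pd

# Column names listed in this thread as required by the OIE notebook.
REQUIRED_COLUMNS = ["word_id", "word", "pred", "pred_id",
                    "head_pred_id", "sent_id", "run_id", "label"]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df, in listed order."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Toy frame with the four fields reported for All_MDB.csv (values are made up).
toy = pd.DataFrame({
    "word_id": [0, 1],
    "words": ["Malware", "spreads"],
    "sent_id": [0, 0],
    "label": ["O", "O"],
})

missing = missing_columns(toy)
print(missing)
```

Note that the file has a words column while the notebook expects word, so word shows up as missing too.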

Is it possible for you to upload the MalwareDB dataset that you used for the OIE notebook that contains all the fields needed to successfully run the notebook?

nitinpi0210 avatar Mar 20 '22 15:03 nitinpi0210

I just tried executing the OIE notebook using the new ALL_MDB.csv file you uploaded and, as expected, I get the following error because it doesn't contain the run_id field. Can you please upload the malware DB dataset that contains the run_id field? Also, can you please clarify what the run_id field is?

image

nitinpi0210 avatar Mar 20 '22 15:03 nitinpi0210

I copied the entire sent_id column as run_id and got past the run_id issue, but the dataset still doesn't contain all the fields required for the OIE notebook to run correctly. It needs the pred column and is complaining about that:
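The workaround described here amounts to a one-line column copy. The sketch below applies it to a toy frame and then splits it with the get_sents_from_df function quoted earlier in the thread. It assumes each sentence counts as its own run, which may not match how the original dataset assigned run_id; the thread never defines that field.

```python
import pandas as pd

# Toy data: two sentences, identified only by sent_id (values are made up).
toy = pd.DataFrame({
    "word_id": [0, 1, 0],
    "word": ["Malware", "spreads", "Stop"],
    "sent_id": [0, 0, 1],
})

# Workaround from the thread: reuse sent_id as run_id.
# Assumption: one run per sentence, which the original data may not follow.
toy["run_id"] = toy["sent_id"]


def get_sents_from_df(df):
    # Split a data frame by rows according to run_id (as quoted in the thread).
    return [df[df.run_id == run_id] for run_id in sorted(set(df.run_id.values))]


sents = get_sents_from_df(toy)
sent_lengths = [len(s) for s in sents]
print(sent_lengths)
```

This only gets past the missing run_id column; the pred, pred_id and head_pred_id fields still have to come from the dataset itself.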

image

image

nitinpi0210 avatar Mar 20 '22 16:03 nitinpi0210

Hi @nitinpi0210,

I am able to run to ~block 7, my code is here: https://colab.research.google.com/drive/1Kh9gsdG2rcySVo-GV5Xc9mW7rx6-pVuW?usp=sharing

But I'm facing an error with TensorFlow. I put it here for you guys; I would really appreciate it if you are able to solve the TensorFlow issue!

malcolm1232 avatar Mar 22 '22 08:03 malcolm1232

Hi @malcolm1232 can you share your notebook with [email protected] or [email protected]? I can't access it to help you debug :

image

nitinpi0210 avatar Mar 22 '22 12:03 nitinpi0210

Also @malcolm1232, how were you able to run until Block 7 with the malware DB dataset? It doesn't contain all the fields that are needed. Can you please upload the dataset with which you ran OIE until Block 7 and give me access?

Sarhan said she will reply later this week as she is busy with her deadlines this week. So as soon as I get her dataset, I will try again too. But in the meantime if you modified the dataset to get it to run to Block 7, can you share that dataset? ([email protected])

nitinpi0210 avatar Mar 22 '22 12:03 nitinpi0210

Hi @nitinpi0210, I have given access. The dataset used was from the author, under _MSB_all_csv.csv. I was able to run it by manipulating the dataset provided by the author (assuming I did it correctly). Have a good day! Do let me know if you run into any trouble.

malcolm1232 avatar Mar 23 '22 02:03 malcolm1232

Hi @malcolm1232, I am also facing the same issue quoted below. Can you please give access to my email: [email protected].

Regards, Harsh Vardhan Jaiswal

Hi @malcolm1232 can you share your notebook with [email protected] or [email protected]? I can't access it to help you debug :

image

hvjrocks-ds avatar Mar 23 '22 05:03 hvjrocks-ds

@hvjrocks-ds, I have done so already.

@IS5882, I was wondering if you recall which TensorFlow version you were using! Do feel free to let me know the TF version when you are free!

malcolm1232 avatar Mar 23 '22 05:03 malcolm1232

@malcolm1232 thanks for sharing. Btw, for the OIE notebook, we were supposed to use the malwaredb dataset as per the author. Why did you use the MSB dataset? That was supposed to be used for the NER notebook as per the paper.

nitinpi0210 avatar Mar 23 '22 13:03 nitinpi0210

Also, is this the right thing to do? Can you clarify why you are setting head_pred_id to 0 throughout the dataframe?

image

nitinpi0210 avatar Mar 23 '22 13:03 nitinpi0210

I updated the public shared folder with the OIE dataset that includes all fields.

IS5882 avatar Mar 23 '22 20:03 IS5882

@hvjrocks-ds , i have done so already @IS5882 , was wondering if you recalled which tensorflow version you were using! Do feel free to let me know the tf version when u are free!

For the NER ?

IS5882 avatar Mar 23 '22 20:03 IS5882

I am using the following TF and Keras version : image

But running into the following issue in Block 7 image

nitinpi0210 avatar Mar 24 '22 02:03 nitinpi0210

This is for the OIE Notebook. @IS5882 what version of TF and Keras is needed for the OIE?

nitinpi0210 avatar Mar 24 '22 02:03 nitinpi0210

Yes, I've got the same problem as well; we need to find out the right TensorFlow/Keras version.

Update: Drive Folder here: https://drive.google.com/drive/folders/1zbf2bLLknxEHLJkcVKKmGHnwB9LseCID

Also, @nitinpi0210, do note that the spacy_wrapper is a custom spaCy wrapper I created.

Actual code (a library which is not available anymore):

from spacy_wrapper import spacy_whitespace_parser as spacy_ws

Custom code:

I wrote the custom spaCy code from what I could understand of the objective of the initial spacy_ws, which is to "split on whitespace characters":

    def spacy_ws(input_):
        # input_ = str(input_)
        returns_ = input_.split()
        return returns_

Also, @IS5882 so sorry for the trouble, but the spacy_wrapper.py file is empty U.U sorry for the inconvenience!

malcolm1232 avatar Mar 24 '22 02:03 malcolm1232

Also is this the right move to do? Can you clarify why you are setting the head pred id to 0 throughout the dataframe?

image

I did this because of the code:

    assert(len(set(full_sent.head_pred_id.values)) == 1) # Sanity check

Since the sanity check only requires the number of distinct values to be 1, I assumed it can be any integer.
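To make the sanity check concrete, the sketch below reports which runs would fail it, assuming full_sent is one of the per-run_id frames returned by get_sents_from_df. The helper name runs_with_mixed_head_pred is hypothetical and the toy values are made up.

```python
import pandas as pd


def runs_with_mixed_head_pred(df: pd.DataFrame) -> list:
    """Return run_ids whose rows do not carry exactly one distinct head_pred_id."""
    counts = df.groupby("run_id")["head_pred_id"].nunique()
    return sorted(counts[counts != 1].index.tolist())


# Toy data: run 0 is consistent, run 1 mixes two head_pred_id values.
toy = pd.DataFrame({
    "run_id": [0, 0, 1, 1],
    "head_pred_id": [1, 1, 0, 2],
})
bad_runs = runs_with_mixed_head_pred(toy)

# Overwriting the column with a constant makes every run pass the check,
# but it also discards whatever the column originally encoded.
toy_zeroed = toy.assign(head_pred_id=0)
bad_after_zeroing = runs_with_mixed_head_pred(toy_zeroed)

print(bad_runs, bad_after_zeroing)
```

So a constant value does satisfy the assertion, but whether the rest of the notebook reads head_pred_id for anything else is not settled by this check alone.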

malcolm1232 avatar Mar 24 '22 02:03 malcolm1232

@malcolm1232 the author gave the new, correct malware dataset that has the relevant fields, so you don't need to do all those DF modifications anymore. I just used the new dataset and can get to Block 7 with no issues. Now dealing with TensorFlow issues.

nitinpi0210 avatar Mar 24 '22 02:03 nitinpi0210

oh thanks a ton @IS5882 @nitinpi0210 ❤️ ❤️!!!

malcolm1232 avatar Mar 24 '22 02:03 malcolm1232

@hvjrocks-ds , i have done so already @IS5882 , was wondering if you recalled which tensorflow version you were using! Do feel free to let me know the tf version when u are free!

For the NER ?

@IS5882 I am trying to run the OIE notebook, but have encountered the same TensorFlow/Keras error as @nitinpi0210, so I was just wondering what TensorFlow/Keras version you were using! Oh yes, also: the spacy_wrapper.py file is empty U.U

malcolm1232 avatar Mar 24 '22 02:03 malcolm1232

@malcolm1232 @IS5882 The OIE notebook finally works. I didn't need to modify the Google Colab TF or Keras versions; both are running with their default 2.8.0 versions. What did the trick was changing the following 2 lines in Block 7. In the original code they imported from tensorflow.python.keras; remove the python from there:

    from tensorflow.keras.layers import Layer
    from tensorflow.keras import backend as K
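If more cells import from the private path, the same substitution can be applied to the source text in one pass. This is a minimal sketch using only the standard library; use_public_keras is a hypothetical helper, and the "before" lines are reconstructed from the description above rather than copied from the notebook.

```python
import re

# Matches the private module path described in the thread.
_PRIVATE_KERAS = re.compile(r"\btensorflow\.python\.keras\b")


def use_public_keras(source: str) -> str:
    """Rewrite tensorflow.python.keras references to tensorflow.keras."""
    return _PRIVATE_KERAS.sub("tensorflow.keras", source)


# Assumed original Block 7 imports (reconstructed, not taken from the notebook).
old_block = (
    "from tensorflow.python.keras.layers import Layer\n"
    "from tensorflow.python.keras import backend as K\n"
)

new_block = use_public_keras(old_block)
print(new_block)
```

The function only edits text, so it can be run on a cell's source before pasting it back into Colab.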

image

nitinpi0210 avatar Mar 26 '22 00:03 nitinpi0210

The OIE notebook now runs fine to completion. Thanks a lot @IS5882 for giving us the modified dataset. Ran to completion finally!

image

nitinpi0210 avatar Mar 26 '22 00:03 nitinpi0210

OMMMGGGG!!! Okkays I'll give it a try and let u know!!

malcolm1232 avatar Mar 26 '22 06:03 malcolm1232

Hi @nitinpi0210, I was able to run the notebook as well, but is it possible to share yours so I could take a look at it too? Sorry for the inconvenience!

malcolm1232 avatar Mar 28 '22 03:03 malcolm1232

Oh yes, I am wondering if you will be working on the knowledge graph as well? @nitinpi0210 @IS5882, I was wondering if you'd have the data for Knowledge_Graph_Canonicalization.ipynb as well!

malcolm1232 avatar Mar 28 '22 07:03 malcolm1232

hey @malcolm1232 whats your email so that I can share? Btw with just those 2 lines you should be able to get things running. Am only doing the OIE piece for now and will do NER and KG later.

nitinpi0210 avatar Mar 30 '22 19:03 nitinpi0210

Hi @nitinpi0210! My email is [email protected]. Yes, I got them running already! But I would like to see your train/test split etc. I'm still working on the KG, which is much tougher to get working without the corresponding datasets xDD

malcolm1232 avatar Mar 31 '22 00:03 malcolm1232

@IS5882 @nitinpi0210 @hvjrocks-ds Help me, please!

image

qlM0ri4rty avatar May 16 '22 03:05 qlM0ri4rty

@qlM0ri4rty the Google verification code is based on your Google credentials. When you run the cell, you should get a popup asking you to enter your Google username and password. Make sure you enable popups in your browser so that it doesn't get blocked.

nitinpi0210 avatar May 19 '22 22:05 nitinpi0210

@nitinpi0210 Thanks, I just ran the notebook successfully, but I don't know why the output looks like this. I mean, shouldn't this be the NER's result?

image

qlM0ri4rty avatar May 20 '22 14:05 qlM0ri4rty