MalGraph
fcg cfg
Hello author, how can I process the FCG and CFG into JSONL?
Thanks for your attention. Based on the implementation of Genius (see https://github.com/qian-feng/Gencoding for details), we disassemble all PE samples in the dataset with IDA Pro 6.4, then generate their FCGs and CFGs accordingly, and finally store them in the JSONL file format.
Thanks a lot, can both FCGs and CFGs be handled by Genius (https://github.com/qian-feng/Gencoding)?
Not yet. But I recall that Genius is used to extract CFGs, and it is easy to generate FCGs based on the Genius framework.
thanks, got it
Hello author, I have read your paper and also tried to use Genius (https://github.com/qian-feng/Gencoding) to get the CFGs from PE samples. I ran the code in the preprocessing_ida.py of Genius and got the output (XXX.ida) like this: (i__main__ raw_graphs p1 (dp2 S'raw_graph_list' p3 (lp4 (iraw_graphs raw_graph p5 (dp6 S'entry' p7 I0 sS'fun_features' ...
So how can I get the CFG in JSONL file format? I would appreciate it if you could give more details about how to use Genius to generate CFGs in JSONL file format.
Any solution? I have run into the same problem. There is also a file called train_external_function_name_vocab.jsonl that is needed before model training, and I have no idea how to generate it either.
I don't know how to generate the file train_external_function_name_vocab.jsonl either. Do you have a solution?
Reply to @KennenH and @Divine-sh:
As we described in Section IV.A.2):
For each node representing the external function in FCG, it is one-hot encoded based on its function name and we limit the vocabulary size of external functions to 10,000 that are most frequently used in the training dataset.
The file train_external_function_name_vocab.jsonl stores the top 10,000 external function names in the training dataset.
Can you give me a way to obtain the top 10,000 you mentioned?
@20521862
I think it is quite clear from what the paper says:
it is one-hot encoded based on its function name and we limit the vocabulary size of external functions to 10,000 that are most frequently used in the training dataset
Count how many times each external function is called, then sort by that count.
So does that mean I have to parse all the PE files, then collect the function names that have been called 10,000 times in the training data and save them in train_external_function_name_vocab.jsonl?
@20521862
10,000 (external functions) that are most frequently used
You do not save the functions that were called 10,000 times. You take the 10,000 functions that were called most often.
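To make the "count and sort" step concrete, here is a minimal sketch, not the authors' script. It assumes you have already extracted, for every training sample, a list of the external function names it calls. The names build_vocab and write_vocab_jsonl and the {"index", "name"} record layout are hypothetical; check the layout the MalGraph training code actually expects before using it.

```python
import json
from collections import Counter


def build_vocab(samples, size=10000):
    """Return the `size` most frequent external function names.

    `samples` is an iterable of lists, one list of external function
    names per training sample (assumed input format).
    """
    counts = Counter()
    for names in samples:
        # Every occurrence of a name in a sample's list adds one to its count.
        counts.update(names)
    # most_common() sorts by descending count and keeps the first `size`.
    return [name for name, _ in counts.most_common(size)]


def write_vocab_jsonl(vocab, path):
    """Write one JSON object per line (hypothetical record layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for index, name in enumerate(vocab):
            f.write(json.dumps({"index": index, "name": name}) + "\n")
```

Calling build_vocab on the training samples and passing the result to write_vocab_jsonl yields a file with at most 10,000 lines, most frequent name first.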
@ryderling Thank you very much for your reply. I have another question, though, the one raised earlier by @lizhangtan in this issue (the sixth post of this issue).
I also used Genius (https://github.com/qian-feng/Gencoding) to process the assembly file of a PE sample and obtained its output .ida file. How can I obtain the CFG in JSONL file format from this .ida file? Could you please provide more details? Any help would be greatly appreciated!
@lizhangtan I've figured it out: the .ida file is actually data saved with pickle. Reload it with pickle and you get a readable object.
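For anyone landing here later, a rough sketch of that approach follows. It assumes Python 3 reading a pickle that Genius wrote under Python 2 (hence encoding="latin1"), and that the classes the pickle refers to can be imported at load time, for example by running from the Genius directory so its raw_graphs module is found. The dump above suggests the top-level class was pickled from `__main__`, so it may also need to be visible in the loading script itself (e.g. via `from raw_graphs import *`); that part is an assumption. Only the attribute names visible in the dump (raw_graph_list, entry, fun_features) are used. The function names and the JSON layout are hypothetical and are not MalGraph's actual JSONL schema. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import json
import pickle


def load_ida_pickle(path):
    """Load a pickle file such as the .ida output described above."""
    with open(path, "rb") as f:
        # encoding="latin1" lets Python 3 decode str objects
        # that were pickled by Python 2.
        return pickle.load(f, encoding="latin1")


def graphs_to_jsonl_lines(container):
    """Yield one JSON string per function in the loaded container."""
    for fn in container.raw_graph_list:
        record = {
            "entry": fn.entry,
            "fun_features": fn.fun_features,
        }
        # default=str falls back to str() for values json cannot serialise.
        yield json.dumps(record, default=str)
```

Writing each yielded string followed by a newline to a file gives a JSONL file with one function per line. Mapping these records onto the fields MalGraph expects is a separate step that this sketch does not cover.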