fcg cfg

Open lqc09 opened this issue 2 years ago • 14 comments

Hello author, how do I process the FCG and CFG into JSONL?

lqc09 avatar Apr 10 '22 09:04 lqc09

Thanks for your attention. Based on the implementation of Genius (see https://github.com/qian-feng/Gencoding for details), we disassemble all PE samples in the dataset with IDA Pro 6.4, then generate their FCGs and CFGs accordingly, and finally store them in the JSONL file format.

ryderling avatar Apr 11 '22 02:04 ryderling

Thanks a lot, can both FCGs and CFGs be handled by Genius (https://github.com/qian-feng/Gencoding)?

lqc09 avatar Apr 11 '22 02:04 lqc09

Not yet. As far as I recall, Genius is used to extract CFGs, and it is easy to generate FCGs based on the Genius framework.

ryderling avatar Apr 11 '22 02:04 ryderling

Thanks, got it.

lqc09 avatar Apr 11 '22 03:04 lqc09

Hello author, I have read your paper and tried to use Genius (https://github.com/qian-feng/Gencoding) to get the CFGs from PE samples. I ran preprocessing_ida.py from Genius and got output (XXX.ida) like this: (i__main__ raw_graphs p1 (dp2 S'raw_graph_list' p3 (lp4 (iraw_graphs raw_graph p5 (dp6 S'entry' p7 I0 sS'fun_features' ...

So how can I get the CFG in JSONL file format? I would appreciate it if you could give more details about how to use Genius to generate CFGs in JSONL file format.

lizhangtan avatar May 06 '22 12:05 lizhangtan

Any solution? I have run into the same problem. There is also a file called train_external_function_name_vocab.jsonl that is needed before model training, and I have no idea how to generate it either.

KennenH avatar Jun 25 '23 14:06 KennenH

I don't know how to generate the file train_external_function_name_vocab.jsonl either. Do you have a solution?

Divine-sh avatar Jun 29 '23 09:06 Divine-sh

Reply to @KennenH and @Divine-sh:

As described in Section IV.A.2 of the paper, each node representing an external function in the FCG is one-hot encoded based on its function name, and we limit the vocabulary to the 10,000 external functions that are most frequently used in the training dataset. The file train_external_function_name_vocab.jsonl stores these top 10,000 external function names from the training dataset.

ryderling avatar Jun 29 '23 09:06 ryderling
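
For readers following along, the one-hot encoding described above can be sketched as below. This is only an illustration, not the MalGraph code: the helper names `build_index` and `one_hot` are hypothetical, and mapping names outside the vocabulary to an all-zero vector is an assumption, since the thread does not say how such names are handled.

```python
def build_index(vocab_names):
    """Map each vocabulary name to a fixed position (hypothetical helper)."""
    return {name: position for position, name in enumerate(vocab_names)}


def one_hot(name, index):
    """Return a one-hot list for `name`.

    Assumption: a name that is not in the vocabulary gets an all-zero vector.
    """
    vector = [0] * len(index)
    position = index.get(name)
    if position is not None:
        vector[position] = 1
    return vector


# Tiny stand-in vocabulary; the real one would hold up to 10,000 names.
index = build_index(["CreateFileA", "ReadFile", "WriteFile"])
encoded = one_hot("ReadFile", index)
```

With a 10,000-entry vocabulary each external-function node would get a 10,000-dimensional vector under this sketch.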

Can you explain how to obtain the top 10,000 you mentioned?

20521862 avatar Jul 01 '23 07:07 20521862

@20521862 I think the paper is quite clear: "it is one-hot encoded based on its function name and we limit the vocabulary size of external functions to 10,000 that are most frequently used in the training dataset". Count the number of calls to every external function and sort by that count.

KennenH avatar Jul 03 '23 03:07 KennenH

So does that mean I have to parse all the PE files, collect the function names that have been called 10,000 times in the training data, and save them in train_external_function_name_vocab.jsonl?

20521862 avatar Jul 04 '23 13:07 20521862

@20521862 "10,000 (external functions) that are most frequently used": this does not mean saving the functions that were called 10,000 times. It means taking the 10,000 functions that were called most often.

KennenH avatar Jul 04 '23 13:07 KennenH
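
The counting-and-sorting step discussed above could look roughly like this. It is a minimal sketch under stated assumptions: the input is assumed to be one list of external function names per training sample, the helpers `build_vocab` and `write_vocab_jsonl` are hypothetical, and the record fields `name` and `count` are placeholders, because the actual layout of train_external_function_name_vocab.jsonl is not described in this thread.

```python
import json
from collections import Counter


def build_vocab(samples, max_size=10_000):
    """Return the `max_size` most frequent names as (name, count) pairs.

    `samples` is an iterable of lists of external function names,
    one list per training sample (an assumed input shape).
    """
    counts = Counter()
    for external_names in samples:
        counts.update(external_names)
    return counts.most_common(max_size)


def write_vocab_jsonl(vocab, path):
    """Write one JSON object per line; the field names are hypothetical."""
    with open(path, "w", encoding="utf-8") as out:
        for name, count in vocab:
            out.write(json.dumps({"name": name, "count": count}) + "\n")


# Toy data standing in for names extracted from the training set.
toy_samples = [["CreateFileA", "ReadFile"], ["CreateFileA"]]
toy_vocab = build_vocab(toy_samples, max_size=1)
```

Whether a function should be counted once per call site or once per sample is not settled in the thread; this sketch counts every occurrence it is given.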

@ryderling Thank you very much for your reply. I have another question, the one raised earlier by @lizhangtan in this issue (the sixth post of this issue).

I also used Genius (https://github.com/qian-feng/Gencoding) to process the assembly of a PE file and obtained its output, an .ida file. How can I obtain the CFG in JSONL file format from this .ida file? Could you please provide more details? Any help would be greatly appreciated!

KennenH avatar Jul 19 '23 03:07 KennenH

@lizhangtan I've figured it out: the .ida file is actually data saved with pickle. Load it with pickle and you get a readable object.

KennenH avatar Jul 26 '23 03:07 KennenH
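
The pickle approach described above might be sketched as follows in Python 3. This is a hedged illustration, not the authors' pipeline. The output shown earlier looks like a text-mode pickle that references classes such as `raw_graphs` and `raw_graph`, so the sketch substitutes a placeholder for any class it cannot import. The names `Placeholder`, `LenientUnpickler`, `load_ida_pickle`, `to_jsonable` and `dump_jsonl` are hypothetical, the resulting JSON layout is not the schema MalGraph expects, and cyclic object graphs are not handled.

```python
import io
import json
import pickle


class Placeholder:
    """Stand-in for any pickled class that cannot be imported here."""


class LenientUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Use the real class when available, otherwise the placeholder.
        try:
            return super().find_class(module, name)
        except (ImportError, AttributeError):
            return Placeholder


def load_ida_pickle(fileobj):
    # Only unpickle files you trust: pickle can execute arbitrary code.
    # latin1 decodes byte strings written by Python 2 without errors.
    return LenientUnpickler(fileobj, encoding="latin1").load()


def to_jsonable(obj):
    """Recursively turn the loaded object into JSON-friendly values."""
    if isinstance(obj, Placeholder):
        return to_jsonable(vars(obj))
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [to_jsonable(v) for v in obj]
    if isinstance(obj, (str, int, float, bool)) or obj is None:
        return obj
    return repr(obj)  # fallback for anything else


def dump_jsonl(records, path):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(to_jsonable(record)) + "\n")
```

A file would be read with `with open(path, "rb") as fh: graphs = load_ida_pickle(fh)`, where `path` is the .ida file. Mapping the loaded attributes onto the fields MalGraph's JSONL expects still has to be done by hand.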