MalGraph fcg cfg

Hello author, how do I process the FCG and CFG into JSONL?

Apr 10 '22 09:04 lqc09

Thanks for your attention. Based on the implementation of Genius (see https://github.com/qian-feng/Gencoding for details), we disassemble all PE samples in the dataset with IDA Pro 6.4, then generate their FCGs and CFGs accordingly, and finally store them in the JSONL file format.

Apr 11 '22 02:04 ryderling

Thanks a lot, can both FCGs and CFGs be handled by Genius (https://github.com/qian-feng/Gencoding)?

Apr 11 '22 02:04 lqc09

Not yet. As I recall, Genius is only used to extract CFGs, but it is easy to generate FCGs on top of the Genius framework.

Apr 11 '22 02:04 ryderling

thanks, got it

Apr 11 '22 03:04 lqc09

Hello author, I have read your paper and also tried to use Genius (https://github.com/qian-feng/Gencoding) to get the CFGs from PE samples. I ran the code in the preprocessing_ida.py of Genius and got the output (XXX.ida) like this: (i__main__ raw_graphs p1 (dp2 S'raw_graph_list' p3 (lp4 (iraw_graphs raw_graph p5 (dp6 S'entry' p7 I0 sS'fun_features' ...

So how can I get the CFG in JSONL file format? I would appreciate it if you could give more details about how to use Genius to generate CFGs in JSONL file format.

May 06 '22 12:05 lizhangtan

Any solution? I have run into the same problem. There is also a file called train_external_function_name_vocab.jsonl that is needed before model training, and I have no idea how to generate it either.

Jun 25 '23 14:06 KennenH

I don't know how to generate the file train_external_function_name_vocab.jsonl either. Do you have a solution?

Jun 29 '23 09:06 Divine-sh

Reply to @KennenH and @Divine-sh:

As described in Section IV.A.2 of the paper, each node representing an external function in the FCG is one-hot encoded based on its function name, and we limit the vocabulary to the 10,000 external functions that are most frequently used in the training dataset. The file train_external_function_name_vocab.jsonl stores these top 10,000 external function names from the training dataset.
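A minimal sketch of the one-hot encoding described above, assuming the vocabulary is already available as an ordered list of names. The helpers `build_index` and `one_hot` are hypothetical, and mapping out-of-vocabulary names to an all-zero vector is an assumption; this thread does not say how MalGraph treats them.

```python
def build_index(vocab):
    """Map each vocabulary name to its position (hypothetical helper)."""
    return {name: pos for pos, name in enumerate(vocab)}


def one_hot(name, index):
    """Return a one-hot list for `name` over the vocabulary in `index`.

    Assumption: names outside the vocabulary get an all-zero vector.
    """
    vec = [0] * len(index)
    pos = index.get(name)
    if pos is not None:
        vec[pos] = 1
    return vec


# Toy vocabulary; the real one would hold up to 10,000 names.
index = build_index(["CreateFileA", "ReadFile", "Sleep"])
print(one_hot("ReadFile", index))
```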

Jun 29 '23 09:06 ryderling

Can you tell me how to obtain the top 10,000 you mentioned?

Jul 01 '23 07:07 20521862

@20521862 I think the paper is quite clear on this: each external function is one-hot encoded based on its function name, and the vocabulary is limited to the 10,000 external functions that are most frequently used in the training dataset. So count how many times every external function is called and sort by that count.

Jul 03 '23 03:07 KennenH

So does that mean I have to parse all the PE files, collect the function names that have been called 10,000 times in the training data, and save them in train_external_function_name_vocab.jsonl?

Jul 04 '23 13:07 20521862

@20521862 No. It is the 10,000 external functions that are most frequently used: you do not save the functions that were called 10,000 times, you take the 10,000 functions that were called the most times.
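The count-and-sort step can be sketched as follows. This is only an illustration under assumptions: `samples` is a hypothetical iterable holding, for each training sample, the list of external function names it calls, and every call is counted (whether MalGraph counts calls or distinct samples is not stated here). The on-disk layout of train_external_function_name_vocab.jsonl is not described in this thread, so the sketch stops at the ordered list of names.

```python
from collections import Counter


def build_vocab(samples, max_size=10000):
    """Return the `max_size` most frequent external function names.

    `samples` is assumed to be an iterable of lists of names,
    one list per training sample (hypothetical input shape).
    """
    counts = Counter()
    for names in samples:
        counts.update(names)  # one increment per occurrence
    # most_common sorts by descending count and keeps the top entries
    return [name for name, _ in counts.most_common(max_size)]


# Toy data standing in for the parsed training set.
samples = [
    ["CreateFileA", "ReadFile"],
    ["CreateFileA"],
    ["CreateFileA", "ReadFile", "Sleep"],
]
print(build_vocab(samples, max_size=2))
```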

Jul 04 '23 13:07 KennenH

@ryderling Thank you very much for your reply. I have another question, the one raised earlier by @lizhangtan (the sixth post of this issue).

I also used Genius (https://github.com/qian-feng/Gencoding) to process the assembly file of a PE sample and obtained its output, an .ida file. How can I obtain the CFG in JSONL file format from this .ida file? Could you please provide more details? Any help would be greatly appreciated!

Jul 19 '23 03:07 KennenH

@lizhangtan I've figured it out: the .ida file is actually data saved with pickle. Reload it with pickle and you get a readable object.
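A rough sketch of that reload step, followed by a dump to JSONL. The attribute names `raw_graph_list`, `entry` and `fun_features` are taken from the pickle text quoted earlier in this thread; everything else is an assumption. In particular, the classes referenced by the pickle (they appear to be `raw_graphs` and `raw_graph`) must be importable when loading, `encoding="latin1"` assumes the file was written under Python 2 and is read under Python 3, and the output records only mirror those attributes, not the JSONL schema MalGraph expects.

```python
import json
import pickle


def load_ida(path):
    """Unpickle a Genius .ida file.

    Assumption: the Genius classes named in the pickle are importable,
    and the file was written by Python 2 (hence encoding="latin1").
    """
    with open(path, "rb") as fh:
        return pickle.load(fh, encoding="latin1")


def graphs_to_records(graphs):
    """Yield one plain dict per function found in the unpickled object."""
    for g in graphs.raw_graph_list:
        yield {
            "entry": getattr(g, "entry", None),
            "fun_features": getattr(g, "fun_features", None),
        }


def to_jsonl(records):
    """Serialize records as JSON Lines; unknown types fall back to str()."""
    return "".join(json.dumps(rec, default=str) + "\n" for rec in records)


def convert(ida_path, jsonl_path):
    """Load one .ida file and write its functions to a JSONL file."""
    graphs = load_ida(ida_path)
    with open(jsonl_path, "w", encoding="utf-8") as fh:
        fh.write(to_jsonl(graphs_to_records(graphs)))
```

Calling `convert("XXX.ida", "XXX.jsonl")` would then write one line per function; the edges of each CFG would have to be added to the records in the same way once the relevant attributes of the loaded objects have been inspected.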

Jul 26 '23 03:07 KennenH
