
Updated to Phi-3.5

Open neoOpus opened this issue 1 year ago • 4 comments

I have a question if you don't mind.

Do you think that using Uncensored models would be better for reverse engineering purposes?

neoOpus avatar Sep 11 '24 19:09 neoOpus

This is the error from the PR. I thought it would be a simple drop-in change from 3.1 to 3.5, but I guess I have to learn more about the differences between the tokenization of the two.

```
# [2024-09-12 04:24:09]  Loading model with options {
#   modelPath: '/Users/runner/.humanifyjs/models/Phi-3.5-mini-instruct-Q4_K_M.gguf',
#   gpuLayers: 0
# }
# [node-llama-cpp] Using this model ("~/.humanifyjs/models/Phi-3.5-mini-instruct-Q4_K_M.gguf") to tokenize text with special tokens and then detokenize it resulted in a different text. There might be an issue with the model or the tokenizer implementation. Using this model may not work as intended
# Subtest: /Users/runner/work/humanify/humanify/src/test/e2e.geminitest.ts
not ok 1 - /Users/runner/work/humanify/humanify/src/test/e2e.geminitest.ts
  ---
```

Originally posted by @neoOpus in https://github.com/jehna/humanify/issues/58#issuecomment-2345511559
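For reference, that warning comes from a tokenizer round-trip sanity check: the text is tokenized with special tokens enabled, detokenized again, and the model gets flagged if the result differs. A rough sketch of what that check amounts to, assuming node-llama-cpp v3's `getLlama()`/`loadModel()`/`tokenize()`/`detokenize()` API (an illustration only, not humanify's or node-llama-cpp's actual code):

```typescript
// Sketch only: assumes node-llama-cpp v3's API; the path and probe text are placeholders.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "/Users/runner/.humanifyjs/models/Phi-3.5-mini-instruct-Q4_K_M.gguf",
  gpuLayers: 0,
});

// Tokenize a probe string with special tokens enabled, then detokenize it again.
const probe = "<|user|>\nhello<|end|>"; // hypothetical probe text
const tokens = model.tokenize(probe, true);
const roundTripped = model.detokenize(tokens);

if (roundTripped !== probe) {
  // This is the situation the "[node-llama-cpp] ... resulted in a different text" warning describes.
  console.warn("Tokenizer round trip changed the text:", { probe, roundTripped });
}
```

A mismatch like that usually points at the special-token/chat-template metadata baked into the GGUF rather than anything in humanify itself.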

0xdevalias avatar Sep 12 '24 08:09 0xdevalias

> I have a question if you don't mind.
>
> Do you think that using Uncensored models would be better for reverse engineering purposes?

Let's think critically: censorship would only be a pressure point if the AI were being asked to create malicious content. Humanify, however, usually queries the AI model simply to assess/summarize the code internally, and the model then returns variable or function names that closely resemble their function/usage in the code. A censored model does not inherently discard information; it just will not respond about the censored topics/phrases. It is unlikely that the AI model would reject proper names or throw an error unless the back-end censor systems were flawed in the first place.
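To make that concrete, the kind of query being described looks roughly like the following. This is a sketch using the OpenAI Node SDK as a stand-in; the prompt wording, model name, and `suggestName` helper are illustrative, not humanify's actual implementation:

```typescript
// Illustrative only: the prompt, model name, and helper are placeholders, not humanify's code.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask the model for a single descriptive identifier; the code itself is never rewritten here.
async function suggestName(
  kind: "variable" | "function",
  currentName: string,
  code: string
): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      {
        role: "system",
        content:
          "You rename minified JavaScript identifiers. Reply with a single descriptive identifier and nothing else.",
      },
      {
        role: "user",
        content: `Suggest a better name for the ${kind} \`${currentName}\` in this code:\n\n${code}`,
      },
    ],
  });
  return completion.choices[0].message.content?.trim() ?? currentName;
}

// Nothing in this request asks the model to "reverse engineer" anything,
// so content filters rarely have a reason to trigger.
// Example: await suggestName("function", "a", "function a(b){return b*b}");
```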

I do think there is merit to supporting a diverse set of models, including uncensored ones. The statement "uncensored is better" is a fallacy, as it is subjective and the claim is both unverifiable and unfalsifiable. On the other hand, a model that is fine-tuned on, or trained with a dataset relevant to, the code's domain will perform better, as that can help reduce hallucinations from the AI.

Acters avatar Sep 18 '24 15:09 Acters

> I have a question if you don't mind. Do you think that using Uncensored models would be better for reverse engineering purposes?

> Let's think critically: censorship would only be a pressure point if the AI were being asked to create malicious content. Humanify, however, usually queries the AI model simply to assess/summarize the code internally, and the model then returns variable or function names that closely resemble their function/usage in the code. A censored model does not inherently discard information; it just will not respond about the censored topics/phrases. It is unlikely that the AI model would reject proper names or throw an error unless the back-end censor systems were flawed in the first place.

> I do think there is merit to supporting a diverse set of models, including uncensored ones. The statement "uncensored is better" is a fallacy, as it is subjective and the claim is both unverifiable and unfalsifiable. On the other hand, a model that is fine-tuned on, or trained with a dataset relevant to, the code's domain will perform better, as that can help reduce hallucinations from the AI.

My suggestion stemmed from the observation that when asking ChatGPT, Gemini, or similar tools to reverse engineer something, they often respond with restrictions. While I know some techniques to bypass this (jailbreaking), I proposed using an uncensored option to conserve pre-instruction tokens.

I respectfully disagree with not supporting multiple models from the start. Currently, there isn't an optimal free API or local model that works for everyone.

Sure, sticking to a single reliable model would minimize bug reports and issues and would facilitate building a shared database that enhances deobfuscation, by avoiding inconsistencies in variable names across models. But encouraging experimentation from the start benefits everyone involved: allowing it until the project reaches a certain level of maturity, without pushing everyone to create their own fork, would be beneficial.

neoOpus avatar Sep 18 '24 22:09 neoOpus

> My suggestion stemmed from the observation that when asking ChatGPT, Gemini, or similar tools to reverse engineer something, they often respond with restrictions. While I know some techniques to bypass this (jailbreaking), I proposed using an uncensored option to conserve pre-instruction tokens.

Asking for unethical actions would run into this type of restriction. However, humanify only asks the AI to analyze the code and return new names that fit the usage of each variable or function. Clearly reverse engineering has gotten a bad reputation; I believe it should be allowed, but it is not.

Instead of brazenly asking it to reverse the code (which would also work worse, as it is a complex task), it is better to break down the tasks required for reversing: ask the AI to clarify how the code works, to refactor the code, or to help make the code easier to read. This taps into its coding-assistant behaviors instead of whatever the censor classifies reverse engineering as. Do note: humanify does NOT do this, as it only asks for help renaming things.

> Sure, sticking to a single reliable model would minimize bug reports and issues and would facilitate building a shared database that enhances deobfuscation, by avoiding inconsistencies in variable names across models. But encouraging experimentation from the start benefits everyone involved: allowing it until the project reaches a certain level of maturity, without pushing everyone to create their own fork, would be beneficial.

I never said to stick to one model. What I alluded to is that whoever wants to use an AI model for a specific purpose would want to fine-tune it to produce fewer hallucinations.

As it stands, humanify does not need a fine-tuned model, and support for other models exists so the tool stays agnostic of, and independent from, any single source.

It also seems that you have the wrong notion of what the AI model is used for. The AI model is NOT used for deobfuscation. The AI model is used to help create human-readable names for variables and functions for un-minification purposes. In no way is the AI touching the code. Read the section titled "Don't let AI touch the code" in the project's blog post: https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification
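To illustrate the "don't let AI touch the code" point: the model only supplies a suggested name, and the actual rename is a deterministic, scope-aware AST transform. Below is a minimal sketch of such a rename using Babel; it shows the general idea rather than humanify's exact implementation:

```typescript
// Sketch of a scope-aware rename: the LLM only supplies `newName`; the transform is deterministic.
import { parse } from "@babel/parser";
import traverse from "@babel/traverse";
import generate from "@babel/generator";

function renameIdentifier(source: string, oldName: string, newName: string): string {
  const ast = parse(source, { sourceType: "module" });

  traverse(ast, {
    Identifier(path) {
      // Rename through the binding's scope so only the right occurrences change,
      // never identifiers that merely share the same text.
      if (path.node.name === oldName && path.scope.hasBinding(oldName)) {
        path.scope.rename(oldName, newName);
      }
    },
  });

  return generate(ast).code;
}

// Example: if the model suggests "square" for the minified name "a",
// the rename itself is pure AST surgery.
console.log(renameIdentifier("function a(b){return b*b}", "a", "square"));
// prints the same function with `a` renamed to `square`
```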

> I respectfully disagree with not supporting multiple models from the start. Currently, there isn't an optimal free API or local model that works for everyone.

This statement confuses me, as humanify already supports multiple models, including a free local model.

Acters avatar Sep 19 '24 00:09 Acters