representation-engineering Question about the honesty scores calculation

Open Jeffwang87 opened this issue 1 year ago • 1 comment

In your honesty score calculation, what is the justification for

results[pos][0][layer][0] * honesty_rep_reader.direction_signs[layer][0]

Why do you need to multiply by the direction sign, rather than just using results[pos][0][layer][0]?
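
To make the question concrete, here is a minimal sketch with made-up numbers. The dictionaries below are hypothetical stand-ins that only mimic the indexing in the expression above, and I am assuming direction_signs holds +1 or -1 per layer; this is not the repo's actual code or data.

```python
# Hypothetical stand-ins: one token position (0) and two layers (-1, -2).
# results[pos][0][layer][0] plays the role of the raw score for that layer.
results = {0: [{-1: [0.8], -2: [-0.5]}]}

# Assumption: one sign per layer, either +1 or -1.
direction_signs = {-1: [1], -2: [-1]}


def raw_score(pos, layer):
    """Score without the sign correction."""
    return results[pos][0][layer][0]


def signed_score(pos, layer):
    """Score multiplied by the per-layer direction sign."""
    return results[pos][0][layer][0] * direction_signs[layer][0]


for layer in (-1, -2):
    print(layer, raw_score(0, layer), signed_score(0, layer))
```

With these made-up values the two variants agree for the layer whose sign is +1 and differ only in sign for the layer whose sign is -1, so what I am asking is why that flip is needed.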

Thanks

Nov 29 '23 18:11 Jeffwang87