SVF icon indicating copy to clipboard operation
SVF copied to clipboard

WPA tool crashes with -locMM switch on string handling

Open pietroborrello opened this issue 5 years ago • 14 comments

I'm trying to build an analysis where pointers stored in an array do not share the same points-to set, to be able to distinguish the objects that may be pointed by each of the pointers in the array. However, enabling the -locMM flag (for which I'm still not sure it would solve my problem but saw #11 ), segfaults the wpa tool during PointerAnalysis initialization, while parsing string accesses.

To reproduce (a minimal poc): main.c

#include <stdio.h>
int main(int argc, char **argv) {
	printf("hello\n");
	return 0;
}

then execute:

$ clang -c -fno-discard-value-names -emit-llvm main.c -o main.bc
$ opt -mem2reg main.bc -o main.opt
$ wpa -ander -debug --dwarn=0 --print-all-pts -locMM -stats=0 main.opt

The tool segfaults with the following output:

> $ wpa -ander -debug --dwarn=0 --print-all-pts -locMM -stats=0 main.opt                                                            
Args: wpa -ander -debug --dwarn=0 --print-all-pts -locMM -stats=0 main.opt 
Building PAG ...
Building Symbol table ...
collect sym from ##@.str = private unnamed_addr constant [7 x i8] c"hello\0A\00", align 1 
create a new value sym 4
handle constant expression i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str, i64 0, i64 0)
create a new value sym 5
[1]    24137 segmentation fault (core dumped)  wpa -ander -debug --dwarn=0 --print-all-pts -locMM -stats=0 main.opt

At a quick analysis in a debugger the following stack trace is generated:

[#0] 0x555555636122 → SymbolTableInfo::getStructInfo(this=0x5555566cfbf0, T=0x5555566d1150)
[#1] 0x555555632be1 → SymbolTableInfo::isConstantObjSym(this=0x5555566cfbf0, val=0x5555566d2858)
[#2] 0x5555556323e8 → SymbolTableInfo::collectVal(this=0x5555566cfbf0, val=0x5555566d2858)
[#3] 0x555555632db7 → SymbolTableInfo::handleCE(this=0x5555566cfbf0, val=0x5555566d90c8)
[#4] 0x55555563319b → SymbolTableInfo::handleGlobalCE(this=0x5555566cfbf0, G=0x5555566d2858)
[#5] 0x5555556323cf → SymbolTableInfo::collectVal(this=0x5555566cfbf0, val=0x5555566d2858)
[#6] 0x555555633033 → SymbolTableInfo::handleCE(this=0x5555566cfbf0, val=0x5555566d2858)
[#7] 0x5555556321c8 → SymbolTableInfo::collectSym(this=0x5555566cfbf0, val=0x5555566d2858)
[#8] 0x555555631883 → SymbolTableInfo::buildMemModel(this=0x5555566cfbf0, svfModule=0x5555566c9610)
[#9] 0x5555556a22a1 → PointerAnalysis::initialize(this=0x5555566cc368, svfModule=0x5555566c9610)

And the tool seems to crash in the following statement in lib/SVF-FE/SymbolTableInfo.cpp:

    341	     ///Get a reference to StructInfo.
    342	     inline StInfo* getStructInfo(const Type *T) {
 →  343	         return getStructInfoIter(T)->second;
    344	     }
_____________________________________________________________
 → 0x555555636122 <SymbolTableInfo::getStructInfo(llvm::Type+0> mov    rax, QWORD PTR [rax+0x8]

Due to getStructInfoIter(T) returning NULL. I suspect Type information not getting properly initialized.

pietroborrello avatar Sep 28 '20 17:09 pietroborrello

-locMM has not been supported for quite a while after SVF migrated to LLVM-7.0 onwards. We would appreciate you can adapt or help us with this option.

yuleisui avatar Oct 01 '20 09:10 yuleisui

If I understand it correctly, with this option SVF would allow me to treat array items as if they were different fields of a struct. Is that correct? In an array of pointers, I would like to be able to differentiate the points-to sets at least for constant indexes of the array.

I have something like this:

#define M_POINTS    8
void foo() {
	unsigned char* M[M_POINTS];
	int i = 0;

	#pragma unroll(M_POINTS)
	for (i=0; i < M_POINTS; ++i) {
		M[i] = (unsigned char*)malloc(1024);
		if (M[i] == 0) {
			return;
		}
	}

	/* operations on the array of pointers */
	printf("%p\n", M[0]);
	printf("%p\n", M[1]);
	return;
}

I have manually set the malloc loop to be unrolled, otherwise, all the objects would be from the same allocation site, and all the pointers would share the same object node if I'm not wrong. I have not only operations on fixed indexes, but also operations striding all the indices of the array of pointers, like:

for (i=0; i < M_POINTS; ++i) {
    bar(M[i]);
}

In this example bar then performs memory accesses using the pointer argument. Is there any way to be able to distinguish also these operations? I don't think so without unrolling the loop and inlining the call, but correct me if I'm wrong. I have to analyze the possible objects an operation may access, and minimizing the object set would really help. In my use case, I can apply semantic preserving code transformations.

If you could confirm that -locMM would help me to manage these issues, I will be happy to try looking into it and post my findings here. Have you any hint on where to start looking?

pietroborrello avatar Oct 01 '20 17:10 pietroborrello

Hi @yuleisui do you have any guess on what may have broken the -locMM option starting from LLVM-7.0? And do you have any thoughts on my previous message? Thank you

pietroborrello avatar Oct 06 '20 17:10 pietroborrello

Sorry for the late reply. some time to debug. Overloaded recently... The implementation was quite a while ago and was not used or tested since 9.0.0. I will have to find some time to debug if you can not solve it.

The key idea is to get the byte-size of an object. You may wish to take a look at this first: https://github.com/SVF-tools/SVF/blob/master/include/SVF-FE/SymbolTableInfo.h#L451-L503

yuleisui avatar Oct 07 '20 08:10 yuleisui

No problem and thank you for all the work on the project :)

It seems the crash being caused by getStructInfoIter calling collectTypeInfo on an ArrayType, which then calls collectArrayInfo that is an empty function (all code commented out which then does not fill the type info in typeToFieldInfo) for LocSymTableInfo::collectArrayInfo. It seems the whole function body was commented out in commit a569935b57d8aa80ddad405890aa6736c70b1a07 along with different changes on MemModel.

I'm not sure if the fix would be simply to uncomment the code and port that to use the new methods in the class, or if the commented code was done intentionally to move the locMM logic outside, so should remain commented and the fix to do is somewhere else.

pietroborrello avatar Oct 07 '20 13:10 pietroborrello

You can uncomment the code and have a try and gradually adapt the code to this new version.

yuleisui avatar Oct 07 '20 13:10 yuleisui

Ok I have a working version of the locMM switch on the last commit now, but before issuing a PR I would like to understand something better:

Looking at the code it seems that the functionality of LocSymTableInfo::collectArrayInfo and LocSymTableInfo::collectStructInfo have been merged into SymbolTableInfo::collectArrayInfo and SymbolTableInfo::collectStructInfo that now fill the FieldInfo of the types both with field indexes and offsets. Indeed removing the methods from LocSymTableInfo and making it use the parent class ones seems to correctly fill the metadata.

The problem is that in various places in the code the LocationSet::getOffset() method is used, which returns the field index.

Examples of such usage are in PAG::addGepObjNode with:

    NodeID gepId = (ls.getOffset() + 1) * gepMultiplier + base;

or in PAG::getGepObjNode with:

    // Base and first field are the same memory location.
    if (FirstFieldEqBase && ls.getOffset() == 0) return base;

This makes the computations to reason only on the field index. Manually changing such occurrences to LocationSet::getByteOffset() makes the -locMM switch to work, but obviously would break "non locMM" compatibility.

And in fact running the test suite, the following tests fail:

The following tests FAILED:
	  4 - basic_c_tests/array-constIdx.c (Child aborted)
	 56 - fs_tests/array_alias_1.c (Child aborted)
	 57 - fs_tests/array_alias_2.c (Child aborted)
	 59 - fs_tests/array_alias_4.c (Child aborted)
	 60 - fs_tests/array_alias_5.c (Child aborted)

As a reference the first one is reported:

#include "aliascheck.h"

struct MyStruct {
	int * f1;
	int * f2;
};

int main() {
	struct MyStruct s[2];
	int a,b;
	s[0].f1 = &a;
	s[1].f1 = &b;

	// Different fields of different elements in a 
	// certain array are treated as different objects.
	NOALIAS(s[0].f1, s[1].f2);
	MAYALIAS(s[0].f1, s[1].f1);

	return 0;
}

which fails on

[AndersenWPA] Checking MAYALIAS
	 FAILURE :MAYALIAS check <id:289, id:293> at ({  })

This should show that the 0th and 1st entry of the array are treated as a two different fields, as in locMM

Have you any advice on how to deal with that? I could add a method to SymbolTableInfo to provide a boolean locMM state, on which decide if the code should use LocationSet::getByteOffset() or LocationSet::getOffset() but I'm not sure if it is clean enough.

pietroborrello avatar Oct 09 '20 15:10 pietroborrello

Ok I have a working version of the locMM switch on the last commit now, but before issuing a PR I would like to understand something better:

Looking at the code it seems that the functionality of LocSymTableInfo::collectArrayInfo and LocSymTableInfo::collectStructInfo have been merged into SymbolTableInfo::collectArrayInfo and SymbolTableInfo::collectStructInfo that now fill the FieldInfo of the types both with field indexes and offsets. Indeed removing the methods from LocSymTableInfo and making it use the parent class ones seems to correctly fill the metadata.

The problem is that in various places in the code the LocationSet::getOffset() method is used, which returns the field index.

Examples of such usage are in PAG::addGepObjNode with:

    NodeID gepId = (ls.getOffset() + 1) * gepMultiplier + base;

or in PAG::getGepObjNode with:

    // Base and first field are the same memory location.
    if (FirstFieldEqBase && ls.getOffset() == 0) return base;

This makes the computations to reason only on the field index. Manually changing such occurrences to LocationSet::getByteOffset() makes the -locMM switch to work, but obviously would break "non locMM" compatibility.

Yes, this will break the non locMM. One way to do is to still invoke LocationSet::getOffset() but return its byte offset?

And in fact running the test suite, the following tests fail:

The following tests FAILED:
	  4 - basic_c_tests/array-constIdx.c (Child aborted)
	 56 - fs_tests/array_alias_1.c (Child aborted)
	 57 - fs_tests/array_alias_2.c (Child aborted)
	 59 - fs_tests/array_alias_4.c (Child aborted)
	 60 - fs_tests/array_alias_5.c (Child aborted)

As a reference the first one is reported:

#include "aliascheck.h"

struct MyStruct {
	int * f1;
	int * f2;
};

int main() {
	struct MyStruct s[2];
	int a,b;
	s[0].f1 = &a;
	s[1].f1 = &b;

	// Different fields of different elements in a 
	// certain array are treated as different objects.
	NOALIAS(s[0].f1, s[1].f2);
	MAYALIAS(s[0].f1, s[1].f1);

	return 0;
}

which fails on

[AndersenWPA] Checking MAYALIAS
	 FAILURE :MAYALIAS check <id:289, id:293> at ({  })

This should show that the 0th and 1st entry of the array are treated as a two different fields, as in locMM

Have you any advice on how to deal with that? I could add a method to SymbolTableInfo to provide a boolean locMM state, on which decide if the code should use LocationSet::getByteOffset() or LocationSet::getOffset() but I'm not sure if it is clean enough.

yuleisui avatar Oct 10 '20 12:10 yuleisui

The best solution is to initialized LocationSet::getByteOffset() and LocationSet::getOffset() properly using LocSymTableInfo, however, I guess there is one or two places as you mentioned, only used LocationSet::getOffset() to create GepObjNode, which makes LocMM and non-locMM hard to co-exist. Any idea? @mbarbar @pietroborrello

yuleisui avatar Oct 10 '20 12:10 yuleisui

I was thinking of adding a third method to LocationSet like LocationSet::getByteOrFieldOffset(bool returnByteOff) that returns either the byte or field offset based on the boolean flag, which can be constructed like bool returnByteOff = SVFUtil::isa<LocSymTableInfo>(SymbolTableInfo::Symbolnfo()) but it appears more as an hack to me, than a clean solution

pietroborrello avatar Oct 12 '20 11:10 pietroborrello

Moreover, I'm not sure if I can blindly substitute any occurrence of LocationSet::getOffset() with the method, or if there is any point in the code which relies on LocationSet::getOffset() regardless of the locMM flag

pietroborrello avatar Oct 12 '20 11:10 pietroborrello

I was thinking of adding a third method to LocationSet like LocationSet::getByteOrFieldOffset(bool returnByteOff) that returns either the byte or field offset based on the boolean flag, which can be constructed like bool returnByteOff = SVFUtil::isa<LocSymTableInfo>(SymbolTableInfo::Symbolnfo()) but it appears more as an hack to me, than a clean solution

LocSymTableInfo is a child class of SymbolTableInfo, I think it would be good to use polymorphism (making getOffset() as a virtual function) to return the corresponding byteoffset?

yuleisui avatar Oct 12 '20 12:10 yuleisui

LocSymTableInfo is a child class of SymbolTableInfo, I think it would be good to use polymorphism (making getOffset() as a virtual function) to return the corresponding byteoffset?

So do you mean creating a child class for LocationSet which returns the byteoffset when LocationSet::getOffset is called? The calls I'm referring to are in the PAG class and not in LocSymTableInfo nor SymbolTableInfo. Also, the problem is that the LocationSet is created outside of the SymbolTableInfo, for example in PAGBuilder::processCE, so would be unaware of which object should create. Maybe he could check the class of SymbolTableInfo::Symbolnfo(). Did I misunderstand something?

pietroborrello avatar Oct 12 '20 12:10 pietroborrello

LocationSet is used to maintain both field and byteoffset information, I do not think we need a child class of it. Maybe you can try to submit a pull request and we can start discussion from there?

yuleisui avatar Oct 12 '20 12:10 yuleisui