llvmcpy Functions with string length arguments are not correctly wrapped

The C APIs for e.g. MDStrings and attribute kinds use explicit string lengths.

For some functions this only makes the use inconvenient:

llvm.get_enum_attribute_kind_for_name(name, len(name.encode('utf-8')))
llvm.get_md_kind_id(name, len(name.encode('utf-8')))
llvm.md_string(content, len(content.encode('utf-8')))

For some it makes correct use impossible if the value contains null bytes, as the C functions return the length of the result string in a size_t* or unsigned* argument.

llvm.get_md_string(…): const char* (LLVMValueRef c, size_t *Length)
llvm.get_string_attribute_kind(…): const char* (LLVMAttributeRef A, unsigned *Length)
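
To illustrate the second problem without depending on LLVM, here is a minimal sketch using only the standard ctypes module. The buffer and its contents are hypothetical stand-ins for a string returned by the C API; the point is only that reading a char * as a NUL-terminated string drops everything after the first null byte, whereas honouring the reported length keeps it.

```python
import ctypes

# Hypothetical payload with an embedded null byte, standing in for an MDString.
payload = b"abc\x00def"
buf = ctypes.create_string_buffer(payload, len(payload))

# What a wrapper gets when it treats the result as a NUL-terminated C string.
truncated = ctypes.string_at(buf)

# What it gets when it honours the length reported by the C function.
full = ctypes.string_at(buf, len(payload))

print(truncated, full)
```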

Apr 10 '17 12:04 cgrenz

For the first problem, it's not always possible to safely understand that the argument coming after a char * is the length of the previous argument. In some cases we detect that, but not in all of them. I'll try and see if we can come up with a safe way to include those in your example.

More or less the same argument applies to the second problem, but that one is more serious.

I'll try to come up with a solution.

Thanks for reporting this.

Apr 10 '17 14:04 aleclearmind

The only way I see is to add special cases in the binding creation process for handling char * with a defined length. To be precise, I see two cases we need to handle:

The first case is about functions taking a char * parameter followed by an unsigned or size_t parameter that holds the length of the string. Unfortunately, this pattern is ambiguous:

LLVMValueRef LLVMAddGlobalInAddressSpace(LLVMModuleRef M, LLVMTypeRef Ty,
                                         const char *Name,
                                         unsigned AddressSpace) { ...

Here AddressSpace is not the length; it is unrelated to Name. Therefore, I suggest generating Python bindings in which the integer parameter defaults to None and is filled in with the length of the string unless the caller overrides it. Thus, the interface stays consistent but is easier to use. Example:

def get_md_kind_id(arg0, arg1=None):
    """See LLVMGetMDKindID"""
    encoded = arg0.encode("utf-8")
    if arg1 is None:
        # Default to the byte length of the encoded string, not the character count.
        arg1 = len(encoded)
    return libLLVM39.LLVMGetMDKindID(encoded, arg1)

The second case is about returning the length of the character array through a pointer to an integer. Here, for each function returning a char *, we can look for an unsigned * or size_t * parameter that receives the length of the returned string.
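
A rough sketch of what such a generated wrapper could do, written with ctypes so it runs without LLVM. Both fake_get_md_string and wrap_length_out are hypothetical names made up for this illustration; the fake function only mimics the shape `const char *f(size_t *Length)`, and the real bindings would of course call into libLLVM instead.

```python
import ctypes

# Hypothetical backing storage for the "C side"; note the embedded null byte.
_STORAGE = ctypes.create_string_buffer(b"key\x00value", 9)


def fake_get_md_string(length_ptr):
    """Stand-in for a C function of the shape `const char *f(size_t *Length)`."""
    length_ptr[0] = len(_STORAGE)        # write the length through the pointer
    return ctypes.addressof(_STORAGE)    # return the address of the characters


def wrap_length_out(c_function):
    """Build a wrapper that allocates the length slot and returns bytes."""
    def wrapper():
        length = ctypes.c_size_t(0)
        address = c_function(ctypes.pointer(length))
        # Read exactly `length` bytes instead of stopping at the first NUL.
        return ctypes.string_at(address, length.value)
    return wrapper


get_md_string = wrap_length_out(fake_get_md_string)
result = get_md_string()
print(result)
```

With this shape the caller never sees the length argument at all, which is why false positives matter: hiding an out-parameter that is not actually a length would change the meaning of the call.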

What do you think of these proposed changes?

Apr 18 '17 12:04 hperl

I like 1), however what do we do if the length is not the last argument? We should provide default values for all the arguments after it too, and the user should specify None, which is kinda annoying. Another option I had in mind was to have multiple function overloads with different suffixes, but it's not a super-elegant solution either.

Another option would be to make a guess based on the argument name. For instance the second argument of LLVMGetMDKindID is called SLen. I planned to implement something like this but cffi doesn't provide it, however since now we directly import pycparser it might be doable.
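
A minimal sketch of the name-based guess, assuming the parameter types and names have already been extracted (for example from a pycparser AST) into (type, name) tuples. The helper find_length_pairs and the naming rule "ends in Len or Length" are assumptions for illustration, not something the bindings currently do.

```python
import re

# Assumed naming convention: length parameters end in "Len" or "Length".
_LENGTH_NAME = re.compile(r"(len|length)$", re.IGNORECASE)
_STRING_TYPES = {"const char *", "char *"}
_INTEGER_TYPES = {"unsigned", "size_t"}


def find_length_pairs(params):
    """Return (string_index, length_index) pairs for a list of (type, name)."""
    pairs = []
    for index in range(len(params) - 1):
        str_type, _ = params[index]
        int_type, int_name = params[index + 1]
        if (str_type in _STRING_TYPES
                and int_type in _INTEGER_TYPES
                and _LENGTH_NAME.search(int_name)):
            pairs.append((index, index + 1))
    return pairs


# LLVMGetMDKindID(const char *Name, unsigned SLen)
md_kind_id = [("const char *", "Name"), ("unsigned", "SLen")]

# LLVMAddGlobalInAddressSpace(M, Ty, const char *Name, unsigned AddressSpace)
add_global = [
    ("LLVMModuleRef", "M"),
    ("LLVMTypeRef", "Ty"),
    ("const char *", "Name"),
    ("unsigned", "AddressSpace"),
]

print(find_length_pairs(md_kind_id))
print(find_length_pairs(add_global))
```

The heuristic accepts the SLen case and rejects the AddressSpace case, but it still depends on the headers naming their parameters consistently.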

For 2), looks good, we should just double check we don't have false positives in this case too.

Apr 18 '17 13:04 aleclearmind

Concerning 1), the integer always directly follows the char *, but it is not necessarily the last argument. However, we only need to set one parameter to None.
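
As a sketch of what I mean, with a hypothetical three-argument function whose length sits in the middle (_fake_c_call is a made-up stand-in for the real C call, and the function name is invented for the example):

```python
def _fake_c_call(data, length, flag):
    """Hypothetical stand-in for the underlying C function."""
    return (data, length, flag)


def string_with_flag(text, length, flag):
    """Wrapper where only the length slot accepts None as a placeholder."""
    data = text.encode("utf-8")
    if length is None:
        # Fill in the byte length of the encoded string.
        length = len(data)
    return _fake_c_call(data, length, flag)


# The caller passes None for the length only; the trailing argument is unchanged.
automatic = string_with_flag("h\u00e9llo", None, 1)
explicit = string_with_flag("hello", 3, 0)
print(automatic, explicit)
```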

Apr 19 '17 10:04 hperl

@cgrenz could you please update this issue stating what still doesn't work or works in a suboptimal way after e27b4547559d9cbd89a905f422cf66c4a98ca7c1?

May 20 '19 10:05 aleclearmind