opendal icon indicating copy to clipboard operation
opendal copied to clipboard

Dotnet binding fails on windows

Open fallenwood opened this issue 1 year ago • 17 comments

Dotnet binding fails on windows, and passed on linux with net8

→ dotnet test  .\DotOpenDAL.Tests\
  Determining projects to restore...
  All projects are up-to-date for restore.
  DotOpenDAL -> \opendal\incubator-opendal\bindings\dotnet\DotOpenDAL\bin\Debug\net8.0\DotOpenDAL.dll
  DotOpenDAL.Tests -> \opendal\incubator-opendal\bindings\dotnet\DotOpenDAL.Tests\bin\Debug\net8.0\DotOpenD
  AL.Tests.dll
Test run for \opendal\incubator-opendal\bindings\dotnet\DotOpenDAL.Tests\bin\Debug\net8.0\DotOpenDAL.Tests.dll (.NETCoreApp,Version=v8.0)
Microsoft (R) Test Execution Command Line Tool Version 17.8.0 (x64)
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
[xUnit.net 00:00:00.32]     DotOpenDAL.Tests.BlockingOperatorTest.TestReadWrite [FAIL]
  Failed DotOpenDAL.Tests.BlockingOperatorTest.TestReadWrite [72 ms]
  Error Message:
   Assert.NotEqual() Failure
Expected: Not 0
Actual:   0
  Stack Trace:
     at DotOpenDAL.Tests.BlockingOperatorTest.TestReadWrite() in \opendal\incubator-opendal\bindings\dotnet\DotOpenDAL.Tests\BlockingOperatorTest.cs:line 29
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

Failed!  - Failed:     1, Passed:     0, Skipped:     0, Total:     1, Duration: < 1 ms - DotOpenDAL.Tests.dll (net8.0)

When passing a C# string to c_char, it's UTF16 internal on windows, and then fail to parse in rust part

fallenwood avatar Jan 10 '24 09:01 fallenwood

Invite @tisonkun join the discussion.

Xuanwo avatar Jan 10 '24 09:01 Xuanwo

Does dotnet has native types like Vec<u8> or bytes? Maybe we should provide correct API instead string at dotnet side?

Xuanwo avatar Jan 10 '24 09:01 Xuanwo

Does dotnet has native types like Vec<u8> or bytes? Maybe we should provide correct API instead string at dotnet side?

With a workaround, the example test can be fixed with

    public BlockingOperator()
    {
        var ptr = Marshal.StringToHGlobalAnsi("memory");
        Op = blocking_operator_construct(ptr);
    }

    [DllImport(
        "opendal_dotnet",
        EntryPoint = "blocking_operator_construct",
        CallingConvention = CallingConvention.Cdecl,
        CharSet = CharSet.Auto)]
    private static extern IntPtr blocking_operator_construct(IntPtr scheme);

But it works for ansi strings only, and for non-ansi (e.g. Simplified Chinese), the original version also fails with linux

fallenwood avatar Jan 10 '24 09:01 fallenwood

But it works for ansi strings only, and for non-ansi (e.g. Simplified Chinese), the original version also fails with linux

Oh, that's unexpected. Our public API should accept any bytes [u8] instead of just UTF-8 or ASCII.

Xuanwo avatar Jan 10 '24 09:01 Xuanwo

But it works for ansi strings only, and for non-ansi (e.g. Simplified Chinese), the original version also fails with linux

Oh, that's unexpected. Our public API should accept any bytes [u8] instead of just UTF-8 or ASCII.

Apologize, the Chinese character does work as I passed it to a wrong place.

fallenwood avatar Jan 10 '24 09:01 fallenwood

With Marshal.StringToCoTaskMemUTF8 and Marshal.PtrToStringUTF8 it works on both windows and linux. but I am not sure with performance

code
public class BlockingOperator
{
    public IntPtr Op { get; }

    public BlockingOperator()
    {
        var scheme = Marshal.StringToCoTaskMemUTF8("memory");
        Op = blocking_operator_construct(scheme);
    }

    public void Write(string path, string content)
    {
        var pathPtr = Marshal.StringToCoTaskMemUTF8(path);
        var contentPtr = Marshal.StringToCoTaskMemUTF8(content);
        blocking_operator_write(Op, pathPtr, contentPtr);
    }

    public string Read(string path)
    {
        var pathPtr = Marshal.StringToCoTaskMemUTF8(path);
        var ptr = blocking_operator_read(Op, pathPtr);
        return Marshal.PtrToStringUTF8(ptr)!;
    }

    [DllImport(
        "opendal_dotnet",
        EntryPoint = "blocking_operator_construct",
        CallingConvention = CallingConvention.Cdecl,
        CharSet = CharSet.Auto)]
    private static extern IntPtr blocking_operator_construct(IntPtr scheme);

    [DllImport(
        "opendal_dotnet",
        EntryPoint = "blocking_operator_write",
        CallingConvention = CallingConvention.Cdecl,
        CharSet = CharSet.Auto)]
    private static extern void blocking_operator_write(IntPtr op, IntPtr path, IntPtr content);

    [DllImport(
        "opendal_dotnet",
        EntryPoint = "blocking_operator_read",
        CallingConvention = CallingConvention.Cdecl,
        CharSet = CharSet.Auto)]
    private static extern IntPtr blocking_operator_read(IntPtr op, IntPtr path);
}

full code at https://github.com/fallenwood/incubator-opendal/pull/1 as draft

fallenwood avatar Jan 10 '24 13:01 fallenwood

Marshal.StringToCoTaskMemUTF8 would be more safety for this circumstance

StringToCoTaskMemUTF8 is useful for custom marshaling or for use when mixing managed and unmanaged code. Because this method allocates the unmanaged memory required for a string including a null terminator, always free the memory by calling Marshal.FreeCoTaskMem. This method provides the opposite functionality of Marshal.PtrToStringUTF8. The characters of the string are copied as UTF-8 characters.

PtrToStringUTF8 is useful for custom marshaling or for use when mixing managed and unmanaged code. Because this method creates a copy of the unmanaged string's contents, you must free the original string as appropriate. This method provides the opposite functionality of the Marshal.StringToCoTaskMemUTF8 methods.

Zheaoli avatar Jan 10 '24 17:01 Zheaoli

With Marshal.StringToCoTaskMemUTF8 and Marshal.PtrToStringUTF8 it works on both windows and linux. but I am not sure with performance

Maybe we should use byte[] in our .NET API instead of string? I don't think it's correct to use UTF8 convert here.

Xuanwo avatar Jan 10 '24 17:01 Xuanwo

Maybe we should use byte[] in our .NET API instead of string? I don't think it's correct to use UTF8 convert he

I think the UTF8 here is enough for .NET binding. Here's the reason:

The core problem here is that we don't have a dot net binding SDK here. So we need to process the DLL call convension. We can take the JNI binding sdk as a example

impl<'local, 'other_local: 'obj_ref, 'obj_ref> JavaStr<'local, 'other_local, 'obj_ref> {
    /// Get a pointer to the character array beneath a [JString]
    ///
    /// The string will be `NULL` terminated and encoded as
    /// [Modified UTF-8](https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8) /
    /// [CESU-8](https://en.wikipedia.org/wiki/CESU-8).
    ///
    /// The implementation may either create a copy of the character array for
    /// the given `String` or it may pin it to avoid it being collected by the
    /// garbage collector.
    ///
    /// Returns a tuple with the pointer and the status of whether the implementation
    /// created a copy of the underlying character array.
    ///
    /// # Warning
    ///
    /// The caller must release the array when they are done with it via
    /// [Self::release_string_utf_chars]
    ///
    /// # Safety
    ///
    /// The caller must guarantee that the Object passed in is an instance of `java.lang.String`,
    /// passing in anything else will lead to undefined behaviour (The JNI implementation
    /// is likely to crash or abort the process).
    unsafe fn -get_string_utf_chars(
        env: &JNIEnv<'_>,
        obj: &JString<'_>,
    ) -> Result<(*const c_char, bool)> {
        non_null!(obj, "get_string_utf_chars obj argument");
        let mut is_copy: jboolean = 0;
        let ptr: *const c_char = jni_non_null_call!(
            env.get_raw(),
            GetStringUTFChars,
            obj.as_raw(),
            &mut is_copy as *mut _
        );

        let is_copy = is_copy == JNI_TRUE;
        Ok((ptr, is_copy))
    }
}

So for me ,the UTF8 LGTM

Zheaoli avatar Jan 10 '24 17:01 Zheaoli

I think the UTF8 here is enough for .NET binding.

Can you elaborate more about this? We need to support writing data like png, pdf and parquet. Requiring users to put valid utf-8 here doesn't make sense.

Did I misunderstand something here? I also don't understand what's the relationship with java binding or jni.

Xuanwo avatar Jan 10 '24 18:01 Xuanwo

I think the UTF8 here is enough for .NET binding.

Can you elaborate more about this? We need to support writing data like png, pdf and parquet. Requiring users to put valid utf-8 here doesn't make sense.

Did I misunderstand something here? I also don't understand what's the relationship with java binding or jni.

For this issue, I think we just discuss the behavior of the string in C# or dot net platform. About string behavior, the encoding will affect all the positions that we are using a string like path, content. Which is means that the error will also be raised if the people use the Chinese in the path when we don't encode it. So I recommend UTF8 it in the previous comment.(For personal idea, I think the content is not the root cause for this issue, the string is. If we do not correct it, the dotnet binding will still unuseable in the Windows)

About the future, I agree that we need byte array for content but I think it should be in a split issue or PR that's why I'm not talking about it in this issue or I think keep content as string is enough in this issue (the begin of the issue is that we can not pass the Op = blocking_operator_construct("memory");, I think it's not about content)

Zheaoli avatar Jan 10 '24 19:01 Zheaoli

@Zheaoli, thank you for your explanation.

I mixed different APIs during the discussion of this issue. Let me make them more clean:

  • scheme in blocking_operator_construct: must be ascii like s3, memory. No other possible values.
  • path in blocking_operator_write: must be valid UTF-8 string.
  • content in blocking_operator_write: must be bytes. (I'm not sure how it represents in .NET)

Xuanwo avatar Jan 11 '24 00:01 Xuanwo

scheme in blocking_operator_construct: must be ascii like s3, memory. No other possible values. path in blocking_operator_write: must be valid UTF-8 string.

If we want to keep the same behavior for both windows and linux, we need to use 'Marshal.StringToCoTaskMemUTF8' manually. It's the recommended way from official.

Zheaoli avatar Jan 11 '24 01:01 Zheaoli

content in blocking_operator_write: must be bytes. (I'm not sure how it represents in .NET)

Agree,I will open other issue and take handle of it

Zheaoli avatar Jan 11 '24 01:01 Zheaoli

If we want to keep the same behavior for both windows and linux, we need to use 'Marshal.StringToCoTaskMemUTF8' manually. It's the recommended way from official.

Okay, I understand: Even if it contains only ASCII characters, we still need to ensure that it's encoded in UTF-8.

Xuanwo avatar Jan 11 '24 01:01 Xuanwo

Okay, I understand: Even if it contains only ASCII characters, we still need to ensure that it's encoded in UTF-8.

Yes, the encoding in windows world is always hard.

Zheaoli avatar Jan 11 '24 01:01 Zheaoli

Okay, I understand: Even if it contains only ASCII characters, we still need to ensure that it's encoded in UTF-8.

Yes, the encoding in windows world is always hard.

Just wondering if std::ffi::OsStr would work to simplify this, unfortunately I haven't used it before

fallenwood avatar Jan 11 '24 01:01 fallenwood