
In Notepad++, Code Alignment does not work on Unicode Glyphs

Open cpmcgrath opened this issue 10 years ago • 6 comments

UNICODE GLYPH POINTS
… Elipsis.Alt +0133
⸘⸘ Upside-Down Interrobang.Alt +2E18
test test.test
test0 test0.test0

Aligning by space on the above leaves the second line two columns short and the third line four columns short, so each glyph is being treated as three characters long instead of one.
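The "three characters per glyph" observation matches the size of these glyphs in UTF-8: both … (U+2026) and ⸘ (U+2E18) take three bytes each. The sketch below assumes the aligner is measuring bytes rather than characters and shows the shortfall that would produce. The `GlyphWidths` class and its method names are hypothetical and not part of the plugin.

```csharp
using System;
using System.Text;

static class GlyphWidths
{
    // Number of bytes the text occupies when stored as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Columns of padding lost if width is measured in UTF-8 bytes
    // instead of UTF-16 code units (one per character for these glyphs).
    public static int Shortfall(string text)
    {
        return Utf8ByteCount(text) - text.Length;
    }

    static void Main()
    {
        Console.WriteLine(Shortfall("\u2026"));       // 2 (one ellipsis)
        Console.WriteLine(Shortfall("\u2E18\u2E18")); // 4 (two inverted interrobangs)
        Console.WriteLine(Shortfall("test"));         // 0 (ASCII only)
    }
}
```

The values 2 and 4 line up with the "2 short" and "4 short" results reported above.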

Worth noting: some fixed-width fonts don't seem to render these glyphs at the fixed character width.

If possible this should be fixed.

cpmcgrath, Jan 12 '15 21:01

I've a feeling this is actually a limitation of the Notepad++ API or the C# wrapper I use to access it.

My logic is pretty simple: the Text property in Line.cs makes the call. https://github.com/cpmcgrath/codealignment/blob/master/CodeAlignment.Npp/Implementations/Line.cs

public string Text
{
    get
    {
        var start   = (int)Win32.SendMessage(m_docPointer, SciMsg.SCI_POSITIONFROMLINE,   m_lineNo, 0);
        var end     = (int)Win32.SendMessage(m_docPointer, SciMsg.SCI_GETLINEENDPOSITION, m_lineNo, 0);
        var builder = new StringBuilder(end - start + 1);
        Win32.SendMessage(m_docPointer, SciMsg.SCI_GETLINE, m_lineNo, builder);
        return builder.ToString();
    }
}

And the Imports are in NppPluginNETHelper.cs https://github.com/cpmcgrath/codealignment/blob/master/CodeAlignment.Npp/NppPluginNETHelper.cs

[DllImport("user32")]
public static extern IntPtr SendMessage(IntPtr hWnd, SciMsg Msg, int wParam, [MarshalAs(UnmanagedType.LPStr)] StringBuilder lParam);

The MarshalAs might be what's causing the problem.
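One way to take the marshaller's text conversion out of the picture is to pass a `byte[]` buffer and decode it explicitly. The following is only a sketch under stated assumptions: the document is stored as UTF-8 inside Scintilla, the class and method names are hypothetical, and the message constants are stand-ins that should be checked against the plugin's own `SciMsg` enum. `SCI_LINELENGTH` is used for the buffer size because `SCI_GETLINE` copies the end-of-line characters too and does not NUL-terminate the buffer.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class ScintillaLineReader
{
    // Stand-in constants (assumed values; verify against SciMsg / Scintilla.h).
    const uint SCI_GETLINE    = 2153;
    const uint SCI_LINELENGTH = 2350;

    // Hypothetical overloads: the buffer is raw bytes, so no ANSI-to-char
    // conversion happens on the way back from the native call.
    [DllImport("user32")]
    static extern IntPtr SendMessage(IntPtr hWnd, uint msg, IntPtr wParam, [In, Out] byte[] lParam);

    [DllImport("user32")]
    static extern IntPtr SendMessage(IntPtr hWnd, uint msg, IntPtr wParam, IntPtr lParam);

    // Reads one line from a Scintilla window (Windows only, needs a valid handle).
    public static string ReadLine(IntPtr scintilla, int lineNo)
    {
        // Length in bytes, including any end-of-line characters.
        int length = (int)SendMessage(scintilla, SCI_LINELENGTH, (IntPtr)lineNo, IntPtr.Zero);
        var buffer = new byte[length];
        SendMessage(scintilla, SCI_GETLINE, (IntPtr)lineNo, buffer);
        return DecodeLine(buffer);
    }

    // Decodes the raw line bytes as UTF-8 and drops trailing CR/LF.
    public static string DecodeLine(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw).TrimEnd('\r', '\n');
    }
}
```

`DecodeLine` is kept separate so the decoding step can be exercised without a Scintilla window. If the document is not UTF-8, a different `Encoding` would be needed.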

cpmcgrath, Jan 12 '15 22:01

Still to test, but I think it could be as simple as changing [MarshalAs(UnmanagedType.LPStr)] to [MarshalAs(UnmanagedType.LPWStr)]

cpmcgrath, Jan 12 '15 22:01

No, that makes it return gibberish.

cpmcgrath, Jan 12 '15 22:01

The rules to detect Unicode seem quite simple, but I'm doing something wrong. SCI_GETLINE returns a StringBuilder where each character represents a byte. The rule for detecting a multi-byte UTF-8 character is that the first byte is between 0xC0 and 0xFD (0xC2 to 0xF4 in current UTF-8) and each subsequent byte is between 0x80 and 0xBF.
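As a rough illustration of the byte rule just described, characters in well-formed UTF-8 can be counted by skipping continuation bytes (those of the form 10xxxxxx, i.e. 0x80 to 0xBF). This is a minimal sketch that assumes the input really is valid UTF-8 and does no validation. `Utf8Counter` is a hypothetical name.

```csharp
using System;

static class Utf8Counter
{
    // Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Counts code points by counting every byte that starts a sequence.
    // Assumes well-formed UTF-8; malformed input is not detected.
    public static int CountCharacters(byte[] utf8)
    {
        int count = 0;
        foreach (byte b in utf8)
        {
            if (!IsContinuationByte(b))
                count++;
        }
        return count;
    }

    static void Main()
    {
        // E2 80 A6 is the ellipsis; 43 6C C3 A9 is "Clé" with a precomposed é.
        Console.WriteLine(CountCharacters(new byte[] { 0xE2, 0x80, 0xA6 }));       // 1
        Console.WriteLine(CountCharacters(new byte[] { 0x43, 0x6C, 0xC3, 0xA9 })); // 3
    }
}
```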

But for the ellipsis above (…), a binary viewer shows the file contains 0xE2 0x80 0xA6, while the codes passed to me are 0xE2 0xAC 0xA6. The fact that I can detect the correct number of bytes in the character should be enough to fix this, but I don't know if I'm comfortable releasing it like that.
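One possible explanation for the 0x80 to 0xAC change, assuming the system ANSI code page is Windows-1252 and the codes were read by truncating each char to a byte: Windows-1252 decodes byte 0x80 as the euro sign U+20AC, whose low byte is 0xAC. The sketch below hard-codes only the Windows-1252 mappings it needs, so it is a partial decoder for illustration, not a general one. All names are hypothetical.

```csharp
using System;

static class AnsiRoundTrip
{
    // Partial Windows-1252 decoder covering only the bytes used here.
    public static char DecodeCp1252(byte b)
    {
        // 0x00..0x7F and 0xA0..0xFF map to the code point of the same value.
        if (b < 0x80 || b >= 0xA0)
            return (char)b;
        switch (b)
        {
            case 0x80: return '\u20AC'; // euro sign
            case 0x98: return '\u02DC'; // small tilde
            default: throw new NotSupportedException("Byte not covered by this sketch.");
        }
    }

    // Decodes each byte as Windows-1252, then keeps only the low 8 bits of each char.
    public static byte[] RoundTrip(byte[] raw)
    {
        var result = new byte[raw.Length];
        for (int i = 0; i < raw.Length; i++)
            result[i] = unchecked((byte)DecodeCp1252(raw[i]));
        return result;
    }

    static void Main()
    {
        // Ellipsis: the middle byte changes but still looks like a continuation byte.
        Console.WriteLine(BitConverter.ToString(RoundTrip(new byte[] { 0xE2, 0x80, 0xA6 }))); // E2-AC-A6
        // Inverted interrobang: the last byte becomes 0xDC, outside 0x80..0xBF.
        Console.WriteLine(BitConverter.ToString(RoundTrip(new byte[] { 0xE2, 0xB8, 0x98 }))); // E2-B8-DC
    }
}
```

If this is what is happening, counting bytes on the converted text would work for … only by coincidence. For ⸘ the last byte would no longer fall in the continuation range, which supports the hesitation about releasing that fix.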

cpmcgrath, Jan 30 '15 04:01

C# has Encoding.Unicode.GetString(byte[]), but it was just giving me garbage.
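A likely reason for the garbage is that `Encoding.Unicode` in .NET means UTF-16 little-endian, while the bytes in question are UTF-8. A short sketch of the difference, assuming the input is the raw, unconverted bytes of the ellipsis (`DecodeDemo` is a hypothetical name):

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    // Interprets the bytes as UTF-8.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }

    // Interprets the same bytes as UTF-16LE, which is what Encoding.Unicode does.
    public static string DecodeAsUtf16(byte[] raw)
    {
        return Encoding.Unicode.GetString(raw);
    }

    static void Main()
    {
        byte[] ellipsis = { 0xE2, 0x80, 0xA6 };
        Console.WriteLine(DecodeAsUtf8(ellipsis) == "\u2026");  // True
        Console.WriteLine(DecodeAsUtf16(ellipsis) == "\u2026"); // False
    }
}
```

This only helps if the bytes reach the decoder unmodified. Bytes that have already been converted through an ANSI code page would not decode correctly with either encoding.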

cpmcgrath, Jan 30 '15 04:01

Another example:

Hystérie Connective=3:09
Ghetto=2:41
Clé De Contact=2:50

After aligning by equals, it becomes:

Hystérie Connective =3:09
Ghetto               =2:41
Clé De Contact      =2:50

I guess you haven't applied Unicode normalization. Normalizing to NFC should fix some of these issues.
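Whether NFC helps depends on how the é is stored, which the example does not show. The sketch below compares the two possibilities: a decomposed é (e followed by U+0301) is two code units and NFC folds it into one, while a precomposed é (U+00E9) is already one code unit but still two bytes in UTF-8. The one-column shortfall above is consistent with either case, so this is an illustration rather than a diagnosis. `NormalizationDemo` is a hypothetical name.

```csharp
using System;
using System.Text;

static class NormalizationDemo
{
    // Normalizes to NFC (canonical composition).
    public static string ToNfc(string text)
    {
        return text.Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        string precomposed = "Cl\u00E9";  // é as a single code point
        string decomposed  = "Cle\u0301"; // e followed by a combining acute accent

        Console.WriteLine(decomposed.Length);                       // 4
        Console.WriteLine(ToNfc(decomposed).Length);                // 3
        Console.WriteLine(precomposed.Length);                      // 3
        Console.WriteLine(Encoding.UTF8.GetByteCount(precomposed)); // 4
    }
}
```

If the text is already precomposed, normalization leaves it unchanged and a width measured in UTF-8 bytes would still be one too many per é.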

phuclv90, Sep 24 '20 02:09