Current QuickUMLS (simstring) has different bytes per character on Windows vs POSIX

Open burgersmoke opened this issue 2 years ago • 3 comments

At various times, @burgersmoke and @turbosheep have noticed that if you generate QuickUMLS resources on a POSIX system (e.g. macOS, Linux) and then try to load those resources for QuickUMLS extraction on a Windows machine (or vice versa), the library runs without throwing any exceptions but extracts nothing.

Back in Dec 2020, @burgersmoke figured out that the version of simstring currently used by QuickUMLS makes platform-specific assumptions about bytes per character that differ between Windows and POSIX. This leads to a situation where "each character on the Windows data is encoded with 2 bytes whereas the Mac version encodes each character as 4 bytes". @burgersmoke had to track this down with a hex editor to get this screenshot:

[image: hex editor screenshot]
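
To make the 2-byte versus 4-byte difference concrete, here is a minimal sketch (not code from simstring or QuickUMLS). It assumes the usual platform conventions, where a 2-byte character type corresponds to what Windows writes and a 4-byte type to what macOS writes. The helper names `stored_bytes` and `bytes_per_char` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes occupied by `text` when every code unit is stored as one CharT.
template <typename CharT>
std::size_t stored_bytes(const std::basic_string<CharT>& text) {
    return text.size() * sizeof(CharT);
}

// Infer the bytes per character of a stored record from its total size,
// which is roughly what one does by eye in a hex editor.
std::size_t bytes_per_char(std::size_t total_bytes, std::size_t char_count) {
    assert(char_count > 0 && total_bytes % char_count == 0);
    return total_bytes / char_count;
}
```

For a five-character term such as "heart", `stored_bytes(std::u16string(u"heart"))` gives 10 and `stored_bytes(std::u32string(U"heart"))` gives 20, so a reader that expects one width cannot interpret records written with the other.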

I'll post more technical details in the next comment, but the short summary is:

"The implementation of simstring looks different between the quickumls_simstring implementation from GeorgetownIR (QuickUMLS people) and the original."

burgersmoke, Jul 08 '22 18:07

For more detail, below is what @burgersmoke figured out while comparing the versions of simstring between QuickUMLS and the original.

For example, the quickumls_simstring looks like this: (see: https://github.com/Georgetown-IR-Lab/simstring/blob/master/quickumls_simstring/export.cpp)

#if defined (WIN32)
#define __SIZEOF_WCHAR_T__ 2
#endif

and also

std::vector<std::string> reader::retrieve(const char *query)
{
    reader_type& dbr = *reinterpret_cast<reader_type*>(m_dbr);
    std::vector<std::string> ret;

    switch (dbr.char_size()) {
    case 1:
        retrieve_thru(dbr, query, this->measure, this->threshold, std::back_inserter(ret));
        break;
    case 2:
#if defined(__APPLE__) || defined(WIN32)
#if __SIZEOF_WCHAR_T__ == 2
        retrieve_iconv<wchar_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
#else
        assert(0);
#endif
#else
        retrieve_iconv<uint16_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
#endif
        break;
    case 4:
#if defined(__APPLE__) || defined(WIN32)
#if __SIZEOF_WCHAR_T__ == 4
        retrieve_iconv<wchar_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
#else
        assert(0);
#endif
#else
        retrieve_iconv<uint32_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
#endif
        break;
    }

    return ret;
}
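
One possible reading of the fork's code above (an inference, not something confirmed in this thread): on Apple and Windows builds, a database whose `char_size()` does not match `__SIZEOF_WCHAR_T__` reaches `assert(0)`. Since `assert` compiles to nothing when `NDEBUG` is defined, `retrieve` would then return an empty vector without raising, which would match the "no exceptions, no extraction" symptom. The sketch below models only that branch and contrasts it with an explicit exception. The names `fake_lookup`, `retrieve_assert`, and `retrieve_checked` are hypothetical and are not simstring functions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the real lookup; it always "finds" one term.
static void fake_lookup(std::vector<std::string>& out) {
    out.push_back("match");
}

// Models the fork's pattern: a width mismatch hits assert(0). With NDEBUG
// defined the assert disappears and the empty result is returned silently.
std::vector<std::string> retrieve_assert(std::size_t db_char_size,
                                         std::size_t native_wchar_size) {
    std::vector<std::string> ret;
    if (db_char_size == 1 || db_char_size == native_wchar_size) {
        fake_lookup(ret);
    } else {
        assert(0);
    }
    return ret;
}

// Alternative sketch: report the mismatch regardless of build type.
std::vector<std::string> retrieve_checked(std::size_t db_char_size,
                                          std::size_t native_wchar_size) {
    if (db_char_size != 1 && db_char_size != native_wchar_size) {
        throw std::runtime_error(
            "database was written with " + std::to_string(db_char_size) +
            " bytes per character, but this platform expects " +
            std::to_string(native_wchar_size));
    }
    std::vector<std::string> ret;
    fake_lookup(ret);
    return ret;
}
```

With this shape, loading 4-byte resources on a 2-byte platform, as in `retrieve_checked(4, 2)`, throws a `std::runtime_error` instead of quietly returning nothing.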

Meanwhile, the current version of the "base" simstring implementation (later than the GeorgetownIR fork) looks like this (see: https://github.com/chokkan/simstring/blob/master/swig/export.cpp)

std::vector<std::string> reader::retrieve(const char *query)
{
    reader_type& dbr = *reinterpret_cast<reader_type*>(m_dbr);
    std::vector<std::string> ret;

    switch (dbr.char_size()) {
    case 1:
        retrieve_thru(dbr, query, this->measure, this->threshold, std::back_inserter(ret));
        break;
    case 2:
#if defined(__apple_build_version__)
        throw std::runtime_error("UTF16 not supported in macOS, due to compatibility issues with libc++.");
#else
        retrieve_iconv<uint16_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
#endif
        break;
    case 4:
#if defined(__apple_build_version__)
        throw std::runtime_error("UTF32 not supported in macOS, due to compatibility issues with libc++.");
#else
        retrieve_iconv<uint32_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
#endif
        break;
    }

    return ret;
}
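
The base implementation dispatches on `uint16_t` and `uint32_t` instead of `wchar_t`. As a small illustration of why that matters (a sketch, assuming a platform with 8-bit bytes; `char_width` is a hypothetical helper, not part of simstring): the fixed-width types have the same size everywhere, while the size of `wchar_t` is implementation-defined, typically 2 bytes on Windows and 4 bytes on Linux and macOS.

```cpp
#include <cstddef>
#include <cstdint>

// Width in bytes of one character for a given character type.
template <typename CharT>
constexpr std::size_t char_width() {
    return sizeof(CharT);
}

// Fixed-width types give the same answer on every platform with 8-bit bytes.
static_assert(char_width<std::uint16_t>() == 2, "uint16_t is 2 bytes");
static_assert(char_width<std::uint32_t>() == 4, "uint32_t is 4 bytes");

// wchar_t depends on the platform, so this constant differs between builds.
constexpr bool wchar_is_two_bytes = (char_width<wchar_t>() == 2);
```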

burgersmoke, Jul 08 '22 18:07

I really, really wanted to fix this one with our own medspacy fork of QuickUMLS 2.6 and with medspacy 1.0.0, but I have not been able to fix it yet.

burgersmoke, Oct 18 '22 22:10

After we release medspacy 1.1.0 (hopefully by EOD today), I will re-test this with pysimstring to see whether Windows and other platforms still differ, with one writing 2 bytes per character and the other 4. If they no longer differ, we can release just one version of the QuickUMLS resources. From reading the code, it is unclear to me whether this is fixed in pysimstring's export.cpp, which is here:

https://github.com/percevalw/pysimstring/blob/master/pysimstring/export.cpp

burgersmoke, May 05 '23 21:05