medspacy
Current QuickUMLS (simstring) has different bytes per character on Windows vs POSIX
At various times, @burgersmoke and @turbosheep have noticed that if you generate QuickUMLS resources on a POSIX system (e.g., macOS or Linux) and then try to load those resources for QuickUMLS extraction on a Windows machine (or vice versa), the library runs and throws no exceptions, but nothing is extracted.
Back in Dec 2020, @burgersmoke figured out that the version of simstring currently used by QuickUMLS makes some bad assumptions about bytes per character between Windows and POSIX. This leads to a situation where "each character on the Windows data is encoded with 2 bytes whereas the Mac version encodes each character as 4 bytes". @burgersmoke had to track this down with a hex editor to get this screenshot:
I'll post more technical details in the next comment, but the short summary is:
"The implementation of simstring looks different between the quickumls_simstring implementation from GeorgetownIR (QuickUMLS people) and the original."
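To make the 2-byte versus 4-byte observation concrete, here is a minimal Python sketch of what a hex editor would show for the same term stored at each width. It assumes the two layouts correspond to little-endian UTF-16 and UTF-32 (the byte order is an assumption), and the helper names `hex_view` and `bytes_per_char` are hypothetical, not part of QuickUMLS or simstring.

```python
def hex_view(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


def bytes_per_char(text: str, encoding: str) -> int:
    """Stored width per character; exact only for ASCII-only text."""
    return len(text.encode(encoding)) // len(text)


if __name__ == "__main__":
    term = "flu"  # hypothetical sample term, not taken from UMLS
    for encoding in ("utf-16-le", "utf-32-le"):
        print(encoding, bytes_per_char(term, encoding), hex_view(term, encoding))
```

A reader that expects one of these layouts would not recognize strings written in the other, which would be consistent with lookups that run cleanly but return nothing.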
For more detail, below is what @burgersmoke figured out while comparing the versions of simstring between QuickUMLS and the original.
For example, the quickumls_simstring looks like this (see: https://github.com/Georgetown-IR-Lab/simstring/blob/master/quickumls_simstring/export.cpp):

    #if defined (WIN32)
    #define __SIZEOF_WCHAR_T__ 2
    #endif

and also

    std::vector<std::string> reader::retrieve(const char *query)
    {
        reader_type& dbr = *reinterpret_cast<reader_type*>(m_dbr);
        std::vector<std::string> ret;
        switch (dbr.char_size()) {
        case 1:
            retrieve_thru(dbr, query, this->measure, this->threshold, std::back_inserter(ret));
            break;
        case 2:
    #if defined(__APPLE__) || defined(WIN32)
    #if __SIZEOF_WCHAR_T__ == 2
            retrieve_iconv<wchar_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
    #else
            assert(0);
    #endif
    #else
            retrieve_iconv<uint16_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
    #endif
            break;
        case 4:
    #if defined(__APPLE__) || defined(WIN32)
    #if __SIZEOF_WCHAR_T__ == 4
            retrieve_iconv<wchar_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
    #else
            assert(0);
    #endif
    #else
            retrieve_iconv<uint32_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
    #endif
            break;
        }
        return ret;
    }

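The branching in this function is easier to follow when traced for a specific platform. The Python sketch below mirrors the quoted `switch` and reports which branch would run. It is a model for reasoning only, not part of QuickUMLS or simstring, and `retrieve_branch` and its parameters are hypothetical names. One possible reading (an assumption, not confirmed in this thread) is that a database written with 4-byte characters reaches `assert(0)` on Windows; if the extension is compiled with `NDEBUG`, that assert is a no-op and `retrieve` returns an empty vector without raising.

```python
def retrieve_branch(char_size: int, platform: str, sizeof_wchar_t: int) -> str:
    """Name the branch of the quoted switch statement that would execute.

    char_size      -- value of dbr.char_size() stored in the database
    platform       -- "win32", "apple", or anything else (e.g. "linux")
    sizeof_wchar_t -- value of __SIZEOF_WCHAR_T__ at compile time
    """
    if char_size == 1:
        return "retrieve_thru"
    if char_size not in (2, 4):
        return "no case matched"
    bits = char_size * 8  # 2 bytes -> UTF16, 4 bytes -> UTF32
    if platform in ("apple", "win32"):
        # These platforms only handle the width that matches wchar_t.
        if sizeof_wchar_t == char_size:
            return f"retrieve_iconv<wchar_t> UTF{bits}"
        return "assert(0)"
    # Other platforms use fixed-width integer types for both widths.
    return f"retrieve_iconv<uint{bits}_t> UTF{bits}"


if __name__ == "__main__":
    # Database written with 4-byte characters, read on Windows.
    print(retrieve_branch(4, "win32", 2))
    # Database written with 2-byte characters, read on macOS.
    print(retrieve_branch(2, "apple", 4))
```

In this model, Windows and macOS builds each handle only the character width that matches their own `wchar_t`, while other platforms handle both.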
Meanwhile, the current version of the "base" simstring implementation (more recent than the GeorgetownIR fork) looks like this (see: https://github.com/chokkan/simstring/blob/master/swig/export.cpp):

    std::vector<std::string> reader::retrieve(const char *query)
    {
        reader_type& dbr = *reinterpret_cast<reader_type*>(m_dbr);
        std::vector<std::string> ret;
        switch (dbr.char_size()) {
        case 1:
            retrieve_thru(dbr, query, this->measure, this->threshold, std::back_inserter(ret));
            break;
        case 2:
    #if defined(__apple_build_version__)
            throw std::runtime_error("UTF16 not supported in macOS, due to compatibility issues with libc++.");
    #else
            retrieve_iconv<uint16_t>(dbr, query, UTF16, this->measure, this->threshold, std::back_inserter(ret));
    #endif
            break;
        case 4:
    #if defined(__apple_build_version__)
            throw std::runtime_error("UTF32 not supported in macOS, due to compatibility issues with libc++.");
    #else
            retrieve_iconv<uint32_t>(dbr, query, UTF32, this->measure, this->threshold, std::back_inserter(ret));
    #endif
            break;
        }
        return ret;
    }

I really, really wanted to fix this one with our own medspacy fork of QuickUMLS 2.6 and with medspacy 1.0.0, but I have not been able to fix it yet.
After we release medspacy 1.1.0 (hopefully by EOD today), I will re-test this with pysimstring to see if there is still a difference between Windows and other platforms, where one writes 2 bytes per character and the other 4. If the difference is gone, we can release just one version of the QuickUMLS resources. Looking at the code, it is unclear to me whether this is fixed in pysimstring's export.cpp, which is here:
https://github.com/percevalw/pysimstring/blob/master/pysimstring/export.cpp
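
For the re-test, it may help to record the native `wchar_t` width on each machine alongside the generated resources. This is a minimal sketch using only the Python standard library. `native_wchar_width` is a hypothetical helper, and whether pysimstring's on-disk format actually follows this width is exactly the open question.

```python
import ctypes
import sys


def native_wchar_width() -> int:
    """Size in bytes of the C wchar_t type on the current platform."""
    return ctypes.sizeof(ctypes.c_wchar)


if __name__ == "__main__":
    print(f"{sys.platform}: wchar_t is {native_wchar_width()} bytes")
```

Typically this reports 2 on Windows and 4 on Linux and macOS, matching the 2-byte and 4-byte layouts described earlier in this thread.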