Windows Build
Is there any provision to build this on windows?
No word yet. I have no Windows machines and am likely not going to have the time/resources to work on a Windows build any time soon, but am happy to accept patches.
Alright, thanks :)
The sparkey and snappy dependencies were supposed to have been removed for v.1.0, which would make a Windows build a lot easier, but the sparkey source directory is still present. Is it no longer compiled when building?
Sparkey is still included in the source but it's simply ignored in the Makefile. Prefer not to delete it from the repo for now as the geodb module might be able to be reintroduced later (just not part of the main make all directive).
Got it, thanks. I am going to attempt to actually port to Windows, not merely build in MinGW. If I succeed, I'll write about it here.
I've changed my mind about porting - I've found out that it is often possible to link a Visual Studio project to .a files built with MinGW because C does not do name mangling (or rather, g++ and cl don't for C) and has a standard ABI. So I'm trying to go that route. I'll write about issues doing this here.
There is a POSIX function strndup() in a few places that needs to be replaced with something else since MinGW apparently doesn't have it. While doing this, I found in transliterate.c there is
str = strndup(str, len) in function transliterate.
Not only does this appear to be redundantly copying a string back into itself, but strndup does a malloc to create the buffer to copy into and returns a pointer to it, so wouldn't the original buffer pointed to by str then become a dangling pointer when str takes strndup's return value?
What should I put in place of this strdup?
It is not copying a string redundantly and that is not how passing pointers into functions works in C. After str = strndup(str, len), str is pointing to a newly allocated block of memory or NULL and doesn't modify the original pointer. Passing a pointer as a function argument in C means passing a copy of said pointer, so setting it to point to something else is limited to the scope of said function. To modify the original pointer in a way that's externally visible, it would need to be passed in as a double-pointer so the function can modify the contents of the memory address of the original. It's easy to verify with valgrind that there are no memory leaks or dangling pointers.
strndup is used because we may only be transliterating n bytes of the original string (strings may have multiple scripts, say a Han portion and a Latin portion which use different transliterators).
libpostal's config.h should define HAVE_STRNDUP as part of the configure step, so you can check for that and and define your own implementation (here's one for example: https://github.com/clibs/strndup/blob/master/strndup.c) if it is not available. Or it can be replaced by using libpostal's dynamic char_array with e.g. char_array_cat_len.
For adding a new C file, it would need to be added to the various _SOURCES lists in src/Makefile.am. For strndup, it might be sufficient to define something header-only like:
#ifndef STRDUP_H
#define STRDUP_H
#ifndef HAVE_STRNDUP
#define HAVE_STRNDUP
#include <stdlib.h>
#include <string.h>
static char *strndup(const char *s, size_t n)
{
char* new = malloc(n+1);
if (new) {
strncpy(new, s, n);
new[n] = '\0';
}
return new;
}
#endif /* HAVE_STRNDUP */
#endif /* STRDUP_H */
Then it can just be included as a header without needing to add to the Makefile. Or could put the inner ifdef directly in a header like string_utils.h. For simple functions like that, it might be worthwhile to just define them if they're not available.
For compilation, there are essentially 3 steps in an autotools project:
./bootstrap.sh: generatesconfigurescript fromconfigure.ac./configure: generatesMakefilefromMakefile.am, as well asconfig.hwhich specifies what's available on the system e.g. for conditional includesmake: actual compilation
Might need to run the first two steps before it picks up the drand48 addition.
AFAICT, projects that build on Windows usually use Visual Studio. One example is something like Google's gumbo-parser which appears to have an Appveyor build.
I can now successfully build libpostal.dll.a. Seems to me that now that I've got libpostal.dll.a I should be able to use the API. I want to know if any of the rest of the makefile besides the download step is necessary for using the API. In other words, can I skip from
noinst_PROGRAMS = ...
until
pkginclude_HEADERS = libpostal.h?
Some porting may be necessary to get the programs to build and being able to use the API is my primary concern right now.
Yes, it's possible to skip all of the noinst_PROGRAMS. Would need to port the download script libpostal_data for Windows as well. It's a *nix bash script using curl.
I actually don't need to port the download script. I'm building in MinGW, remember.
One thing I encountered is that we need to __declspec(dllexport) whatever functions are supposed to be exposed to client code. So far I've done DLLEXPORT for the ones that are necessary for a successful build. C doesn't have access specifiers but it does have static to make functions local to a translation unit so I'm guessing all non-static functions in files in /src should be exported?
Fairly sure it's just the functions in the public header libpostal.h
Had to add some DLLEXPORTs to the functions called in the test source files. I got build errors if I didn't. I'm leaving them for now, but we'll have to come up with something if these aren't supposed to be client-facing functions.
Now I've got a DLL and a program that successfully builds and links to it, but fails during libpostal_setup(). [31mERR[39m Error loading transliteration module, dir=(null) [90m at libpostal_setup_datadir (libpostal.c:1047) [94merrno: No such file or directory[39m
(I'm running a 64 bit DLL, which might explain the "interesting' characters)
I do have the /opt/libpostal_data/libpostal directory with the correct subfolders and downloaded .dat files. It may not be in the right place, though - I had to move it to its current location because the script making this directory and downloading to there actually treats the MSYS2 home directory as if it were the home directory of the machine. So that will have to be edited.
Some of the unit tests are for internal functions not exposed in the public header file (sometimes even for functions not exposed in an internal header file). One way to test functions like that, and also avoid adding dependencies to a test Makefile, is to simply #include the .c files. Not sure what effect that would have on Windows, but might need a few extra DLLEXPORTs.
For the setup issue, you'll need to define LIBPOSTAL_DATA_DIR during compilation (quoted as a C string literal, see below). In the autotools setting, we use whatever the user specified with --datadir (an existing autoconf variable, default is /usr/local/share on most *nix systems) and add a subdirectory called "libpostal". So if I specified --datadir=/data, the fully-qualified data dir would be "/data/libpostal".
This is implemented with CFLAGS in Makefile.am:
CFLAGS_BASE = -Wall -Wextra -Wno-unused-function -Wformat -Werror=format-security -Winit-self -Wno-sign-compare -DLIBPOSTAL_DATA_DIR='"$(datadir)/libpostal"' -g $(CFLAGS_CONF)
The Windows executable (built in Visual Studio) will think that /opt/libpostal_data is in the same directory as itself. But copying the data directory to the executable's directory isn't helping and unfortunately, the DLL built in MinGW has no debug information (Windbg can't handle debug info built in gcc) so I can't step through at runtime to find out exactly where the system is looking.
Oddly changing the build script to do
mkdir -p /c/libpostal/libpostal_data
./configure --datadir=/c/libpostal/libpostal_data
did place the data directory in C:\libpostal\libpostal_data and downloaded to it, but it also had the effect that libpostal_data and libpostal-1.dll were built in
mingw64/bin
and libpostal.a, libpostal.dll.a, and libpostal.la were built in in mingw64/lib and they weren't updated in the directory where the Makefile is. it also did not compile/link anything and went straight to the download step.
Still can't load transliteration module.
I'm really not versed in MinGW or Windows compilation at all, so it might be better to ask some of these questions on a MinGW/Windows-specific forum. Does it convert the forward slashes in the path to backslashes?
If that's the case, it's possible that MinGW only does that conversion for path arguments like --prefix and --datadir. If you're using autoconf/autotools and the original Makefile.am, try searching the generated Makefile for "LIBPOSTAL_DATA_DIR". Just speculating, but it could be that the forward slash in the CFLAGS argument is the problem. You'll have to look at the final compiler commands issued to know for sure. The setup functions depend on LIBPOSTAL_DATA_DIR being defined and a valid directory that can be read by the current user. You could also try removing the "-DLIBPOSTAL_DATA_DIR=..." from the CFLAGS in Makefile.am and hardcoding a value in config.h or libpostal.h just to see if you can get it working.
You're right, the issue had something to do with the "-DLIBPOSTAL_DATA_DIR=..." from the CFLAGS in Makefile.am. The problem is that even if the slashes are converted, the right separator on Windows is a double backslash, which "escapes" to a single slash.
Right now I am "fixing" this by hardcoding my path in libpostal_config.h:
#define LIBPOSTAL_DATA_DIR "C:\\libpostal\\libpostal_data\\libpostal"
(actually using four-slash separators, see below)
For a real fix, it looks like we can just write the path as above for "-DLIBPOSTAL_DATA_DIR=..." in the makefile. Alternatively, define another autoconf variable, say $(datadir2), that is the same as $(datadir) except it has double backslashes and use that in the makefile in place of $(datadir). But I haven't tried these fixes yet.
With this, libpostal_setup() completes normally but libpostal_setup_parser() chokes.
It looks like the PATH_SEPARATOR macro '\\' turns into a single backslash, which is wrong when using fopen. So really fopen should get double backslashes, which means the macros should have quadruple backslashes.
~~However, even definingPATH_SEPARATOR as "\\\\" and LIBPOSTAL_DATA_DIR with four-slash separators still leads to failure even though the file path strings now have two-slash separators (which escape to single slashes) and even though I am able to open files in the parser directory.
For example, in my driver program I can do~~
FILE *file;
file = fopen("C:\\libpostal\\libpostal_data\\libpostal\\address_parser\\address_parser_vocab.trie", "r");
if (file) ....
~~But trie_load() in the setup_address_parser() call heirarchy fails even though the argument to its fopen call is exactly the same.~~
~~That's weird.~~
UPDATE: fopen is fine. The issue is really a fread failure. See next post.
Upon closer examination, I misdiagnosed this. fopen() works. By the way, a forward slash as a path delimiter works on Windows, too.
What's actually happening during the loading of address_parser_vocab.trie is that the call to trie_load(vocab_path) in address_parser_load returns NULL because the call to if (!file_read_chars(file, (char *)alphabet, alphabet_size)) in trie_read fails. file_read_chars is implemented in terms of the C stdlib function fread. Adding print statements and checking feof reveals that fread in this particular instance is supposed to read 256 characters but prematurely reaches EOF at only 227. I don't know why. This function call - the one expecting to read 256 characters - happens several times before this through different code paths without issue. I used a fresh download of the file to make sure my copy wasn't corrupt.
@BenK10 here are the expected sizes for each file in the data dir for the current model (the sizes will change from build to build):
| file | size |
|---|---|
| address_expansions/address_dictionary.dat | 8593456 |
| address_parser/address_parser_crf.dat | 1014429504 |
| address_parser/address_parser_phrases.dat | 137484586 |
| address_parser/address_parser_postal_codes.dat | 621588636 |
| address_parser/address_parser_vocab.trie | 99219597 |
| language_classifier/language_classifier.dat | 77823270 |
| numex/numex.dat | 396966 |
| transliteration/transliteration.dat | 19570085 |
Can you confirm that all of the files are the correct size?
Yes, they are the correct size.
I've found the problem. Get this:
There are 2 calls to file_read_uint32 before the call to file_read_chars.
Those two calls each read 4 bytes, as shown in this line in the function definition for file_read_uint32:
if (fread(buf, 4, 1, file) == 1)
In address_parser_vocab.trie, a hex editor shows that there is a 0x1a character at byte 235.
That character is an EOF character in Windows (this dates back to the CP/M operating system, of which DOS was a successor).
We're now 8 bytes in, and 235-8 = 227, which explains why fread stopped after 227 bytes.
The call to fopen in trie_load is in "text mode":
file = fopen(path, "r");
Using fopen in binary mode fopen(path, "rb") here and in address_parser_load fixes this
and now it WORKS!
I'll post the changes and a readme soon. thanks for helping me through this!
Get the changed files and readme here: https://github.com/BenK10/libpostal_windows
You might want to make a separate branch. I didn't put all the source in my repo, so instead of forking my repo, copy master into another branch and then merge my changes into that branch. There are still some extra DLLEXPORTs in some files.
I hope this is useful for Windows users.
@BenK10 you are now the hero my coworkers have been waiting for.
@janviparikh Try installing Docker for windows and run an instance of Docker container.
Hello everyone!
I'm digging up the subject. Recently, I tried to compile the project under Windows without going through MSys2 and here are the things I remark:
- Build system (autoconf) is highly specific to posix environment. I rewrote part with CMake, but I know this is not ideal since you either need to install it on the system you're building for or either you'll need a solid C++ compiler if you don't have pre-built binaries for it. And I don't think that supporting two build systems is a good idea either ...
- Libpostal uses the GCC extension on the inline keyword in definition files GNU89 vs C99, which is a problem when trying to compile with a another compiler such as clang or msvc. We could solve this issue by introducing a macro depending of the compiler used.
- MSVC support of C99 is "partial", and specifically, the VLA are not supported from scratch. There are fake symbols defined (strdup) and some extensions of unix are also not available as such (strncasecmp). So I don't think we could ever use MSVC ...
- Platform specific header, there are "unistd.h", "sys/*.h" which seemed to be not really required ? The main issues were regarding: "direct.h" (for which there are Windows implementation) and the scanner which defines symbol through macros which collide to (shitty) Windows header (but if you rearrange the header, the issue can be overcome).
- utf8proc you'll need to update it to a more recent version.
Otherwise, it can compile fine with Clang on Windows: https://github.com/Gawaboumga/libpostal
Would you be ready to make some changes to the project (especially in regard to the build system) to support Clang and Windows?
Yours sincerely.
@Gawaboumga Do you have windows build instructions for your fork?
To answer "simply", just create a folder "build", put yourself in it and do "cmake .. -T ClangCL", it will generate the VS solution using clang compiler.
But there is NO guarantee that the behavior is the one expected since this code was intentionally written to work only on posix. At first glance, the behaviors seem consistent, but I don't think it's production-ready. The easiest solution is to go and get the artifacts generated by the CICD, if you don't need to modify the code.
if you are looking for an unofficial python binding in windows, please use this python package. It can work very well in windows
https://github.com/selva221724/pypostalwin