stata-parquet
Install stata-parquet via `conda`
Given that conda turns out to be the best way to get this working across computers, it would be great if users could install this via
conda install stata-parquet
The "easy" way to do this, without making a full package, is just to define the dependencies in a file. Then you can do conda env create -f environment.yml
Since the build artifacts really need to be compiled on each platform, it might be helpful for build/ to be in gitignore?
And the lib/plugin/?
Also, don't know if this is on purpose, but the make step creates a second set of build artifacts within build/parquet/:
(stata-parquet) kyle at desktop in .../stata/stata-parquet/build on master ✗ [83bd3b3] 22:29
> tree
.
├── changelog.md
├── parquet
│ ├── changelog.md
│ ├── parquet.ado
│ ├── parquet.pkg
│ ├── parquet.sthlp
│ ├── parquet_unix.plugin
│ └── stata.toc
├── parquet.ado
├── parquet.pkg
├── parquet.sthlp
├── parquet_tests.do
├── parquet_unix.plugin
└── stata.toc
1 directory, 13 files
Ah, that's interesting... Yes, in this case I won't be distributing the plugin with the package. Build is also where I put the ado and help files for install from GitHub. I wonder if this will ever be installable that way... I guess not. Mmm.
The parquet subfolder is for the zip step. I need to make sure it's not created if that step isn't run.
I see, well maybe not the binary file at least. The repo is already 3.3 MB in size, at least that much when compressed in transfer from Github during git clone.
I did some pruning; the repo is down to 400KiB or so. The post where I got the commands said you should rebase and not merge or you could re-introduce some of the old files.
Great! Yeah, I did that.
@kylebarron Is there a way to fix the arrow/parquet version using conda? I assume so but I don't know what it is. This stopped compiling on my machine, which I assume has to do with versioning. The conda method still works, however. It'd be nice to fix it to a particular version so it doesn't break in the future...
You can set versions in environment.yml
https://github.com/kylebarron/cookiecutter-data-science/blob/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/environment.yml
Can you check your package versions? I'm planning to fix them like so:
name: stata-parquet
channels:
- conda-forge
dependencies:
- arrow==0.12.1
- arrow-cpp==0.11.1
- boost==1.68.0
- gcc==4.8.5
- parquet-cpp==1.5.1
I forgot my laptop at the office and so can't check.
I think arrow is completely unrelated and unnecessary. I think only arrow-cpp is needed.
These versions match what I currently have:
dependencies:
- arrow-cpp==0.11.1
- boost==1.68.0
- gcc==4.8.5
- parquet-cpp==1.5.1
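For reference, the pinned versions above can be assembled into a complete environment.yml. This is only a sketch: the environment name and channel come from the draft earlier in the thread, and the conda commands are left commented out because they need network access and an existing conda install.

```shell
# Write the pinned dependency list (versions as reported in this thread).
cat > environment.yml <<'EOF'
name: stata-parquet
channels:
  - conda-forge
dependencies:
  - arrow-cpp==0.11.1
  - boost==1.68.0
  - gcc==4.8.5
  - parquet-cpp==1.5.1
EOF

# Then create and activate the environment:
# conda env create -f environment.yml
# conda activate stata-parquet   # or `source activate stata-parquet` on older setups
```

Pinning exact versions trades easy upgrades for reproducibility, which seems to be the goal here since a newer arrow-cpp is what broke the build.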
Did you get segfaults with arrow-cpp 0.12.0? Mohan did...
I installed python and pyarrow for my thing above. I tried using the plugin again and I get a symbol lookup error, but not segfaults. I thought arrow was unrelated and arrow-cpp was the relevant one?
(I'll try a clean install in a bit; he's also just on the aging servers, right?)
Yes, meant to write arrow-cpp.
He's currently working on nberX but yes
Just did a clean install on nber1. Segfault. Weird. Will debug later.
I wonder if the segfault from conda is because the libraries, etc. were installed system-wide. I've been debugging, and it seems that the segfault appears whenever I try to write to an arrow table, both with the low level and high level reader.
One issue at a time: Even from conda I was not able to load the plugin on nber1. It seems the culprit was miniconda3/envs/stata-parquet/lib/libstdc++.so.6, which was linked to libstdc++.so.6.0.19, which lacks some symbols. If I re-link to libstdc++.so.6.0.24 then that goes through, but I get another error,
stata15-mp: symbol lookup error: ./parquet_unix.plugin: undefined symbol: _ZN5arrow5fieldERKSsRKSt10shared_ptrINS_8DataTypeEEbRKS2_IKNS_16KeyValueMetadataEE
The system-wide version is libstdc++.so.6.0.19 so maybe that has something to do with the crashes? The first error (libstdc++.so.6 being linked to an older library) I think arose because the system-wide version of conda that is installed is older, and I've been having issues using conda on the server...
@kylebarron Hey, since you're better than me at conda, is there a way to ignore the system-wide install/etc. when using conda? I installed miniconda on nber1 but it keeps wanting to use the system version, and if I try to call it via the full path miniconda3/bin/conda then it says that conda hasn't been set up. Have you had issues with conda on nberX? This is the msg btw
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with
$ echo ". /homes/nber/caceres/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc
or, for all users, enable conda with
$ sudo ln -s /homes/nber/caceres/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH. To do so, run
$ conda activate
in your terminal, or to put the base environment on PATH permanently, run
$ echo "conda activate" >> ~/.bashrc
Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file. You should manually remove the line that looks like
export PATH="/homes/nber/caceres/miniconda3/bin:$PATH"
^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^
Usually when I get that first error it's just that I need to use source activate instead. I put my local conda first on my path and I don't have any issues
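Putting a user-level conda first on PATH, as described above, might look roughly like this. The install location $HOME/miniconda3 is an assumption (adjust it to wherever miniconda actually lives), and prepend_to_path is a hypothetical helper name, not something conda provides.

```shell
# Hypothetical helper: put a directory at the front of PATH so that its
# executables shadow any system-wide copies with the same name.
prepend_to_path() {
    PATH="$1:$PATH"
    export PATH
}

# Assumed install location; adjust as needed.
prepend_to_path "$HOME/miniconda3/bin"

# Show which conda the shell now resolves (prints a note if none is found).
command -v conda || echo "conda not found on PATH"

# With the user-level conda first, the older activation syntax can be used:
# source activate stata-parquet
```

Adding the prepend_to_path call to ~/.bashrc would make this persistent, though the conda message quoted above recommends sourcing conda.sh instead for conda 4.4 and later.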
Even when I install the dependencies for stata-parquet through my user-level Conda, and start stata with LD_LIBRARY_PATH=${CONDA_PREFIX}/lib xstata, Stata crashes during parquet save auto.parquet on nberX.
Did you remove the NBER version from your Stata path?
Good point. I removed the NBER version and then it says

You should try to run do parquet.ado and that should tell you the specific error.
/homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/libstdc++.so.6:
version `CXXABI_1.3.8' not found
(required by /homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/libarrow.so.11)
cd /homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/
ln -sf libstdc++.so.6.0.24 libstdc++.so
ln -sf libstdc++.so.6.0.24 libstdc++.so.6
libstdc++ points to libstdc++.so.6.0.19
(stata-parquet) barronk at nber5 in ~/local/anaconda3/envs/stata-parquet/lib 13:52
> ll libstdc++.so.6
lrwxrwxrwx 1 barronk barronk 19 Feb 15 13:29 libstdc++.so.6 -> libstdc++.so.6.0.19*
Right, which does not have all the symbols.
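One way to confirm which ABI versions a given libstdc++ actually provides is to scan the library file for its version tags. A minimal sketch, assuming a grep that supports -a, -o and -E (GNU and BSD grep both do); list_abi_versions is a hypothetical helper name.

```shell
# Hypothetical helper: print the distinct GLIBCXX_* and CXXABI_* version
# tags embedded in a shared library (sorted lexicographically, not by version).
list_abi_versions() {
    grep -a -o -E '(GLIBCXX|CXXABI)_[0-9]+(\.[0-9]+)*' "$1" | sort -u
}

# Usage, inside the activated environment:
# list_abi_versions "$CONDA_PREFIX/lib/libstdc++.so.6" | grep CXXABI
```

If CXXABI_1.3.8 does not appear in the output, the library cannot satisfy the requirement reported for libarrow.so.11 in the error above.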
I upgraded conda and now libm and libc are incompatible, but I can't find a way to install glibc... Yay versioning problems.
So we need GCC >=4.9 somehow?