Install stata-parquet via `conda`

Open mcaceresb opened this issue 7 years ago • 36 comments

Given that conda turns out to be the best way to get this to work across computers, it would be great if users could install this via

conda install stata-parquet

mcaceresb avatar Oct 31 '18 02:10 mcaceresb

The "easy" way to do this, without making a full package, is just to define the dependencies in a file. Then you can run `conda env create -f environment.yml`.
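A rough sketch of that approach follows. The package list is an assumption borrowed from the dependencies discussed further down this thread, and `create_env` is a hypothetical helper name, so treat it as a starting point rather than a tested recipe:

```shell
# Write a minimal environment.yml. The package names are assumptions taken
# from later comments in this thread; adjust them to what the build needs.
cat > environment.yml <<'EOF'
name: stata-parquet
channels:
    - conda-forge
dependencies:
    - arrow-cpp
    - parquet-cpp
    - boost
EOF

# Hypothetical helper: create the environment described by the file.
# It is only defined here, not called, so nothing is installed by accident.
create_env() {
    conda env create -f environment.yml
}
```

Calling `create_env` and then activating the `stata-parquet` environment should give a shell in which the build can find the libraries.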

kylebarron avatar Oct 31 '18 02:10 kylebarron

Since the build artifacts really need to be compiled on each platform, it might be helpful for `build/` to be in `.gitignore`?

And the lib/plugin/?

kylebarron avatar Oct 31 '18 02:10 kylebarron

Also, don't know if this is on purpose, but the make step creates a second set of build artifacts within build/parquet/:

(stata-parquet) kyle at desktop in .../stata/stata-parquet/build on master ✗    [83bd3b3]  22:29
> tree      
.
├── changelog.md
├── parquet
│   ├── changelog.md
│   ├── parquet.ado
│   ├── parquet.pkg
│   ├── parquet.sthlp
│   ├── parquet_unix.plugin
│   └── stata.toc
├── parquet.ado
├── parquet.pkg
├── parquet.sthlp
├── parquet_tests.do
├── parquet_unix.plugin
└── stata.toc

1 directory, 13 files

kylebarron avatar Oct 31 '18 02:10 kylebarron

Ah, that's interesting... Yes, in this case I won't be distributing the plugin with the package. Build is also where I put the ado and help files for install from GitHub. I wonder if this will ever be installable that way... I guess not. Mmm.

The parquet subfolder is for the zip step. I need to make sure it's not created if that step isn't run.

mcaceresb avatar Oct 31 '18 02:10 mcaceresb

I see, well maybe not the binary file at least. The repo is already 3.3 MB in size, at least that much when compressed in transfer from Github during git clone.

kylebarron avatar Oct 31 '18 02:10 kylebarron

I did some pruning; the repo is down to 400KiB or so. The post where I got the commands said you should rebase and not merge or you could re-introduce some of the old files.

mcaceresb avatar Oct 31 '18 03:10 mcaceresb

Great! Yeah, I did that.

kylebarron avatar Oct 31 '18 03:10 kylebarron

@kylebarron Is there a way to fix the arrow/parquet version using conda? I assume so but I don't know what it is. This stopped compiling on my machine, which I assume has to do with versioning. The conda method still works, however. It'd be nice to fix it to a particular version so it doesn't break in the future...

mcaceresb avatar Feb 09 '19 14:02 mcaceresb

You can set versions in environment.yml https://github.com/kylebarron/cookiecutter-data-science/blob/master/%7B%7B%20cookiecutter.repo_name%20%7D%7D/environment.yml

kylebarron avatar Feb 09 '19 14:02 kylebarron

Can you check your package versions? I'm planning to fix them like so:

name: stata-parquet
channels:
    - conda-forge
dependencies:
    - arrow==0.12.1
    - arrow-cpp==0.11.1
    - boost==1.68.0
    - gcc==4.8.5
    - parquet-cpp==1.5.1

mcaceresb avatar Feb 09 '19 17:02 mcaceresb

I forgot my laptop at the office and so can't check.

I think arrow is completely unrelated and unnecessary. I think just arrow-cpp is necessary.

kylebarron avatar Feb 09 '19 18:02 kylebarron

These versions match what I currently have:

dependencies:
    - arrow-cpp==0.11.1
    - boost==1.68.0
    - gcc==4.8.5
    - parquet-cpp==1.5.1
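To compare another machine against these pins, something like the following could be used. It is only a sketch: `check_pins` is a hypothetical helper, and it assumes the input file holds the output of `conda list --export`, which prints one `name=version=build` entry per line.

```shell
# check_pins FILE
# FILE holds the output of `conda list --export` (name=version=build per line).
# The pins below are the versions quoted in this thread.
check_pins() {
    status=0
    for pin in arrow-cpp=0.11.1 boost=1.68.0 gcc=4.8.5 parquet-cpp=1.5.1; do
        name=${pin%%=*}      # text before the first "="
        version=${pin#*=}    # text after the first "="
        # Force string comparison so versions are never compared as numbers.
        if awk -F= -v n="$name" -v v="$version" \
            '($1 == n) && (($2 "") == (v "")) { found = 1 } END { exit !found }' "$1"; then
            echo "ok       $pin"
        else
            echo "MISMATCH $pin"
            status=1
        fi
    done
    return "$status"
}
```

Usage would be along the lines of `conda list -n stata-parquet --export > installed.txt` followed by `check_pins installed.txt`; a non-zero exit status means at least one package differs from the pinned version.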

kylebarron avatar Feb 12 '19 16:02 kylebarron

Did you get segfaults with arrow-cpp 0.12.0? Mohan did...

kylebarron avatar Feb 13 '19 17:02 kylebarron

I installed python and pyarrow for my thing above. I tried using the plugin again and I get a symbol lookup error, but not segfaults. I thought arrow was unrelated and arrow-cpp was the relevant one?

mcaceresb avatar Feb 13 '19 17:02 mcaceresb

(I'll try a clean install in a bit; he's also just on the aging servers, right?)

mcaceresb avatar Feb 13 '19 17:02 mcaceresb

Yes, meant to write arrow-cpp.

He's currently working on nberX but yes

kylebarron avatar Feb 13 '19 17:02 kylebarron

Just did a clean install on nber1. Segfault. Weird. Will debug later.

mcaceresb avatar Feb 13 '19 17:02 mcaceresb

I wonder if the segfault from conda is because the libraries, etc. were installed system-wide. I've been debugging, and it seems that the segfault appears whenever I try to write to an arrow table, both with the low level and high level reader.

One issue at a time: Even from conda I was not able to load the plugin on nber1. It seems the culprit was miniconda3/envs/stata-parquet/lib/libstdc++.so.6, which was linked to libstdc++.so.6.0.19, which lacks some symbols. If I re-link to libstdc++.so.6.0.24 then that goes through, but I get another error,

stata15-mp: symbol lookup error: ./parquet_unix.plugin: undefined symbol: _ZN5arrow5fieldERKSsRKSt10shared_ptrINS_8DataTypeEEbRKS2_IKNS_16KeyValueMetadataEE

The system-wide version is libstdc++.so.6.0.19 so maybe that has something to do with the crashes? The first error (libstdc++.so.6 being linked to an older library) I think arose because the system-wide version of conda that is installed is older, and I've been having issues using conda on the server...
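One way to read the undefined symbol above: in the Itanium C++ name mangling used by GCC, `Ss` stands for the old-ABI `std::string`, whereas the C++11 ABI mangles it with a `NSt7__cxx11` prefix. So the plugin may be asking for an old-ABI `arrow::field` that the installed libarrow does not export. That is only a possible explanation, not a confirmed diagnosis. The sketch below uses a hypothetical helper, `abi_of`, which is a rough heuristic on the mangled name:

```shell
# The undefined symbol reported by Stata, copied from the error above.
SYM='_ZN5arrow5fieldERKSsRKSt10shared_ptrINS_8DataTypeEEbRKS2_IKNS_16KeyValueMetadataEE'

# abi_of SYMBOL: guess which libstdc++ string ABI a mangled name refers to.
# Heuristic only: "Ss" can in principle appear inside unrelated identifiers.
abi_of() {
    case "$1" in
        *NSt7__cxx11*) echo "new" ;;      # std::__cxx11::basic_string
        *Ss*)          echo "old" ;;      # pre-C++11 std::string
        *)             echo "unknown" ;;  # no std::string in the signature
    esac
}

abi_of "$SYM"

# If binutils is installed, also print the human-readable signature.
if command -v c++filt >/dev/null 2>&1; then
    echo "$SYM" | c++filt
fi
```

If `nm` is available, running `nm -D --defined-only` on the libarrow in the conda environment and searching for `arrow5field` would show which variant the library actually exports.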

mcaceresb avatar Feb 15 '19 02:02 mcaceresb

@kylebarron Hey, since you're better than me at conda, is there a way to ignore the system-wide install/etc. when using conda? I installed miniconda on nber1 but it keeps wanting to use the system version, and if I try to call it via the full path miniconda3/bin/conda then it says that conda hasn't been set up. Have you had issues with conda on nberX? This is the msg btw

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with

    $ echo ". /homes/nber/caceres/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc

or, for all users, enable conda with

    $ sudo ln -s /homes/nber/caceres/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run

    $ conda activate

in your terminal, or to put the base environment on PATH permanently, run

    $ echo "conda activate" >> ~/.bashrc

Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/homes/nber/caceres/miniconda3/bin:$PATH"

^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

mcaceresb avatar Feb 15 '19 16:02 mcaceresb

Usually when I get that first error it's just that I need to use `source activate` instead. I put my local conda first on my path and I don't have any issues.
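A sketch of both activation routes for a user-level install, assuming it lives under `$HOME/miniconda3` (the `CONDA_ROOT` variable and the `activate_user_conda` helper are hypothetical names; the `etc/profile.d/conda.sh` path is the one named in the conda message above):

```shell
# Assumed location of the user-level install; override by exporting CONDA_ROOT.
CONDA_ROOT="${CONDA_ROOT:-$HOME/miniconda3}"

# activate_user_conda ENV: activate ENV using the user-level conda rather
# than whatever system-wide conda happens to be on PATH.
activate_user_conda() {
    if [ -f "$CONDA_ROOT/etc/profile.d/conda.sh" ]; then
        # conda >= 4.4 route: load the shell function, then activate.
        . "$CONDA_ROOT/etc/profile.d/conda.sh" && conda activate "$1"
    elif [ -f "$CONDA_ROOT/bin/activate" ]; then
        # Older route, equivalent to `source activate ENV`.
        . "$CONDA_ROOT/bin/activate" "$1"
    else
        echo "no conda install found under $CONDA_ROOT" >&2
        return 1
    fi
}
```

After `activate_user_conda stata-parquet`, `command -v conda` and `echo "$CONDA_PREFIX"` should show whether the user-level install, and not the system one, is in use.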

kylebarron avatar Feb 15 '19 16:02 kylebarron

Even when I install the dependencies for stata-parquet through my user-level Conda, and start Stata with `LD_LIBRARY_PATH=${CONDA_PREFIX}/lib xstata`, Stata crashes during `parquet save auto.parquet` on nberX.

kylebarron avatar Feb 15 '19 18:02 kylebarron

Did you remove the NBER version from your Stata path?

mcaceresb avatar Feb 15 '19 18:02 mcaceresb

Good point. I removed the NBER version and then it says: [image]

kylebarron avatar Feb 15 '19 18:02 kylebarron

You should try to run `do parquet.ado`; that should tell you the specific error.

mcaceresb avatar Feb 15 '19 18:02 mcaceresb

/homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/libstdc++.so.6:
version `CXXABI_1.3.8' not found
(required by /homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/libarrow.so.11)

kylebarron avatar Feb 15 '19 18:02 kylebarron

cd /homes/nber/barronk/local/anaconda3/envs/stata-parquet/lib/
ln -sf libstdc++.so.6.0.24 libstdc++.so
ln -sf libstdc++.so.6.0.24 libstdc++.so.6

mcaceresb avatar Feb 15 '19 18:02 mcaceresb

libstdc++ points to libstdc++.so.6.0.19

(stata-parquet) barronk at nber5 in ~/local/anaconda3/envs/stata-parquet/lib  13:52
> ll libstdc++.so.6
lrwxrwxrwx 1 barronk barronk 19 Feb 15 13:29 libstdc++.so.6 -> libstdc++.so.6.0.19*

kylebarron avatar Feb 15 '19 18:02 kylebarron

Right, which does not have all the symbols.
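To check this directly, the version tags a given libstdc++ provides can be listed from the file itself. As far as I know, `CXXABI_1.3.8` first shipped with the libstdc++ from GCC 4.9, while `libstdc++.so.6.0.19` corresponds to GCC 4.8, which would explain the error. The helpers below are hypothetical names, and the sketch only relies on `grep` and `sort`:

```shell
# cxxabi_versions FILE: list the CXXABI_* version tags embedded in a library.
# grep -a treats the binary as text; -o prints only the matching tags.
cxxabi_versions() {
    grep -a -o 'CXXABI_[0-9.]*' "$1" | sort -u
}

# has_cxxabi FILE TAG: succeed if FILE provides exactly that version tag.
has_cxxabi() {
    cxxabi_versions "$1" | grep -q -x -F "$2"
}
```

For example, from the environment's `lib/` directory, `has_cxxabi "$(readlink -f libstdc++.so.6)" CXXABI_1.3.8 || echo "too old for libarrow"` would show whether the library the symlink currently resolves to is new enough.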

mcaceresb avatar Feb 15 '19 18:02 mcaceresb

I upgraded conda and now libm and libc are incompatible, but I can't find a way to install glibc... Yay versioning problems.

mcaceresb avatar Feb 15 '19 18:02 mcaceresb

So we need GCC >=4.9 somehow?

kylebarron avatar Feb 15 '19 18:02 kylebarron