doctr
[conda] Unable to make a conda build
Unfortunately, one of the project dependencies does not have any conda release or any way to make one. I opened an issue on their repo https://github.com/pymupdf/PyMuPDF/issues/938 to track this, but so far I haven't found any way to release the project on anaconda with this dependency.
Is it mandatory to support conda? If so, maybe we can switch to another pdf-reader lib.
Not mandatory, but this is a very common installation method for Python packages. We might investigate other options to replace the dependency, but we'll have to check for any performance drop first.
For reference, my initial issue on PyMuPDF (https://github.com/pymupdf/PyMuPDF/issues/938) was moved to this discussion: https://github.com/pymupdf/PyMuPDF/discussions/1137
Why not simply do a
conda run pip install <missing in conda package>
Especially since there is only one package
Hi @kchawla-pi,
So actually since then, there is also weasyprint that is missing a conda build. But it happens that I was thinking about getting back to the bottom of this yesterday. Worst case scenario, we'll make some features optional (such as HTML compatibility through weasyprint) so that the core build is available in conda.
Also please note that for now, the only important dependencies that would benefit from conda support (performance-wise) are PyTorch & TensorFlow :+1:
Anyway, we'll provide some updates on this very topic soon!
So now with #829 we are just missing weasyprint, right @fg-mindee ?
Nope, pypdfium2 also lacks support for conda installation. But that could be fixed, I'll ping them about this!
However, having doctr.io.pdf and doctr.io.html as extras would do the trick :+1:
And I think we should seriously consider that: especially for HTML, it's more about people in need of training data, so I would argue that most users don't benefit from weasyprint (which is also a problem for macOS users, see #815)
For PDFs, it's more important, so if we can get a conda build, our best course of action would probably be to move html/weasyprint to an extra! What do you think?
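To make the "extras" idea concrete, here is a minimal sketch of how an optional dependency could be guarded at import time. It is an illustration only, not docTR's actual code: the helper `require_optional`, the wrapper `read_html`, and the extra name `html` are hypothetical names chosen for this example.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    `extra` is the (assumed) name of the pip extra that provides the module.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'python-doctr[{extra}]'"
        ) from exc


def read_html(url: str) -> bytes:
    """Render a web page to PDF bytes (hypothetical wrapper for illustration)."""
    # weasyprint is only imported when the HTML feature is actually used,
    # so the core package stays importable without it.
    weasyprint = require_optional("weasyprint", extra="html")
    return weasyprint.HTML(url).write_pdf()
```

With this pattern, the core build would not need weasyprint at all, and users who call the HTML feature without it get an explicit hint instead of a bare import failure.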
I just checked and weasyprint does have a conda build now :raised_hands: https://anaconda.org/conda-forge/weasyprint
(But I still think we should move it to extra builds)
Sorry about the conda build - I never used conda myself and currently don't have the time/interest to learn it. Due to platform-specific binaries, the setup infrastructure of pypdfium2 is fairly complex already.
Perhaps a developer who is more familiar with conda can look into this at some point. I'd be happy to take a Pull Request that adds conda packaging to the release workflow.
That said, is there any reason you can't use pip?
@frgfm
@mara004 what do you mean by "any reason you can't use pip?"
pip installation is already available 👍 but conda builds are more specific to a given environment, so it's good if we can offer that means of installation as well. For the conda recipe, I don't know whether there are options to use pip (though I don't have experience writing conda recipes that build the C or C++ extensions of a Python library)
what do you mean by "any reason you can't use pip?"
I'm not familiar with the conda environment, so perhaps that was a silly question to ask. I basically meant: For what reason do we need an extra package on conda if the PyPI release can be used? As @kchawla-pi wrote:
Why not simply do a
conda run pip install <missing in conda package>
but conda builds are more specific to a given environment
I'd be curious to know in what way exactly conda builds are more specific?
I have read the comparison of conda to pip in Wikipedia, but the problem specified there can be solved with venv. pip allows dependency breakage, but very clearly warns about it, so I don't really see an issue in this regard...
Well, pip does not do sophisticated dependency resolution, unlike Conda. It's the same reason pipenv and poetry are used for package installations, but unlike Conda, they use PyPI's index. Each of these has its own algorithm for dependency resolution, with Pipenv being rather slow.
Conda is the de facto tool for data scientists in the Python ecosystem. Being able to use Mindee packages seamlessly through Conda would remove a big paper cut.
Okay, thanks for pointing this out! To me personally, conda still seems kind of a reinvented wheel and duplicated packaging work, but if there are people who like it and use it, I'm open to adding support if someone can implement it properly.
I can definitely second @kchawla-pi on that: I always try to find a conda installation before using pip, because it's much more careful about your existing env compatibility 👍
I tried to craft a package with conda-build recently but I'm afraid it didn't go very well at all. I managed to build a package for my host platform (Linux x86_64) but it took unendurably long for conda-build to set up the environment and assemble the package (and while doing so, the directory where I installed miniconda grew well above 3 GiB 🙄). I hope there are ways to speed up the process of running conda-build...
Wow, that must be so frustrating. I don't know about Conda packaging, but now I'm pissed at conda for making your job so difficult. I will try to take a gander at it in June.
Well, I don't know, perhaps I was just doing it the wrong way, but all the same it hasn't been very obvious to me how to do it.
In my experience, conda build is always a long operation. Base conda is known to have a slow dependency resolution procedure, so I personally use mamba (https://github.com/mamba-org/mamba), which is blazing fast for dependency installation (multi-threaded, rewritten in C++). I have to check whether that extends to package building as well.
I think the main problem is that, when running conda-build, it creates an isolated environment where all dependencies are installed. Now, if we want to craft more than one package, it would be essential that the environment can be reused so that dependencies don't need to be installed each time. Is there any option to do this?
@frgfm do you know an answer? :sweat_smile:
Even if we can get around the duration problem, I'll still need information about conda platform tags. We need an equivalent for each of the tags shown on https://pypi.org/project/pypdfium2/#files (section "Built Distributions").
Alternatively, perhaps a conda package could just wrap pip install somehow?
The easiest case would be if there were some tool to automatically convert wheels to conda packages, but I doubt this exists.
For reference, these two pages sound interesting: https://docs.conda.io/projects/conda-build/en/latest/user-guide/recipes/build-without-recipe.html https://docs.conda.io/projects/conda-build/en/latest/user-guide/wheel-files.html
Found the platform identifiers. conda convert actually lists them: osx-64,osx-arm64,linux-32,linux-64,linux-ppc64,linux-ppc64le,linux-s390x,linux-armv6l,linux-armv7l,linux-aarch64,win-32,win-64,all
musllinux is missing but I guess this doesn't matter so much.
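As a rough illustration of how wheel platform tags might line up with the conda identifiers listed above, here is a small sketch. The mapping is an assumption based on matching OS and architecture names, not an official conda or pypdfium2 table, and the example wheel tags are illustrative rather than copied from the PyPI page.

```python
from typing import Optional

# (OS prefix in the wheel tag, architecture suffix) -> conda platform identifier.
# Assumed correspondence; only identifiers from the `conda convert` list are used.
_TAG_TO_SUBDIR = {
    ("linux", "x86_64"): "linux-64",
    ("linux", "i686"): "linux-32",
    ("linux", "aarch64"): "linux-aarch64",
    ("linux", "armv7l"): "linux-armv7l",
    ("linux", "ppc64le"): "linux-ppc64le",
    ("linux", "s390x"): "linux-s390x",
    ("macosx", "x86_64"): "osx-64",
    ("macosx", "arm64"): "osx-arm64",
}


def conda_subdir(platform_tag: str) -> Optional[str]:
    """Return the conda identifier for a wheel platform tag, or None if unmapped."""
    if platform_tag == "win32":
        return "win-32"
    if platform_tag == "win_amd64":
        return "win-64"
    if platform_tag.startswith("musllinux"):
        # No musl-based identifier appears in the conda list above.
        return None
    for (os_name, arch), subdir in _TAG_TO_SUBDIR.items():
        matches_os = platform_tag.startswith((os_name, "many" + os_name))
        if matches_os and platform_tag.endswith("_" + arch):
            return subdir
    return None
```

For example, `conda_subdir("manylinux_2_17_x86_64")` gives `"linux-64"`, while any `musllinux` tag gives `None`.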
Hmm, I experimented locally and got relatively far but am now stuck with the problem that conda convert doesn't work for noarch packages, see https://github.com/conda/conda-build/issues/2611
I'm not very knowledgeable about the conda convert command, unfortunately!
Regarding the duration, I've had similar issues and there appear to be no workarounds (I'm waiting for mamba build, since mamba is a rewrite of conda but much faster, with parallel processing)
What I think could be done relatively easily would be to add the required files for users to create/install a conda package locally.
For this, one could treat pypdfium2 just like any other noarch: python package, but then we obviously don't get uploadable artefacts.
With a rather dreadful workaround, I did manage to build architecture-specific conda packages locally, but the problem is that conda always ties them to a single python version, although this is not necessary for ABI-level bindings. But I'm definitely not willing to go the inelegant path of python-specific builds. IMO, conda builds for pypdfium2 are only feasible if conda is changed to allow python-independent architecture-specific builds somehow.
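To illustrate what "python-independent but architecture-specific" looks like on the wheel side, here is a minimal sketch that reads the tags from a wheel filename following the PEP 427 naming scheme. The function name is hypothetical, and the filenames in the usage note use a made-up version number; the sketch assumes single (non-compressed) tags.

```python
def is_python_independent_binary(wheel_filename: str) -> bool:
    """True if the wheel is platform-specific but not tied to one Python ABI.

    PEP 427 names wheels as
    {distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
    """
    if not wheel_filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {wheel_filename!r}")
    parts = wheel_filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {wheel_filename!r}")
    # The last three components are always the python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    generic_python = python_tag.startswith("py")  # e.g. "py3", not "cp39"
    no_abi = abi_tag == "none"                    # no CPython ABI dependency
    platform_specific = platform_tag != "any"     # ships a native binary
    return generic_python and no_abi and platform_specific
```

A filename such as `pypdfium2-1.0.0-py3-none-manylinux_2_17_x86_64.whl` (hypothetical version) passes this check: one wheel per platform serves every Python 3 version, which is the property the conda packages built above could not express.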
Assuming your work is open source, where can I see your recipe and efforts, to build off of?
On Wed, May 25, 2022, 22:33 mara004 @.***> wrote:
Well, I don't know, perhaps I was just doing it the wrong way, but all the same it hasn't been very obvious to me how to do it.
Hi @kchawla-pi 👋, yes, the project is open source (Apache 2.0 License). What do you mean by recipe / efforts related to the conda build? Could you explain in a bit more detail what you need? 😅
@felixdittrich92 I think @kchawla-pi meant me, right? pypdfium2 is open source indeed, but I deleted the conda branch out of sheer frustration. Some attempts I never even pushed. Anyway, the state wasn't good at all and you're probably better off starting from scratch if you want to make an attempt yourself.