kotaemon icon indicating copy to clipboard operation
kotaemon copied to clipboard

fix: update setup instructions

Open bfdykstra opened this issue 1 year ago • 17 comments

Description

  • Please include a summary of the changes and the related issue.

I added documentation to the readme that adds the steps I took to get the app running locally without docker.

I also edited the run_macos.sh script to include the installation of dependencies like homebrew and libmagic.

I added unstructured to the dependency array of kotaemon pyproject.toml

  • Fixes #138

Type of change

  • [x] New features (non-breaking change).
  • [x] Bug fix (non-breaking change).
  • [ ] Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

  • [ ] I have performed a self-review of my code.
  • [ ] I have added thorough tests if it is a core feature.
  • [x] There is a reference to the original bug report and related work.
  • [ ] I have commented on my code, particularly in hard-to-understand areas.
  • [ ] The feature is well documented.

bfdykstra avatar Aug 28 '24 20:08 bfdykstra

Thank you for the feedback @cin-albert ! I addressed your comments as best I could. Basically I think that the dependencies of unstructured need to be documented in order for any file upload and parsing to work properly.

Additionally, (and this could be a separate issue, does not need to be addressed here) I think it would be advantageous to include a clickable package or release that non technical users could run without opening their terminal.

bfdykstra avatar Aug 29 '24 16:08 bfdykstra

@bfdykstra the instruction ollama pull llama3.1:8b for model

but for LLM settings

api_key: ollama
base_url: http://localhost:11434/v1/
model: gemma2:2b (for llm) | nomic-embed-text (for embedding)

Better to show ollama pull gemma2:2b to keep it consistent ? perfer to give gemma2 as its smalest model to run on most lower end hardware.

Niko-La avatar Sep 04 '24 06:09 Niko-La

@bfdykstra the instruction ollama pull llama3.1:8b for model

but for LLM settings

api_key: ollama
base_url: http://localhost:11434/v1/
model: gemma2:2b (for llm) | nomic-embed-text (for embedding)

Better to show ollama pull gemma2:2b to keep it consistent ? perfer to give gemma2 as its smalest model to run on most lower end hardware.

@Niko-La throughout most of the docs it says to pull llama3.1:8b so there are inconsistencies everywhere. For example here it says to pull the llama3.1 and then immediately below it says to use gemma.... I want to keep the scope of this PR just to the initial readme and to fix the dependencies so that users can get started with default openai

bfdykstra avatar Sep 04 '24 15:09 bfdykstra

@lone17 or @cin-albert Is there anything I should change here?

bfdykstra avatar Sep 09 '24 16:09 bfdykstra

@bfdykstra, Sorry it took me so long to get back to you.

Agree with you that Unstructured should be treated as a mandatory dependency for Kotaemon and homebrew is used to install system requirements. However, according to Unstructured's document, we might need to install other system dependencies as well, such as libreoffice and tesseract. Could you help with this?

FYI, Kotaemon uses Unstructured for a wide variety of doc types. We'll need pyproject.toml to install extra required for those documents, for example, pip install unstructured[pdf] for PDFs https://github.com/Cinnamon/kotaemon/blob/73a476979e96b1475d409dfc7542fe0603856e44/libs/kotaemon/kotaemon/indices/ingests/files.py#L38-L54

cin-albert avatar Sep 11 '24 19:09 cin-albert

@bfdykstra @cin-albert cc @taprosoft Hi, I have some concerns about making unstructured a mandatory dependency. As it needs quite a few extra system dependencies, I think we should consider it as an advanced feature only. What I mean by that is:

  • not include it in the default dependencies, but instead in the adv section https://github.com/Cinnamon/kotaemon/blob/73a476979e96b1475d409dfc7542fe0603856e44/libs/kotaemon/pyproject.toml#L64-L73
  • not installing it in the run_*.sh scripts, accepting the fact that we cannot ship it with the non-tech installation route until we have something like #48
  • include it in the doc for developers
  • include it in the docker installation
  • automatically fall back to a different reader when unstructured is not installed, or gracefully show an error message https://github.com/Cinnamon/kotaemon/blob/73a476979e96b1475d409dfc7542fe0603856e44/libs/kotaemon/kotaemon/indices/ingests/files.py#L38-L54

lone17 avatar Sep 12 '24 03:09 lone17

Is the package as it stands right now runnable and/or working without unstructured installed?

If docker container install gets working from #274 then this PR may be redundant

My only goal here is to make it easier for people to get up and running :)

bfdykstra avatar Sep 12 '24 04:09 bfdykstra

If docker container install gets working from #274 then this PR may be redundant

My only goal here is to make it easier for people to get up and running :) Oh no it's not redundant since we still need to update the readme :)

I'm unsure about the docker installation, would need @phv2312's confirmation on this one.

Is the package as it stands right now runnable and/or working without unstructured installed?

It used to be. But I haven't tried it since the big v0.4.0 update, can you help confirm this point @cin-albert ?

lone17 avatar Sep 12 '24 04:09 lone17

@lone17

As it needs quite a few extra system dependencies, I think we should consider it as an advanced feature only

I agree that Unstructured is complicated to install in our bash/batch scripts because it requires some system dependencies and will involve system permission issues. However, if we omit unstructured in the installer, it could conflict with our original value of making Kotaemon friendly to non-dev users.

Yes, the best solution is to fall back to another indexing method in case Unstructured is not available, but for now, we have not prepared them so Unstructured is still an essential component.

In the short term, this is the simplest solution, which can satisfy the community, at least for MacOS users with homebrew that does not require sudo. In the long term, we are working on a better package management method (since we have many advanced features, and people tend to add more dependencies to this project), which could address the problem of Unstructured.

Nice to hear your perspectives.

@bfdykstra

Is the package as it stands right now runnable and/or working without unstructured installed?

Yes, Kotaemon can work without Unstructured, but you can only ingest .pdf, .html, .mhtml, and '.xlsx documents.

If docker container install gets working from https://github.com/Cinnamon/kotaemon/pull/274 then this PR may be redundant

I don't think your PR overlaps with the Docker update PR. There are still a large amount of users preferring the bash/batch installers over docker.

cin-albert avatar Sep 12 '24 08:09 cin-albert

@bfdykstra @lone17 After discussing with the maintenance team, I would like to change my opinion: Unstructured is still treated as optional because of the heavy installation. We will add methods to allow users to install advanced features in the future.

Sorry for the confusion, and thanks @bfdykstra for your contributions. Please update your PR accordingly, or if you think it's unnecessary now, I would like to close the PR

cin-albert avatar Sep 12 '24 09:09 cin-albert

Okay thank you for the careful consideration!

To summarize:

  1. I will revert the changes the to run_macos.sh script
  2. The unstructured dependency will go under the adv array in pyproject.toml
  3. In the Readme under "For developers" section, I will add in instructions and links to the system dependencies if they would like to process files other than .pdf, .html, .mhtml, and .xlsx documents.

Anything I missed or that you would add?

bfdykstra avatar Sep 12 '24 16:09 bfdykstra

@bfdykstra That's all we need for now. Also, please double-check the pre-commit style check failure

cin-albert avatar Sep 13 '24 09:09 cin-albert

@bfdykstra Thank you for your contribution and sorry for having you edit the run script in vain. 🙏🏼

lone17 avatar Sep 13 '24 11:09 lone17

@cin-albert @lone17 All set now - take a look?

bfdykstra avatar Sep 16 '24 00:09 bfdykstra

LGTM now, but let me fix the Lint check for commit messages before merge.

Btw, please ensure your secret is rotated.

cin-albert avatar Sep 17 '24 07:09 cin-albert

Thank you!

Yes I rotated it 🤦‍♂️That’s the danger of having .env tracked. Thank you for checking

On Tue, Sep 17, 2024 at 12:59 AM Quang (Albert) @.***> wrote:

LGTM now, but let me fix the Lint check for commit messages before merge.

Btw, please ensure your secret is rotated.

— Reply to this email directly, view it on GitHub https://github.com/Cinnamon/kotaemon/pull/144#issuecomment-2354814845, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACP3ZFW2P2V4FCB3VQ2TVUDZW7OPDAVCNFSM6AAAAABNJBDPOOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUHAYTIOBUGU . You are receiving this because you were mentioned.Message ID: @.***>

bfdykstra avatar Sep 17 '24 15:09 bfdykstra

LGTM. I don't have reviewer privilege though :))

lone17 avatar Sep 22 '24 16:09 lone17

Thanks all @bfdykstra @cin-albert @lone17. Sorry for the long wait.

taprosoft avatar Sep 29 '24 15:09 taprosoft