guardrails icon indicating copy to clipboard operation
guardrails copied to clipboard

[feat] Distribute NLTK tokenizers used in the core package

Open CalebCourier opened this issue 1 year ago • 4 comments

Description [Add a description of the feature] Since we now required nltk and the punkt tokenizer during the validation loop for chunking during streaming, we should either download and distribute the punkt tokenizer with the library or find a way to include it during the install phase. From what I can see the only way to perform a post-install flow is if we switch back to setuptools instead of poetry, but even that may not work for all distribution methods.

Why is this needed [If you have a concrete use case, add details here.] Currently we download the tokenizer, if it dosen't exist, during runtime which can cause issues in certain environments like kubernetes. See https://github.com/guardrails-ai/guardrails/issues/821

Implementation details The simplest path would be to download the tokenizer during our deploy script and included it in the distributable.

The downside to this approach is the tokenizer is ~38 MB.

An alternative, like previously mentioned, is to abandon Poetry and switch back to setuptools. This should allow us to implement post-install functionality in the setup.py; though we would need to verify this works in all the various ways the library can be installed.

Another alternative is to find a smaller, installable tokenizer to perform chunking.

End result [How should this feature be used?] No nltk downloads are performed during runtime.

CalebCourier avatar Jun 14 '24 17:06 CalebCourier

NLTK also poses problems for the AWS Lambda runtime. Also, agree on the fact that the size of the tokenizer itself is quite large. This is roughly 15% of the total allowed dependency size (~250MB) when deploying Lambdas with layers.

vprecup avatar Jul 17 '24 16:07 vprecup

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Aug 17 '24 01:08 github-actions[bot]

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Sep 19 '24 03:09 github-actions[bot]

Instead of distributing, we're taking a different approach to not require nltk in the core package in favor of other, lighter tokenizers; while some validators may still require nltk, it will not be necessary for all. @AlejandroEsquivel should have more details on this soon.

CalebCourier avatar Sep 19 '24 13:09 CalebCourier

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Oct 20 '24 03:10 github-actions[bot]

This issue was closed because it has been stalled for 14 days with no activity.

github-actions[bot] avatar Nov 04 '24 03:11 github-actions[bot]

no longer relevant - we removed nltk in 0.6.0

zsimjee avatar Nov 04 '24 05:11 zsimjee