
Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust)

Open ByronHsu opened this issue 1 year ago • 3 comments

As the title says, I added more code splitters. The implementation is trivial, so I did not add separate tests for each splitter. Let me know if you have any concerns.
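For readers unfamiliar with the idea, here is a minimal sketch of how a per-language code splitter can work. It assumes a recursive strategy driven by a list of language-specific separators. The names `SEPARATORS`, `split_code`, and `_split` and the separator lists themselves are hypothetical and only illustrate the approach; they are not the code in this PR.

```python
from typing import Dict, List

# Hypothetical separator table, ordered from coarse to fine.
# The separators actually used in the PR may differ.
SEPARATORS: Dict[str, List[str]] = {
    "go": ["\nfunc ", "\nvar ", "\nconst ", "\ntype ", "\n\n", "\n", " "],
    "rust": ["\nfn ", "\nconst ", "\nlet ", "\n\n", "\n", " "],
}


def split_code(text: str, language: str, chunk_size: int) -> List[str]:
    """Split source code into chunks of at most chunk_size characters."""
    return _split(text, SEPARATORS[language], chunk_size)


def _split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator so keywords such as "func" stay with their body.
    parts = [pieces[0]] + [sep + piece for piece in pieces[1:]]

    chunks: List[str] = []
    current = ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # greedily merge small parts
            continue
        if current.strip():
            chunks.append(current)
        if len(part) > chunk_size:
            # Still too large: retry with the finer separators.
            chunks.extend(_split(part, rest, chunk_size))
            current = ""
        else:
            current = part
    if current.strip():
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    go_src = (
        "package main\n\n"
        "func a() int {\n\treturn 1\n}\n\n"
        "func b() int {\n\treturn 2\n}\n"
    )
    for chunk in split_code(go_src, "go", chunk_size=40):
        print(repr(chunk))
```

Under this design, supporting a new language only means adding one entry to the separator table, which is why a per-language splitter can be small.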

Fixes https://github.com/hwchase17/langchain/issues/5170

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @eyurtsev @hwchase17

ByronHsu avatar May 24 '23 05:05 ByronHsu

Could we add some unit tests for all of these? :) It may be easier to have a separate PR for each, but I'll let you decide.

dev2049 avatar May 24 '23 18:05 dev2049

@dev2049 OK, I will add tests and split the PR :)

ByronHsu avatar May 24 '23 21:05 ByronHsu

Added tests! Please take another look. Thanks!

@dev2049

By the way, may I ask why I run into mypy errors locally, but not on GitHub Actions?

~/learn-repo/langchain more-code-splitter !2 ❯ make lint
poetry run mypy .
langchain/evaluation/loading.py:5: error: Incompatible import of "load_dataset" (imported name has type "Callable[[str, Optional[str], Optional[str], Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None], Union[str, Split, None], Optional[str], Optional[Features], Optional[DownloadConfig], Optional[GenerateMode], bool, Optional[bool], bool, Union[str, Version, None], Union[bool, str, None], Union[str, TaskTemplate, None], bool, Any, KwArg(Any)], Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]]", local name has type "Callable[[str], List[Dict[Any, Any]]]")  [assignment]
langchain/evaluation/loading.py:8: error: No overload variant of "__getitem__" of "list" matches argument type "str"  [call-overload]
langchain/evaluation/loading.py:8: note: Possible overload variants:
langchain/evaluation/loading.py:8: note:     def __getitem__(self, SupportsIndex, /) -> Dict[Any, Any]
langchain/evaluation/loading.py:8: note:     def __getitem__(self, slice, /) -> List[Dict[Any, Any]]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "Dataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "IterableDataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:82: error: Value of type "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" is not indexable  [index]
Found 5 errors in 2 files (checked 1024 source files)
make: *** [lint] Error 1
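One possible explanation, offered as an assumption and not a confirmed diagnosis: the errors above all involve types from the Hugging Face `datasets` package, so the package may be installed in the local environment but absent (or at a different version) in the CI environment. A small sketch for comparing environments follows; the helper name `installed_versions` is hypothetical.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each top-level package name to its installed version.

    None means the package cannot be imported at all; "unknown" means it is
    importable but has no distribution metadata under that name.
    """
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            result[name] = None
            continue
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = "unknown"
    return result


if __name__ == "__main__":
    # Assumption: these are the packages relevant to the errors shown above.
    for name, version in installed_versions(["mypy", "datasets"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running the same script locally and in a CI step would show whether the two environments resolve `mypy` and `datasets` differently.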

ByronHsu avatar May 25 '23 05:05 ByronHsu

@eyurtsev Can you review? Thanks!

ByronHsu avatar May 27 '23 22:05 ByronHsu

@ByronHsu I can take a look at the mypy error on Monday. I assume you checked from master?

eyurtsev avatar May 28 '23 03:05 eyurtsev

@dev2049 I've added notebook examples and improved the tests. Could you please review again? Thanks!

ByronHsu avatar May 30 '23 02:05 ByronHsu