[Bug]: CodeSplitter takes exactly 1 argument (2 given)
Bug Description
CodeSplitter doesn't seem to be working as expected
Version
0.10.36
Steps to Reproduce
from llama_index.core.node_parser import CodeSplitter
from llama_index.readers.file import FlatReader
from pathlib import Path
documents = FlatReader().load_data(Path("index.py"))
splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0].text
Relevant Logs/Tracebacks
python test.py
Could not get parser for language python. Check https://github.com/grantjenks/py-tree-sitter-languages#license for a list of valid languages.
Traceback (most recent call last):
File "/Users/framsouza/git-assistant/test.py", line 6, in <module>
splitter = CodeSplitter(
^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/text/code.py", line 63, in __init__
parser = tree_sitter_languages.get_parser(language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "tree_sitter_languages/core.pyx", line 19, in tree_sitter_languages.core.get_parser
File "tree_sitter_languages/core.pyx", line 14, in tree_sitter_languages.core.get_language
TypeError: __init__() takes exactly 1 argument (2 given)
Hello @framsouza! 👋 I'm Dosu, a bot here to lend a hand with bugs, answer questions, and guide you on your journey to becoming a contributor, all while we wait for a human maintainer to chime in. Delighted to meet you! I'm diving into your issue and will get back to you with a full answer shortly.
@framsouza yea this seems like an issue with the tree-sitter version. I don't actually know what the solution is yet, but a band-aid fix was pip install -U "tree-sitter<0.22.0"
hey @logan-markewich , i just gave it a try:
tree-sitter 0.21.3
tree-sitter-go 0.21.0
tree-sitter-languages 1.10.2
llama-index 0.10.36
llama-index-core 0.10.36
pipeline = IngestionPipeline(
    transformations=[
        CodeSplitter(language="go", chunk_lines=20, chunk_lines_overlap=5),
        Settings.embed_model
    ],
    vector_store=get_es_vector_store()
)
got a different error,
Data loaded from local directory.
Starting the pipeline...
Parsing nodes: 0%| | 0/2002 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/Users/framsouza/git-assistant/index.py", line 166, in <module>
main()
File "/Users/framsouza/git-assistant/index.py", line 159, in main
pipeline.run(show_progress=True, documents=documents)
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 734, in run
nodes = run_transformations(
^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 124, in run_transformations
nodes = transform(nodes, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 127, in __call__
return self.get_nodes_from_documents(nodes, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 76, in get_nodes_from_documents
nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 145, in _parse_nodes
splits = self.split_text(node.get_content())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/text/code.py", line 161, in split_text
raise ValueError(f"Could not parse code with language {self.language}.")
ValueError: Could not parse code with language go.
I can see go in the list of supported languages.
@framsouza Seems like the if statement at the top is failing
This is the code. I'm moving from SentenceSplitter to CodeSplitter since I'm ingesting code-type content
The issues you're encountering with CodeSplitter in your LlamaIndex setup seem to stem from two main areas: the initialization of CodeSplitter and the processing of Go language code.
- Initialization Error: The TypeError suggests a mismatch in the expected arguments for the CodeSplitter constructor. Ensure all required parameters are correctly provided and match the expected types. The error indicates that the constructor is not receiving the expected number of arguments, which could be due to a syntax error or missing parameters in your call.
- Processing Go Language Code: The ValueError indicates a problem with parsing Go language code. This could be due to the tree_sitter_languages package not supporting Go or a misconfiguration in your setup. Ensure that:
  - The tree_sitter_languages package is correctly installed and supports Go.
  - You have the correct versions of tree_sitter and tree_sitter_languages that include Go language support. If Go is not supported, you may need to manually load a Go language grammar for tree-sitter.

For both issues, verify that you're using compatible versions of llama-index, tree-sitter, and tree_sitter_languages. If the problem persists, consider manually initializing a tree-sitter Parser for Go and passing it to the CodeSplitter constructor, ensuring you have a Go language grammar available for tree-sitter.

Lastly, ensure your error handling in the implementation is robust, especially for unsupported languages or parsing errors, to gracefully manage these situations.
Sources
- llama-index-core/llama_index/core/node_parser/text/code.py
- llama-index-legacy/llama_index/legacy/node_parser/text/code.py
- create a pure python env
- install llama_index
my version is 0.10.37 and this is my code:
import tree_sitter_java as tsjava
from pathlib import Path
from llama_index.core.node_parser import CodeSplitter
from llama_index.readers.file.flat.base import FlatReader
from tree_sitter import Language, Parser
CODEBASE_DIR = "your code"
JAVA_LANGUAGE = Language(tsjava.language())
parser = Parser(JAVA_LANGUAGE)
language = "java"
documents = FlatReader().load_data(Path(CODEBASE_DIR))
splitter = CodeSplitter(
    parser=parser,
    language=language,
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))
it works!
https://github.com/grantjenks/py-tree-sitter-languages/issues/64
@framsouza perhaps you can build the parser object outside and pass it into the CodeSplitter?
It was provided specifically to not couple with tree-sitter-languages.
something like:
from tree_sitter import Language, Parser
import tree_sitter_go as tsgo
LANG = Language(tsgo.language())
parser = Parser(LANG)
CodeSplitter(language='go', parser=parser)
@logan-markewich will it make sense to add version check for tree-sitter within the library?
import pkg_resources
from packaging.version import Version

installed_treesitter_package = [
    (d.project_name, d.version)
    for d in pkg_resources.working_set
    if d.project_name.lower() == 'tree-sitter'
]

if len(installed_treesitter_package) == 0:
    raise ImportError("tree-sitter not installed")
else:
    print("tree-sitter installed")
    print("Version: ", installed_treesitter_package[0][1])
    if Version(installed_treesitter_package[0][1]) > Version('0.22.0'):
        print("Using tree-sitter version > 0.22.0")
        ...  # create with new API
    else:
        print("Using tree-sitter version <= 0.22.0")
        ...  # process using old API
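As an aside, pkg_resources is deprecated in newer setuptools, so the same check can be sketched with the standard library's importlib.metadata instead. This is only a sketch of the idea, not LlamaIndex's actual implementation; the helper names are made up:

```python
from importlib import metadata


def tree_sitter_version():
    """Return the installed tree-sitter version as an int tuple, or None if absent."""
    try:
        raw = metadata.version("tree-sitter")
    except metadata.PackageNotFoundError:
        return None
    return tuple(int(p) for p in raw.split(".")[:3] if p.isdigit())


def uses_new_api(version):
    """tree-sitter >= 0.22.0 changed Language/Parser construction (the new API)."""
    return version is not None and version >= (0, 22, 0)
```

A library-side check like this could then branch to the old tree_sitter_languages.get_parser path or the new per-language grammar path.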
@framsouza yea this seems like an issue with the tree-sitter version. I don't actually know what the solution is yet, but a band-aid fix was
pip install -U "tree-sitter<0.22.0"
This worked for me with llama_index(version 0.10.65). Thank you
llama-index==0.11.16
load_dotenv==0.1.0
tree_sitter_languages==0.21.3
llama-index-vector-stores-elasticsearch==0.3.2
llama-index-embeddings-openai==0.2.5
llama_index.llms.ollama==0.3.3
spacy==3.8.2
IPython==8.28.0
tabulate==0.9.0
The CodeSplitter seems to work; however, the chunk size limit doesn't seem to be applied.
parsers_and_extensions = [
    (SentenceSplitter(chunk_size=1000, chunk_overlap=200), [".md"]),
    (CodeSplitter(chunk_lines=50, language='python'), [".py", ".ipynb"]),
    (CodeSplitter(language='javascript'), [".js"]),
    (CodeSplitter(language='typescript'), [".ts"]),
    (JSONNodeParser(), [".json"]),
    (CodeSplitter(language='go'), [".go"]),
    (CodeSplitter(language='hcl'), [".hcl"]),
    (CodeSplitter(language='bash'), [".bash"]),
    (CodeSplitter(language='yaml'), [".yml", ".yaml"]),
]

for parser, extensions in parsers_and_extensions:
    matching_files = []
    for ext in extensions:
        matching_files.extend(glob.glob(f"{local_repo_path}/**/*{ext}", recursive=True))
    if len(matching_files) > 0:
        file_summary.append(f"Found {len(matching_files)} {', '.join(extensions)} files in the repository.")
        loader = SimpleDirectoryReader(input_dir=local_repo_path, required_exts=extensions, recursive=True)
        docs = loader.load_data()
        parsed_nodes = parser.get_nodes_from_documents(docs)
        print_docs_and_nodes(docs, parsed_nodes)
        nodes.extend(parsed_nodes)
    else:
        file_summary.append(f"No {', '.join(extensions)} files found in the repository.")

collect_and_print_file_summary(file_summary)
print("\n")
return nodes
output example:
---
Document ID: 2282832b-b036-4acc-9e38-259fa8f6de50
Document Content:
import json
import yaml
import os
from datetime import datetime
from dotenv import load_dotenv
from config.load_yaml import load_yaml
from services.pagerduty import connect_pagerduty
from services.rootly import connect_rootly
from validation.validate_documentation import validate_docs
from utils.determine_maturity import determine_maturity_level
from validation.validate_catalog import connect_catalog
from validation.validate_jira import validate_jira
from validation.validate_pd_service import validate_pd_service
from validation.validate_pd_escalation_policy import validate_pd_escalation_policy
from validation.validate_logs_and_metrics import validate_logs_config, validate_metrics_config
from validation.validate_slo import validate_slo
from validation.validate_alerts import validate_alerts
from validation.validate_dashboards import validate_dashboards
from validation.validate_rootly import validate_rootly
from validation.validate_synthetics import validate_synthetics
load_dotenv('.env')
def load_yaml(file_path):
    with open(file_path, 'r') as file:
        return yaml.safe_load(file)

def get_service_name(file_path, use_default):
    if use_default:
        return os.path.basename(os.getcwd())
    else:
        return os.path.basename(os.path.dirname(file_path)) or "default_service_name"

def main():
    default_file_path = 'example.yaml'
    use_default = input(f"Do you want to use the default file path '{default_file_path}'? (Y/N, default: Y): ").strip().lower() or 'y'
    if use_default == 'y':
        file_path = default_file_path
        use_default = True
    else:
        file_path = input("Enter the path of the YAML file: ")
        use_default = False
    service_name = get_service_name(file_path, use_default)
    timestamp = datetime.now().isoformat()
    print(f"{timestamp}: Service Name: {service_name}")
    data = load_yaml(file_path)
    results = {
        'service_name': service_name,
        'documentation': validate_docs(data),
        'service_catalog': connect_catalog(data),
        'jira_issue_tracking': validate_jira(data),
        'escalation_policy': validate_pd_escalation_policy(data),
        'pagerduty_service': validate_pd_service(data),
        'system_logs': validate_logs_config(data),
        'system_metrics': validate_metrics_config(data),
        'slo_name': validate_slo(data),
        'alerts': validate_alerts(data),
        'dashboards': validate_dashboards(data),
        'rootly_incident': validate_rootly(data),
        'synthetics': validate_synthetics(data)
    }
    maturity_level = determine_maturity_level(results)
    results['service_maturity_level'] = maturity_level
    json_results = json.dumps(results, indent=2)
    print(json_results)

if __name__ == "__main__":
    main()
---
Document ID: 2dc30e86-fa74-4149-8a07-1b5e0732d745
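The extension-to-splitter table in the snippet earlier in this comment can also be expressed as a plain lookup, which makes the routing easy to unit-test without loading any documents. This is only a sketch; EXT_TO_LANGUAGE and language_for are made-up names, not LlamaIndex APIs:

```python
import os

# Mirrors the parsers_and_extensions mapping from the comment above
EXT_TO_LANGUAGE = {
    ".py": "python", ".ipynb": "python",
    ".js": "javascript", ".ts": "typescript",
    ".go": "go", ".hcl": "hcl", ".bash": "bash",
    ".yml": "yaml", ".yaml": "yaml",
}


def language_for(path: str):
    """Return the CodeSplitter language for a file path, or None if unmapped."""
    ext = os.path.splitext(path)[1].lower()
    return EXT_TO_LANGUAGE.get(ext)
```

A dict like this keeps the dispatch logic in one place and separates "which language is this file" from "how do we split it".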
Around the fall of 2024 time frame, tree-sitter and tree-sitter-languages went through a bunch of breaking changes that I also experienced as part of one of my projects. The tree-sitter-languages library is pretty much defunct even though the package maintainer hasn't labeled it that way. You can still use it with older tree-sitter, but not newer.
The later versions of tree-sitter only interoperate with tree-sitter-<language> libraries. For example tree-sitter-python, tree-sitter-go, etc. There is no longer a single library for all the languages.
https://pypi.org/project/tree-sitter-python/ https://pypi.org/project/tree-sitter-go/ https://pypi.org/project/tree-sitter-java/
The new way to use tree-sitter is more like:
import importlib
import tree_sitter

tree_sitter_lang = importlib.import_module(f"tree_sitter_{lang}")
language = tree_sitter.Language(tree_sitter_lang.language())
parser = tree_sitter.Parser(language)
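Since the per-language packages follow a predictable naming scheme, the dynamic import above can be wrapped in a small loader. A sketch assuming tree-sitter >= 0.22 and that the relevant tree_sitter_<lang> wheel is installed; grammar_module_name and load_parser are hypothetical helpers, not part of any library:

```python
import importlib


def grammar_module_name(lang: str) -> str:
    """Map a language name to its grammar package, e.g. "c-sharp" -> "tree_sitter_c_sharp"."""
    return "tree_sitter_" + lang.replace("-", "_")


def load_parser(lang: str):
    """Import the per-language grammar and build a Parser (requires tree-sitter >= 0.22)."""
    import tree_sitter  # imported lazily so the name mapping works without it installed
    mod = importlib.import_module(grammar_module_name(lang))
    language = tree_sitter.Language(mod.language())
    return tree_sitter.Parser(language)
```

The resulting parser could then be passed to CodeSplitter via its parser argument, as shown elsewhere in this thread.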
tree-sitter-language-pack is great work; with it I can quickly use any language 👍
from tree_sitter_language_pack import get_parser
rust_parser = get_parser('rust')
coder = CodeSplitter(language="rust", parser=rust_parser)
code_list = coder.split_text(demo_rust_code)