bgcflow
Error in running bigslice rule: Not in index, query string too large
run_status is now FEATURES_EXTRACTED
Building GCF models...
Dumping in-memory database content into /datadrive/data2/bgcflow/data/processed/strepto_combine/bigslice/cluster_as_6.1.1/result/data.db... Traceback (most recent call last):
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1828, in main
clustering = BirchClustering.run(
^^^^^^^^^^^^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/clustering/birch.py", line 177, in run
features_df = features_df.loc[bgc_ids]
~~~~~~~~~~~~~~~^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: '[62043] not in index'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1902, in <module>
return_code = main()
^^^^^^
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1069, in main
with Database(data_db_path, use_memory) as output_db:
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 26, in __exit__
self.close()
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 150, in close
self.dump_db_file()
File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 162, in dump_db_file
out_db.executescript(query)
sqlite3.DataError: query string is too large
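The second exception is raised while the in-memory database is written to disk: according to the traceback, `dump_db_file` passes a single query string to `executescript`, and SQLite rejects it as too large. A minimal sketch of one possible workaround, assuming the in-memory database is a plain `sqlite3` connection, is to copy it with `Connection.backup` instead of building one SQL script. The function name `dump_memory_db` and the toy `bgc` table are hypothetical and not part of BiG-SLiCE:

```python
import os
import sqlite3
import tempfile


def dump_memory_db(mem_conn: sqlite3.Connection, out_path: str) -> None:
    """Copy an in-memory SQLite database to a file (hypothetical helper)."""
    out_conn = sqlite3.connect(out_path)
    try:
        # backup() copies database pages directly, so no SQL text is built
        mem_conn.backup(out_conn)
    finally:
        out_conn.close()


# Toy in-memory database standing in for the real one
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE bgc (id INTEGER PRIMARY KEY, name TEXT)")
mem.executemany(
    "INSERT INTO bgc VALUES (?, ?)",
    [(i, f"bgc_{i}") for i in range(1000)],
)
mem.commit()

out_path = os.path.join(tempfile.mkdtemp(), "data.db")
dump_memory_db(mem, out_path)

# Re-open the file to confirm the rows were written
check = sqlite3.connect(out_path)
row_count = check.execute("SELECT COUNT(*) FROM bgc").fetchone()[0]
check.close()
print(row_count)
```

This only addresses the dump step; the `KeyError` raised earlier during clustering is a separate problem.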
Does this work on the smaller dataset?
Hi @matinnuhamunada,
I don't think I had this issue on smaller datasets.
I ran into this issue again on a dataset of 71k BGCs.
This time I didn't get the second error: sqlite3.DataError: query string is too large
Building GCF models...
Dumping in-memory database content into /datadrive/bgcflow/data/processed/mq_strepto/bigslice/cluster_as_7.0.0/result/data.db... 148.1952s
Traceback (most recent call last):
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/bin/bigslice", line 1902, in <module>
return_code = main()
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/bin/bigslice", line 1828, in main
clustering = BirchClustering.run(
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/bigslice/modules/clustering/birch.py", line 177, in run
features_df = features_df.loc[bgc_ids]
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: '[33221] not in index'
I found that one of the region GenBank files had issues in the feature extraction step.
The entry `33221` was not found in the temporary output `result/cache/bgc_features_1.pkl`. It is present in the `data.db` file, but the `bgc_features` table does not have any features extracted for this entry.
I am attaching the GenBank file below to replicate the issue.
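The `KeyError` itself can be reproduced without BiG-SLiCE: `DataFrame.loc` with a list of labels raises as soon as one label is absent from the index. A minimal sketch, using made-up ids and feature values rather than the real data, of how the ids lacking extracted features could be listed before clustering:

```python
import pandas as pd

# Toy feature matrix indexed by BGC id; id 33221 has no extracted features
features_df = pd.DataFrame(
    {"feat_a": [1, 0, 3], "feat_b": [0, 2, 1]},
    index=pd.Index([33219, 33220, 33222], name="bgc_id"),
)
bgc_ids = [33219, 33220, 33221, 33222]

# Ids requested for clustering that are absent from the feature index
missing_ids = pd.Index(bgc_ids).difference(features_df.index).tolist()
print(missing_ids)

# Selecting all ids at once fails because one label is missing
try:
    features_df.loc[bgc_ids]
    raised = False
except KeyError:
    raised = True

# Selecting only the ids that are present avoids the KeyError
present_ids = [i for i in bgc_ids if i in features_df.index]
subset = features_df.loc[present_ids]
```

Skipping the missing ids only hides the symptom; the entries in `missing_ids` still point at the region files whose feature extraction failed.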