Error in running bigslice rule: Not in index, query string too large

Open • OmkarSaMo opened this issue 2 years ago • 3 comments

run_status is now FEATURES_EXTRACTED
Building GCF models...
Dumping in-memory database content into /datadrive/data2/bgcflow/data/processed/strepto_combine/bigslice/cluster_as_6.1.1/result/data.db... Traceback (most recent call last):
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1828, in main
    clustering = BirchClustering.run(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/clustering/birch.py", line 177, in run
    features_df = features_df.loc[bgc_ids]
                  ~~~~~~~~~~~~~~~^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: '[62043] not in index'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1902, in <module>
    return_code = main()
                  ^^^^^^
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/bin/bigslice", line 1069, in main
    with Database(data_db_path, use_memory) as output_db:
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 26, in __exit__
    self.close()
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 150, in close
    self.dump_db_file()
  File "/datadrive/data2/bgcflow/.snakemake/conda/c53188cf74abfdc284d24b807d803856_/lib/python3.11/site-packages/bigslice/modules/data/database.py", line 162, in dump_db_file
    out_db.executescript(query)
sqlite3.DataError: query string is too large
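
A note on the second exception: Python's sqlite3 module raises "query string is too large" when the SQL text it is given exceeds SQLite's SQL length limit, and the traceback suggests that dump_db_file hands the dump to executescript as a single string. Below is a minimal sketch of one possible alternative, assuming the goal is only to copy an in-memory database to a file. It uses the standard Connection.backup method, which copies database pages and builds no SQL string. The function name dump_memory_db is hypothetical, and this is not BiG-SLiCE's own code.

```python
import sqlite3


def dump_memory_db(mem_con: sqlite3.Connection, out_path: str) -> None:
    """Copy the contents of mem_con into a database file at out_path.

    Hypothetical helper: uses sqlite3's backup API instead of building
    one large SQL script, so no SQL length limit is involved.
    """
    out_con = sqlite3.connect(out_path)
    try:
        # backup() copies all pages of the source database into the target.
        mem_con.backup(out_con)
    finally:
        out_con.close()
```

This would only address the DataError raised while closing the database; the KeyError above it is a separate problem.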

OmkarSaMo • Dec 14 '22 09:12

Does this work on the smaller dataset?

matinnuhamunada • Dec 15 '22 08:12

Hi @matinnuhamunada,

I don't think I had this issue on smaller datasets.

I ran into this issue again on a dataset of 71k BGCs.

This time I didn't get the second error (sqlite3.DataError: query string is too large):

Building GCF models...
Dumping in-memory database content into /datadrive/bgcflow/data/processed/mq_strepto/bigslice/cluster_as_7.0.0/result/data.db... 148.1952s
Traceback (most recent call last):
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/bin/bigslice", line 1902, in <module>
    return_code = main()
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/bin/bigslice", line 1828, in main
    clustering = BirchClustering.run(
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/bigslice/modules/clustering/birch.py", line 177, in run
    features_df = features_df.loc[bgc_ids]
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6113, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/datadrive/bgcflow/.snakemake/conda/26e7e769f1f6f064495b3c7cb06b8207_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
KeyError: '[33221] not in index'
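
The KeyError comes from pandas itself: selecting with .loc and a list of labels fails as soon as any one label is absent from the index. A minimal sketch that reproduces this on toy data and lists the absent ids is below. The column names hmm_a and hmm_b, the toy values, and the helper missing_labels are made up for illustration; only the ids and the .loc pattern come from the traceback.

```python
import pandas as pd


def missing_labels(features_df: pd.DataFrame, bgc_ids: list) -> list:
    """Return the ids in bgc_ids that are absent from features_df.index."""
    return pd.Index(bgc_ids).difference(features_df.index).tolist()


# Toy feature matrix: rows are BGC ids, columns are hypothetical features.
features_df = pd.DataFrame(
    {"hmm_a": [1, 0], "hmm_b": [0, 3]},
    index=[33219, 33220],
)
bgc_ids = [33219, 33220, 33221]

try:
    features_df.loc[bgc_ids]  # same pattern as birch.py line 177
    raised = False
except KeyError:
    raised = True  # 33221 has no row, so pandas refuses the whole lookup

missing = missing_labels(features_df, bgc_ids)
print(raised, missing)
```

Checking for absent ids before the lookup makes it easier to see which BGCs have no feature rows.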

OmkarSaMo • Jul 12 '23 14:07

I found that one of the region GenBank files had issues in the feature extraction step.

Entry 33221 was not found in the temporary output result/cache/bgc_features_1.pkl. It is present in the data.db file, but the bgc_features table does not have any features extracted for it.
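
A check like this can be scripted against data.db. The sketch below is hedged: it assumes a table named bgc with an id column and a bgc_features table with a bgc_id column. Only the bgc_features table name appears above; the other names are assumptions about the schema and should be confirmed against the actual file. The helper bgcs_without_features is hypothetical.

```python
import sqlite3


def bgcs_without_features(con: sqlite3.Connection) -> list:
    """Return ids of BGC rows that have no matching row in bgc_features.

    Assumed schema (verify before use): bgc(id), bgc_features(bgc_id).
    """
    query = """
        SELECT b.id
        FROM bgc AS b
        LEFT JOIN bgc_features AS f ON f.bgc_id = b.id
        WHERE f.bgc_id IS NULL
        ORDER BY b.id
    """
    return [row[0] for row in con.execute(query)]


# Usage, with the path adjusted to the run's output directory:
# con = sqlite3.connect("result/data.db")
# print(bgcs_without_features(con))
```

Any id this returns would be expected to trigger the same "not in index" error at the clustering step.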

Attaching the GenBank file below to reproduce the issue.

NZ_AP024135.1.region002.zip

OmkarSaMo • Jul 13 '23 22:07