
Cannot retrieve some cluster files

Open L40S38 opened this issue 1 year ago • 2 comments

Hi.

I ran the commands to evaluate on the Vertex dataset and on the ProSPECCTS dataset, and in both cases I got almost the same error, shown below.

(I exported $STRUCTURE_DATA_DIR=$DEEPLYTOUGH/datasets_structure. I have also shortened the path to the repository below.)

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5324k  100 5324k    0     0  1471k      0  0:00:03  0:00:03 --:--:-- 1472k
INFO:datasets.vertex:Preprocessing: downloading data and extracting pockets, this will take time.
INFO:root:cluster file path: DeeplyTough/datasets_structure/bc-30.out
WARNING:root:Cluster definition not found, will download a fresh one.
WARNING:root:However, this will very likely lead to silent incompatibilities with any old 'pdbcode_mappings.pickle' files! Please better remove those manually.
Traceback (most recent call last):
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 68, in <module>
    main()
  File "DeeplyTough/deeplytough/scripts/vertex_benchmark.py", line 32, in main
    database.preprocess_once()
  File "DeeplyTough/deeplytough/datasets/vertex.py", line 49, in preprocess_once
    clusterer = RcsbPdbClusters(identity=30)
  File "DeeplyTough/deeplytough/misc/utils.py", line 248, in __init__
    self._fetch_cluster_file()
  File "DeeplyTough/deeplytough/misc/utils.py", line 262, in _fetch_cluster_file
    self._download_cluster_sets(cluster_file_path)
  File "DeeplyTough/deeplytough/misc/utils.py", line 253, in _download_cluster_sets
    request.urlretrieve(f'https://cdn.rcsb.org/resources/sequence/clusters/bc-{self.identity}.out', cluster_file_path)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "anaconda3/envs/deeplytough/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
```

Evaluation on the TOUGH-M1 dataset succeeded, so I suspect that one of the URLs for the Vertex and ProSPECCTS data has expired. Would you mind checking?

L40S38 avatar Sep 24 '22 07:09 L40S38

Hey @L40S38, thanks for opening a ticket. It seems this is due to the RCSB PDB cluster file moving. See https://www.rcsb.org/news/feature/6205750d8f40f9265109d39f (in fact, the old file has been discontinued and replaced, so this may even have scientific implications for DeeplyTough).

I will have a look into it. If you don't need the cluster file (e.g. if you are happy with random splitting, or you just want to run the existing models), I believe you can simply specify a different splitting method.
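In the meantime, a minimal sketch of a workaround is to try the legacy URL first and fall back to a newer location. Note the assumptions here: the `clusters-by-entity-{identity}.txt` filename is inferred from the RCSB announcement about the MMseqs2-based replacement clusters, not taken from the DeeplyTough code, and `cluster_file_urls` / `fetch_cluster_file` are hypothetical helper names, not DeeplyTough's actual `_download_cluster_sets`:

```python
from urllib import error, request


def cluster_file_urls(identity=30):
    """Candidate locations for the RCSB sequence-cluster file.

    The legacy BLASTClust file (bc-30.out) now returns 404; the second
    URL is the assumed location of the MMseqs2-based replacement.
    """
    base = 'https://cdn.rcsb.org/resources/sequence/clusters'
    return [
        f'{base}/bc-{identity}.out',                 # legacy (now 404)
        f'{base}/clusters-by-entity-{identity}.txt', # assumed replacement
    ]


def fetch_cluster_file(cluster_file_path, identity=30):
    """Try each candidate URL in turn; keep the first that downloads."""
    for url in cluster_file_urls(identity):
        try:
            request.urlretrieve(url, cluster_file_path)
            return url
        except error.HTTPError:
            continue  # try the next candidate
    raise RuntimeError('No RCSB cluster file URL is reachable')
```

Be aware that the replacement clusters are computed with a different method (MMseqs2 instead of BLASTClust), so even if the download succeeds, the resulting splits may not match those used in the paper, and any cached `pdbcode_mappings.pickle` files should be removed first, as the warning in the log says.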

JoshuaMeyers avatar Sep 24 '22 13:09 JoshuaMeyers