GearNet
Problems using the FC dataset
Hello! Thank you for your great work on TorchDrug, GearNet, and ESM-GearNet! Sorry to bother you. I'm trying to extract feature embeddings with GearNet (as discussed in several earlier issues) on the EC, GO, and FC datasets (as provided at https://zenodo.org/records/7593591). Unlike EC and GO, where proteins are provided in PDB format, the proteins in FC come in HDF5 format, so I use your Fold3D class in GearNet (https://github.com/DeepGraphLearning/GearNet/blob/main/gearnet/dataset.py) to preprocess the data. However, when I pass the resulting Protein objects into the GearNet network following the TorchDrug instructions, I get the following errors when running on GPU:
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
and then
RuntimeError: Error building extension 'torch_ext':
...
... ...site-packages/torchdrug/utils/extension/torch_ext.cpp:1:
/usr/include/features.h:424:12: fatal error: sys/cdefs.h: No such file or directory
424 | # include <sys/cdefs.h>
| ^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
When running on CPU, I get:
NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCPU' backend
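This CPU-side error, at least, is generic PyTorch behavior rather than anything GearNet-specific: `aten::view` has no sparse kernel, so any sparse tensor that reaches a `view` call fails this way. A minimal reproduction in plain PyTorch (my own illustration; it does not pinpoint which tensor inside GearNet is sparse):

```python
import torch

s = torch.eye(3).to_sparse()   # a SparseCPU-backed tensor
try:
    s.view(9)                  # no sparse implementation of aten::view
except RuntimeError as e:      # NotImplementedError subclasses RuntimeError
    print(type(e).__name__)

d = s.to_dense().view(9)       # densifying first sidesteps the error
print(d.shape)                 # torch.Size([9])
```

If a residue-feature tensor coming out of Fold3D stays sparse where the from_pdb path produces dense features, calling `.to_dense()` before feeding it to the model would be the analogous workaround; though, per the reply further down, the failed torch_ext build is the likelier root cause.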
I searched for the cause of these errors online, but it seems I can't solve them because they are environment-related. I'm wondering why I don't hit any of these problems when I use Protein.from_pdb() directly on EC and GO, but do encounter them on FC, where I use your Fold3D class to also obtain data.Protein instances.
For reference, my code is as follows:
...
# graph
graph_construction_model = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                 geometry.KNNEdge(k=10, min_distance=5),
                 geometry.SequentialEdge(max_distance=2)],
    edge_feature="gearnet")
# model
gearnet_edge = models.GearNet(input_dim=21, hidden_dims=[512, 512, 512, 512, 512, 512],
                              num_relation=7, edge_input_dim=59, num_angle_bin=8,
                              batch_norm=True, concat_hidden=True, short_cut=True, readout="sum")
pthfile = 'models/mc_gearnet_edge.pth'
net = torch.load(pthfile, map_location=torch.device(device))
#print('torch succesfully load model')
gearnet_edge.load_state_dict(net)
gearnet_edge.eval()
print('successfully load gearnet')
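As a side note, the num_relation=7 passed to models.GearNet has to agree with the edge layers in the graph-construction model: my reading (an assumption, not stated in the snippet) is that SequentialEdge(max_distance=2) contributes one relation per sequence offset in [-2, 2], while SpatialEdge and KNNEdge contribute one relation each:

```python
# Relation-count arithmetic implied by the edge layers above (my reading):
max_distance = 2                   # SequentialEdge offsets -2, -1, 0, 1, 2
sequential = 2 * max_distance + 1  # 5 sequential relations
spatial = 1                        # SpatialEdge
knn = 1                            # KNNEdge
print(sequential + spatial + knn)  # 7, matching num_relation=7
```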
def get_subdataset_rep(pdbs: list, proteins: list, subroot: str):
    for idx in range(0, len(pdbs), bs):  # reformulate into batches
        pdb_batch = pdbs[idx : min(len(pdbs), idx + bs)]
        protein_batch = proteins[idx : min(len(pdbs), idx + bs)]
        # protein
        _protein = data.Protein.pack(protein_batch)
        _protein.view = "residue"
        print(_protein)
        final_protein = graph_construction_model(_protein)
        print(final_protein)
        # output
        with torch.no_grad():
            output = gearnet_edge(final_protein, final_protein.node_feature.float(), all_loss=None, metric=None)
        print(output['graph_feature'].shape, output['node_feature'].shape)
        counter = 0
        for idx in range(len(final_protein.num_residues)):  # idx: protein/graph id in this batch
            this_graph_feature = output['graph_feature'][idx]
            this_node_feature = output['node_feature'][counter : counter + final_protein.num_residues[idx], :]
            print(this_graph_feature.shape, this_node_feature.shape)
            torch.save((this_graph_feature, this_node_feature),
                       f"{subroot}/{os.path.splitext(pdb_batch[idx])[0].split('/')[-1]}.pt")
            counter += final_protein.num_residues[idx]
        break
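The counter bookkeeping in the loop above is just partitioning the flat, batch-concatenated node_feature matrix back into per-protein chunks according to num_residues. The same logic, sketched with plain lists and itertools.accumulate (split_by_sizes is an illustrative helper, not TorchDrug API):

```python
from itertools import accumulate

def split_by_sizes(flat, sizes):
    """Partition a flat batch back into per-graph chunks of the given sizes."""
    offsets = [0] + list(accumulate(sizes))
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]

# toy stand-in for output['node_feature'] with num_residues = [3, 2, 4]
node_rows = list(range(9))
print(split_by_sizes(node_rows, [3, 2, 4]))  # [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
```

Two small asides on the function itself: Python slices clamp out-of-range stops, so `pdbs[idx : idx + bs]` already does what the `min(len(pdbs), ...)` guard does; and `os.path.splitext(os.path.basename(p))[0]` (or `pathlib.Path(p).stem`) extracts the save-file stem without relying on '/' as the path separator.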
# get representations
if args.task not in ['FC', 'fc']:
    for root in roots:
        pdbs = [os.path.join(root, i) for i in os.listdir(root)]
        proteins = []
        for pdb_file in pdbs:
            try:
                protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
                protein.view = "residue"
                proteins.append(protein)
            except:
                error_fn = os.path.basename(root) + '_' if args.task in ['EC', 'ec', 'GO', 'go'] else ''
                with open(f"{error_path}/{args.task}_{error_fn}error.txt", "a") as f:
                    f.write(os.path.splitext(pdb_file)[0].split('/')[-1] + '\n')
            if len(proteins) == bs:  # for debug
                break
        subroot = os.path.join(output_dir, root.split('/')[-1]) if args.task in ['EC', 'ec', 'GO', 'go'] else output_dir
        get_subdataset_rep(pdbs, proteins, subroot)
        break
else:
    transform = transforms.Compose([transforms.ProteinView(view='residue')])
    dataset = Fold3D(root, transform=transform)  # , atom_feature=None, bond_feature=None
    split_sets = dataset.split()  # train_set, valid_set, test_fold_set
    print('There are', len(split_sets), 'sets in total.')
    for split_set in split_sets:
        print(split_set.indices)
        this_slice = slice(list(split_set.indices)[0], list(split_set.indices)[-1] + 1)
        this_pdbs, this_datas = dataset.pdb_files[this_slice], dataset.data[this_slice]
        #for fn, protein in zip(this_pdbs, this_datas):
        #    print(fn, protein)
        #    break
        get_subdataset_rep(this_pdbs, this_datas, os.path.join(output_dir, this_pdbs[0].split('/')[0]))
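One caveat in the FC branch, unrelated to the errors above: building `slice(indices[0], indices[-1] + 1)` from `split_set.indices` silently assumes the subset's indices form a contiguous, sorted run; if a split ever interleaves indices, the slice grabs the wrong proteins. A cheap guard in plain Python (`contiguous_slice` is an illustrative helper; `Subset.indices` is the standard torch.utils.data attribute):

```python
def contiguous_slice(indices):
    """Turn a run of indices into a slice, refusing non-contiguous input."""
    idx = list(indices)
    if idx != list(range(idx[0], idx[-1] + 1)):
        raise ValueError("split indices are not a contiguous run")
    return slice(idx[0], idx[-1] + 1)

print(contiguous_slice([5, 6, 7, 8]))  # slice(5, 9, None)
```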
Is there any way to solve this, or is my understanding of TorchDrug wrong? Sincerely looking forward to your help. Thank you very much!
Hi, I don't think this is a dataset-specific problem. It seems that you failed to build the torch_ext extension in TorchDrug. Could you check this?
Hi! Thank you for your reply! I checked my torch extension based on https://github.com/DeepGraphLearning/torchdrug/issues/8 and https://github.com/DeepGraphLearning/torchdrug/issues/238. I'm sure that my torch_ext.cpp lies correctly under torchdrug/utils/extension, and I tried deleting the torch_extensions folder under /home/your_user_name/.cache, but it doesn't work.
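In case it helps with ruling out a stale cache: besides deleting the folder by hand, PyTorch's documented TORCH_EXTENSIONS_DIR environment variable redirects the whole JIT-extension cache, so the next import of torchdrug has to rebuild torch_ext from scratch (a diagnostic sketch, not a guaranteed fix):

```shell
# Point PyTorch's cpp_extension cache at a fresh directory so a stale or
# corrupt cached build of 'torch_ext' cannot be reused.
export TORCH_EXTENSIONS_DIR="$HOME/.cache/torch_extensions_fresh"
mkdir -p "$TORCH_EXTENSIONS_DIR"
echo "extensions dir: $TORCH_EXTENSIONS_DIR"
```

If the rebuild then fails with the same sys/cdefs.h error, the problem is the compiler toolchain rather than the cache.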