Inference failed in data_pipeline.py: ValueError: setting an array element with a sequence.

Open longerzone opened this issue 2 years ago • 7 comments

Traceback (most recent call last):
  File "run_pretrained_openfold.py", line 257, in <module>
    main(args)
  File "run_pretrained_openfold.py", line 118, in main
    fasta_path=fasta_path, alignment_dir=local_alignment_dir
  File "/data1/openfold/openfold/data/data_pipeline.py", line 575, in process_fasta
    msa_features = self._process_msa_feats(alignment_dir, input_sequence, _alignment_index)
  File "/data1/openfold/openfold/data/data_pipeline.py", line 539, in _process_msa_feats
    deletion_matrices=deletion_matrices,
  File "/data1/openfold/openfold/data/data_pipeline.py", line 212, in make_msa_features
    features["deletion_matrix_int"] = np.array(deletion_matrix, dtype=np.int32)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.
(openfold_venv) root@GPUhost:/data1/openfold#

longerzone avatar Apr 20 '22 07:04 longerzone

Could you elaborate? For which protein does this happen? How are you running OpenFold?

gahdritz avatar Apr 20 '22 19:04 gahdritz

Sorry, I'm testing with a test FASTA here, and I'm running the inference command from the README.

I added a breakpoint at line 211 of data_pipeline.py, re-ran inference, and inspected the contents of deletion_matrix:

(Pdb) len(deletion_matrix)
6
(Pdb) type(deletion_matrix[0])
<class 'list'>
(Pdb) len(deletion_matrix[0])
32763
(Pdb) len(deletion_matrix[1])
48502
(Pdb) len(deletion_matrix[2])
48502
(Pdb) len(deletion_matrix[3])
48502

The error seems to occur because the sublists have different lengths:

>>> deletion_matrix=[[111, 222, 333], [1, 2, 3], [1, 2, 3]]
>>> np.array(deletion_matrix, dtype=np.int32)
array([[111, 222, 333],
       [  1,   2,   3],
       [  1,   2,   3]], dtype=int32)
>>> deletion_matrix=[[111, 222, 333], [1, 2, 3], [1, 2, 3, 4]]
>>> np.array(deletion_matrix, dtype=np.int32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
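
To find out which rows are the odd ones out before np.array is called, something like the sketch below could help. It is not part of OpenFold; the helper names ragged_rows and to_int32_matrix are made up for illustration, and it assumes deletion_matrix is a plain list of lists, as the pdb session above shows.

```python
from collections import Counter

import numpy as np


def ragged_rows(rows):
    """Return (expected_length, [(index, length), ...]).

    expected_length is the most common row length; the list holds every
    row whose length differs from it. Hypothetical helper, not OpenFold API.
    """
    lengths = [len(row) for row in rows]
    if not lengths:
        return 0, []
    expected, _ = Counter(lengths).most_common(1)[0]
    bad = [(i, n) for i, n in enumerate(lengths) if n != expected]
    return expected, bad


def to_int32_matrix(rows):
    """Convert rows to an int32 array, naming the mismatched rows on failure."""
    expected, bad = ragged_rows(rows)
    if bad:
        raise ValueError(
            f"expected every row to have length {expected}; "
            f"mismatched rows as (index, length): {bad}"
        )
    return np.array(rows, dtype=np.int32)
```

Calling ragged_rows(deletion_matrix) at the breakpoint should point at the row with 32763 entries, which narrows the search to the alignment that produced it.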

longerzone avatar Apr 22 '22 10:04 longerzone

Hi @longerzone, I encountered this problem too, have you solved this problem?

willx-y avatar May 24 '22 15:05 willx-y

Sorry---this one slipped through the cracks. @longerzone that FASTA contains a DNA sequence, not a protein, so I wouldn't expect it to work with OpenFold out of the box. You'll need to make extensive changes to the data processing pipeline to accommodate the different nucleotide types and then retrain the model from scratch.

@willx-y are you also using DNA?
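
A quick way to check whether an input FASTA holds nucleotides rather than amino acids is a letter-frequency heuristic like the sketch below. It is only a rough guess, not something OpenFold provides: the function names, the A/C/G/T/U/N alphabet, and the 0.9 threshold are all assumptions chosen for illustration.

```python
# Letters treated as nucleotide codes (assumption: A/C/G/T/U plus N for unknown).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_nucleotide(sequence, threshold=0.9):
    """Heuristic: True when at least `threshold` of the letters are nucleotide codes."""
    seq = sequence.upper()
    if not seq:
        return False
    hits = sum(ch in NUCLEOTIDE_LETTERS for ch in seq)
    return hits / len(seq) >= threshold
```

Since A, C, G, and T are also valid amino acid codes, a short protein made up mostly of those residues would be misflagged, so treat a positive result as a prompt to look at the file rather than as proof.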

gahdritz avatar May 24 '22 19:05 gahdritz

I'm not sure if this is the same issue that @willx-y has. But, I got this same error message when running inference on a protein FASTA with the --use_precomputed_alignments option enabled. It turns out that my MSAs (which I'd gotten from the colabfold mmseqs2 server) unexpectedly had a null byte at the end of each .a3m file. When I removed those null bytes, it fixed the error!
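
For anyone who wants to check their own precomputed alignments for the same problem, here is a minimal sketch. The function names are hypothetical, and it assumes a NUL byte is never valid anywhere in an .a3m file, so it removes all of them rather than only a trailing one. It rewrites files in place, so keep a backup.

```python
from pathlib import Path


def strip_null_bytes(path):
    """Remove NUL bytes from a file in place; return how many were removed."""
    path = Path(path)
    data = path.read_bytes()
    cleaned = data.replace(b"\x00", b"")
    removed = len(data) - len(cleaned)
    if removed:
        # Only rewrite files that actually contained NUL bytes.
        path.write_bytes(cleaned)
    return removed


def clean_alignment_dir(alignment_dir):
    """Strip NUL bytes from every .a3m file under alignment_dir.

    Returns a dict mapping each file path to the number of bytes removed.
    """
    files = sorted(Path(alignment_dir).rglob("*.a3m"))
    return {str(p): strip_null_bytes(p) for p in files}
```

A nonzero count for any file in the returned dict means that file had the stray bytes described above.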

calebthomas259 avatar Aug 30 '22 13:08 calebthomas259

I'm getting this same error for a variety of different protein sequences. I'm not using the --use_precomputed_alignments option. Any ideas on why this is happening? The sequences I'm getting these errors on have run through AlphaFold without any errors.

rrw1007 avatar Jan 31 '23 20:01 rrw1007

Is it every protein sequence, or just some?

gahdritz avatar Feb 06 '23 17:02 gahdritz