
perf: read_ply replace pandas.read_csv engine=python with c; improve read_off header-parsing robustness

Open YodaEmbedding opened this issue 1 year ago • 1 comments

UPDATE: I have rebased this PR on top of the latest commit. The revised changes are:

  • perf: Speed up reading of ASCII PLY files.
  • feat: improve robustness for OFF headers on e.g. ModelNet40
  • perf: reuse already open file for reading instead of opening it twice
  • style: renamed variables for clarity (e.g. color -> has_color; and count -> n_header)

In particular, ModelNet40 has faulty headers:

$ head -n 1 ModelNet40/chair/train/chair_0856.off
OFF6586 5534 0

For reference, the correct format is:

OFF
6586 5534 0

Nonetheless, it is still valuable to parse the faulty header.
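
A minimal sketch of how a tolerant header parser could handle both layouts. The function name `read_off_header` and its return shape are hypothetical and are not taken from the PR's actual code. The sketch takes an already open text file and leaves it positioned at the first data line, so the body can be read from the same handle without reopening the file.

```python
import io


def read_off_header(f):
    """Parse an OFF/COFF header from an open text file (hypothetical helper).

    Accepts the well-formed layout ("OFF" on one line, counts on a later
    line) and the fused ModelNet40 layout ("OFF6586 5534 0").
    Returns (has_color, n_points, n_faces, n_edges).
    """
    line = f.readline().strip()
    if line.startswith("COFF"):
        has_color, rest = True, line[4:]
    elif line.startswith("OFF"):
        has_color, rest = False, line[3:]
    else:
        raise ValueError(f"not an OFF file: {line!r}")

    rest = rest.strip()
    # Well-formed case: the counts sit on a later line, so skip blank
    # lines and '#' comments until something else shows up.
    while not rest:
        line = f.readline()
        if not line:
            raise ValueError("unexpected end of file in OFF header")
        line = line.strip()
        rest = "" if line.startswith("#") else line

    counts = [int(token) for token in rest.split()]
    if len(counts) != 3:
        raise ValueError(f"expected 3 counts in OFF header, got {rest!r}")
    n_points, n_faces, n_edges = counts
    return has_color, n_points, n_faces, n_edges


# Both layouts yield the same result.
print(read_off_header(io.StringIO("OFF6586 5534 0\n")))
print(read_off_header(io.StringIO("OFF\n6586 5534 0\n")))
```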


(Original text before #353 was merged)

Big performance improvement: the sliced portion of the file is now read from an in-memory StringIO buffer, which removes the need for the slow engine="python" in pandas.read_csv.
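
The idea can be sketched as follows, assuming whitespace-separated rows. The helper name `read_rows` and the column names are illustrative only and are not the PR's actual code. Because the wanted lines are copied into a buffer first, the parser only ever sees the slice it is supposed to read.

```python
import io
from itertools import islice

import pandas as pd


def read_rows(f, n_rows, names):
    """Read the next n_rows lines of an open text file into a DataFrame.

    Hypothetical helper: the lines are copied into an in-memory buffer,
    which is then parsed with the C engine.
    """
    buffer = io.StringIO("".join(islice(f, n_rows)))
    return pd.read_csv(buffer, sep=r"\s+", header=None, names=names, engine="c")


# Three point rows followed by one face row; only the points are parsed here.
f = io.StringIO("0 0 0\n1 0 0\n0 1 0\n3 0 1 2\n")
points = read_rows(f, 3, ["x", "y", "z"])
print(points)
```

The file handle is left positioned at the first unread line, so a second call can continue with the next block of rows.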

Also fixes a bug where, for OFF files containing more lines than num_points + num_faces, the reader tried to parse the extra lines (potential edges) as faces!

As Wikipedia says, the OFF file may contain:

  • points
  • faces (optional)
  • edges (optional)

Of course, this still does not encompass all possible OFF file variants described by Wikipedia, but it's an improvement.
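
A rough illustration of the fix, assuming the counts have already been taken from the header. The helper name `split_off_body` is hypothetical. Reading exactly n_points and then exactly n_faces lines keeps anything that follows out of the face data.

```python
import io
from itertools import islice


def split_off_body(f, n_points, n_faces):
    """Split the body of an OFF file into its sections (hypothetical helper).

    Exactly n_points point lines and n_faces face lines are taken; whatever
    follows (e.g. edges) is returned separately and never treated as faces.
    """
    point_lines = list(islice(f, n_points))
    face_lines = list(islice(f, n_faces))
    remaining_lines = list(f)
    return point_lines, face_lines, remaining_lines


body = io.StringIO("0 0 0\n1 0 0\n0 1 0\n3 0 1 2\n0 1\n1 2\n")
points, faces, rest = split_off_body(body, n_points=3, n_faces=1)
print(len(points), len(faces), len(rest))
```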

YodaEmbedding, Aug 08 '23 11:08

Both this PR and #353 improved pandas performance for *.OFF files with engine=c. Therefore, I rebased this PR on top of #353. This PR still contains some other useful changes, listed above.


Future work:

Once this is reviewed/accepted, I can look into improving compatibility with Wikipedia's description of the *.OFF file format. Of course, perfect compatibility would be too slow, but there are still some missing features:

  • "C" in the header should not be needed to detect the presence of color (see Wikipedia's example).
  • Edges, and edge colors.
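
One possible way to detect color without relying on the header, sketched under the assumption that a face line has the layout "n i1 ... in [color components]", with any tokens after the n vertex indices interpreted as a color. The function `parse_face_line` is hypothetical and is not part of pyntcloud.

```python
def parse_face_line(line):
    """Split one OFF face line into vertex indices and an optional color.

    Assumed layout: "n i1 ... in [color components] [# comment]".
    Returns (indices, color), where color is None if no extra tokens exist.
    """
    tokens = line.split("#", 1)[0].split()  # drop a trailing comment
    n = int(tokens[0])
    indices = [int(token) for token in tokens[1:1 + n]]
    extra = tokens[1 + n:]
    color = tuple(float(token) for token in extra) if extra else None
    return indices, color


print(parse_face_line("4  0 1 2 3  255 0 0 #red"))
print(parse_face_line("3 0 1 2"))
```

Checking every face line this way is per-line Python work, so it would sit outside the fast pandas path; a real implementation might inspect only the first face line to decide how many columns to expect.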

YodaEmbedding, Dec 24 '23 12:12