sel icon indicating copy to clipboard operation
sel copied to clipboard

Unable to parse sqlite3.c

Open eschulte opened this issue 3 years ago • 9 comments

Unable to parse the single-file sqlite3.c "amalgamation." The file is attached sqlite3.c.txt. The result is a single error AST.

SEL/SW/C-CPP> (defparameter sqlite (from-file (make-instance 'c) "~/Projects/sqlite-bld/sqlite3.c"))
SQLITE
SEL/SW/C-CPP> (describe (@ sqlite 0))
#<C-ERROR 1217039 :TEXT "/*******************...">
  [standard-object]

Slots with :CLASS allocation:
  CHILD-SLOTS                    = ((CHILDREN . 0))
  CHILD-SLOT-SPECIFIERS          = (#<FUNCTIONAL-TREES::SLOT-SPECIFIER CHILDREN 0>)
Slots with :INSTANCE allocation:
  STORED-HASH                    = NIL
  OID                            = 1219463
  DESCENDANT-MAP                 = #<unbound slot>
  SERIAL-NUMBER                  = 1217039
  SIZE                           = #<unbound slot>
  ANNOTATIONS                    = NIL
  BEFORE-TEXT                    = ""
  TEXT                           = "/********************************************************************..
  AFTER-TEXT                     = ""
  BEFORE-ASTS                    = NIL
  AFTER-ASTS                     = NIL
  ORDERED-CHILDREN               = NIL
  INDENT-CHILDREN                = NIL
  INDENT-ADJUSTMENT              = NIL
  CHILDREN                       = NIL
; No value

eschulte avatar Aug 22 '22 00:08 eschulte

A shorthand was added for from-file a little while back, and it can be called with the symbol:

(from-file 'c "~/path/to/file.c")

https://tree-sitter.github.io/tree-sitter/playground can be used to determine if this is tree-sitter generating the error. Similarly, calling cl-tree-sitter:parse-string can be used for the same effect. In this case, it is producing the error, so this is expected.

This also appears to be generated with an older version of SEL since the root node should be an error-variation-point:

SEL/SW/TS> (from-file 'c "~/Downloads/sqlite3.c.txt")
#<C ~/Downloads/sqlite3.c.txt>
SEL/SW/TS> (genome *)
#<C-ERROR-VARIATION-POINT 8 :TEXT "/*******************...">

Switching between which variation is used on the variation point can be done with sel/sw/ts:*use-variation-point-tree*.

berchn avatar Aug 22 '22 15:08 berchn

This file takes a long time to parse, so I'm going to open up an internal issue for profiling it when someone gets a chance.

berchn avatar Aug 22 '22 15:08 berchn

I identified an O(N^2) algorithm in cl-tree-sitter that slows parsing of files with very long child lists, and submitted a pull request to death/cl-tree-sitter to fix it.

https://github.com/death/cl-tree-sitter/pull/6

pfdietz avatar Aug 25 '22 14:08 pfdietz

Awesome.

  1. Thanks for the from-file shorthand, I wasn't aware.
  2. In general some of these options could use more easy discoverability. In particular I'm thinking of *use-variation-point-tree* and with-attr-table (which I ran into yesterday). I don't have any good suggestions here 😄, just thought I'd note.
  3. Also, appreciate looking into the parse time! I'm going to move on to a smaller program for my near-term work.

eschulte avatar Sep 03 '22 19:09 eschulte

I have also identified what is likely causing the slow parsing of sqlite3.c. It's due to repeated concatenate on octet vectors when handling error nodes. In sqlite3.c, almost the entire file ends up in an error node, so N is very large (the file has >200K lines.) Nathaniel is looking at this (it should be possible to get it down to linear time.)

I will be adding some scalability tests to sel soon. Two of these illustrate the problem.

Having said that, having almost all of the file in an error node is not very useful. This may be more of a tree-sitter issue.

pfdietz avatar Sep 03 '22 23:09 pfdietz

The problem, aside from #if 0 around extern "C" { and }, is in the file is in the function winRead. In particular, there is the following code:

#if SQLITE_OS_WINCE || defined(SQLITE_WIN32_NO_OVERLAPPED)
  if( winSeekFile(pFile, offset) ){
    OSTRACE(("READ pid=%lu, pFile=%p, file=%p, rc=SQLITE_FULL\n",
             osGetCurrentProcessId(), pFile, pFile->h));
    return SQLITE_FULL;
  }
  while( !osReadFile(pFile->h, pBuf, amt, &nRead, 0) ){
#else
  memset(&overlapped, 0, sizeof(OVERLAPPED));
  overlapped.Offset = (LONG)(offset & 0xffffffff);
  overlapped.OffsetHigh = (LONG)((offset>>32) & 0x7fffffff);
  while( !osReadFile(pFile->h, pBuf, amt, &nRead, &overlapped) &&
         osGetLastError()!=ERROR_HANDLE_EOF ){
#endif

Notice the unmatched open braces. Tree-sitter will fail on this.

I think someone could write a program that expanded the scope of ifdef/else blocks by pulling in following code (duplicating it) until all the braces, parents, etc. are balanced.

pfdietz avatar Sep 05 '22 01:09 pfdietz

2. In general some of these options could use more easy discoverability.  In particular I'm thinking of `*use-variation-point-tree*` and `with-attr-table` (which I ran into yesterday).  I don't have any good suggestions here smile, just thought I'd note.

I've opened an issue to add some documentation on variation points as there isn't anything in the manual at the moment. As a side note in regards to discoverability and so that you're aware of it, https://github.com/GrammaTech/sel/blob/master/components/configuration.lisp was recently added to help configure SEL parameters.

berchn avatar Sep 06 '22 14:09 berchn

scalability tests have been added to sel. They are not "actual" tests, but functions you can invoke inside time or sb-sprof:with-profiling forms to see where time is going in parsing on large variable-sized C programs. See test/scalability.lisp

pfdietz avatar Sep 07 '22 14:09 pfdietz

The configuration.lisp support seems handy, very nice.

W.r.t. the errors, I think special bespoke handling for very common problem structures like #if 0 makes a ton of sense, and it sounds like that might be sufficient to get most (if not all) of sqlite.c to parse.

eschulte avatar Sep 10 '22 21:09 eschulte