jsonsl
jsonsl copied to clipboard
Embeddable, Fast, Streaming, Non-Buffering JSON Parser
=begin html

=end html
=head1 JSONSL
JSON Stateful (or Simple, or Stacked, or Searchable, or Streaming) Lexer
=head1 Why another (and yet another) JSON lexer?
I took inspiration from some of the uses of I<YAJL>, which looked quite nice, but whose build system seemed unusable, source horribly mangled, and grown beyond its original design. In other words, I saw it as a bunch of cruft.
Instead of bothering to spend a few days figuring out how to use it, I came to a conclusion that the tasks I needed (simple token notifications coupled with some kind of state shift detection), I could do with a simple, small, ANSI C embeddable source file.
I am still not sure if I<YAJL> provides the featureset of I<JSONSL>, but
I'm guessing I've got at least I
I<JSONSL>
Inspiration was also taken from Joyent's B
Here's a quick featureset
=over
=item Stateful
Maintains state about current descent/recursion/nesting level Furthermore, you can access information about 'lower' stacks as long as they are activ.
=item Decoupling Object Graph from Data
JSONSL abstracts the object graph from the actual (and usually more CPU-intensive) work of actually populating higher level structures such as "hashes" and "arrays" with "decoded" and "meaningful" values. Using this, one can implement an on-demand type of conversion.
=item Callback oriented, selectively
Invokes callbacks for all sorts of events, but you can control which kind of events you are interested in receiving without writing a ton of wrapper stubs
=item Non-Buffering
This doesn't buffer, copy, or allocate any data. The only allocation overhead is during the initialization of the parser, in which the initial stack structures are initialized
=item Simple
Just a C source file, and a corresponding header file. ANSI C.
While attempts will be made to add functionality and reduce boilerplate in your code, the core functions are simple and clearly defined.
Add-ons (see below) are available (and exist in the same jsonsl.c file)
=item JSONPointer search add-on
Use L<JSONPointer|http://tools.ietf.org/html/draft-pbryan-zyp-json-pointer-02> to query JSON streams as they arrive. Quite efficient, and very simple (see jpr_test.c for examples)
=item Unescaping utility add-on
Includes a nice little function which can flexibly unescape JSON strings to match your specifications.
=back
The rest of this documentation needs work
=head1 Details
=head2 Terminology
Because the JSON spec is quite confusing in its terminology, especially when we want to map it to a different model, here is a listing of the terminology used here.
I will use I
I will use the term I
I will use the term I for those things which look like C<["hello", "byebye"]>,
and their contents as I
or I
=head2 Model
=head3 States
A state represents a JSON element, this can be a a hash (C<T_OBJECT>), array (C<T_LIST>), hash key (C<T_HKEY>), string (C<T_STRING>), or a 'special' value (C<T_SPECIAL>) which should be either a numeric value, or one of C<true, false, null>.
A state comprises and maintains the following information
=over
=item Type
This merely states what type it is - as one of the C<JSONSL_T_*> constants mentioned above
=item Positioning
This contains positioning information mapping the location of the element
as an offset relative to the input stream. When a state begins, its I
=item Extended Information
For non-scalar state types, information regarding the number of children contained is stored.
=item User Data
This is a simple void* pointer, and allows you to associate your own data with a given state
=back
=head3 Stack
A stack consists of multiple states. When a state begins, it is I
When a state is popped, the contained information regarding positioning and children is complete, and it is therefore possible to retrieve the entire element in its byte-stream.
Once a state has been popped, it is considered invalid (though it is still valid during the callback).
Below is a diagram of a sample JSON stream annotated with stack/state information.
Level 0 {
Level 1
Level 2
"ABC"
:
Level 2
"XYZ"
,
Level 1
[
Level 2
{
Level 3
Level 4
"Foo":"Bar"
Level 3
}
Level 2
]
Level 1
}
=head1 USING
The header file C<jsonsl.h> contains the API. Read it.
As an additional note, you can 'extend' the state structure (thereby eliminating the need to allocate extra pointers for the C<void *data> field) by defining the C<JSONSL_STATE_USER_FIELDS> macro to expand to additonal struct fields.
This is assumed as the default behavior - and should work when you compile your project with C<jsonsl.c> directly.
If you wish to use the 'generic' mode, make sure to C<#define> or C<-D> the C<JSONSL_STATE_GENERIC> macro.
Some notes regarding usage will follow:
=head2 Position and Offset Tracking
The state object contains some C
=over
=item Buffer the entire stream (simpler, but not recommended)
This way, the offsets declared in the C<< jsn->pos >> and C<pos_begin> variables can be directly applied as offsets to the actual buffer.
Of course this is not the recommended option; since C
=item Note the first valid position in the existing buffer
This technique requires the user to keep track of the first valid position within the current buffer. This is useful for tracking the beginnings and ends of strings.
Typically you will need a simple function or macro and some variables which do the following:
=over
=item *
Contain the minimum valid position in the buffer, e.g. C<min_available>
This is initially set to 0, and increases as we discard data (see later)
=item *
Allow callbacks to request an advancement of the position. This means
that your context object contains a "min_needed" variable. For example,
one might have a C<PUSH> callback for the beginning of a string. The push
callback will set the C<min_needed> variable to the position
of the beginning of the string (i.e. C<< state->pos_begin >>). In a corresponding
C<POP> callback, the string is read from an internal buffer (whose first valid
position is no greater than C<< state->pos_begin >>) with a length of
C<< jsn->pos - state->pos_begin >> bytes. Once the string is read, it is
no longer needed, and the callback then updates the C<min_needed> variable
to the parser's C
=item *
After C<jsonsl_feed> is called, determine if the input buffer needs to be adjusted. This means to determine whether the C<min_needed> variable has been set to something larger than the C<min_available> variable. If this condition is true, it means part of the buffer can be discarded. The amount of bytes to discard from the beginning will be the difference between these two variables. The length of the buffer also becomes shorted by the difference.
Once the bytes are discarded (one can use a simple C
=item *
To demonstrate this, let's make a sample structure:
struct parse_context {
size_t min_needed;
size_t min_available;
char *buffer;
size_t buffer_len;
}
The C
It is possible to write a simple function which will get a slice of the buffer,
given the absolute offsets from the C
void get_state_buffer(struct parse_context *ctx, struct jsonsl_state_st *state)
{
size_t offset = state->pos_begin - ctx->min_available;
return ctx->buffer + offset;
}
Of course this function would probably like to do some error checking to ensure
that for example, the C
=back
=back
=head2 Notes on String States
It is possible to get the I
The logic may be encapsulated in a macro
#define NORMALIZE_OFFSETS(buf, len) (buf)++; (len)++;
/* use it */
char *buf = get_state_buffer(ctx, state);
size_t len = jsn->pos - state->pos_begin;
NORMALIZE_OFFSETS(buf, len);
Note that care should be taken not to perform this on I
=head2 Notes on I
The C
Interally it builds a graph based on inputs from each item in the JSON tree; relying on the fact that an item will only be a C<MATCH_POSSIBLE> or C<MATCH_COMPLETE> if its parent was also a C<MATCH_POSSIBLE>.
Thus the C
In general, items need their I
=over
=item Object (dictionary) values are passed with their keys
This means you must buffer the keys
=item Array elements are passed with their indices
C<jsonsl_jpr_match_state> does this for you automatically
=item Primitives without any children are not passed
C
=back
=head2 UNICODE
While JSONSL does not support unicode directly (it does not
decode \uxxx escapes, nor does it care about any non-ascii
characters), you can compile JSONSL using the C<JSONSL_USE_WCHAR>
macro. This will make jsonsl iterate over C<wchar_t> characters
instead of the good 'ole C
=head2 NaN, Infinity, -Infinity
By default, JSONSL does not consider objects like C<{"n": NaN}>,
C<{"n": Infinity}>, or C<{"n": -Infinity}> to be valid.
Compile with C<JSONSL_PARSE_NAN> defined to parse these non-numbers.
JSONSL will then execute your POP callback with C<state-E
=head2 WINDOWS
JSONSL Now has a visual studio C<.sln> and C<.vcxproj> files in the
C
If you wish to use JSONSL as a DLL, be sure to define the macro C<JSONSL_DLL> which will properly decorate the prototypes with C<__declspec(dllexport)>.
You can also run the tests on windows using the C
=head2 API AND ABI STABILITY, AND PACKAGING NOTES
I<JSONSL> will attempt to maintain a stable API, but not a stable ABI.
The general distribution model of JSONSL (A single source file and a single header) is designed in such a manner to allow the application to I
For speed benefits it may also be desirable to actually have the C<jsonsl.c>
file embedded in the same translation unit as the user-side code calling into
I<JSONSL> itself. In such use cases, the C<JSONSL_API> macro may be defined as
C
=head1 AUTHOR AND COPYRIGHT
Copyright (C) 2012-2017 Mark Nunberg.
See C<LICENSE> for license information.