LPeg-Parsers

Parsing common data formats via LPeg

The code herein contains LPeg [1] routines for parsing some common data formats. The current formats are:

abnf

The core ruleset from RFC-5234.  These rules are used often in RFCs.

ascii ascii.char ascii.control ascii.ctrl

Match a single ASCII character.  The top level module will match
both graphical characters and control characters.  The "ascii.char"
module only matches the graphical characters; the "ascii.control"
module only matches the control codes.  The "ascii.ctrl" module will
return the name of a control character, or nil if not a control character.

iso iso.char iso.control iso.ctrl

Match a single ISO character.  The top level module will match
graphical characters, control characters and control sequences.  The
"iso.char" module only matches the ISO graphical characters; the
"iso.control" module only matches control characters (for example,
<ESC>E or \133).  The "iso.ctrl" module will return the name of the
control character, plus any associated data as appropriate.  For example:

	<ESC>[32;40m

will return a name of "SGR" and a two element array containing 32
and 40.

NOTE:  These modules only deal with the ISO defined characters, and
will NOT match those defined by ASCII.  To match a graphical character
from either the ASCII or the ISO set:

	char = require "org.conman.parsers.ascii.char"
	     + require "org.conman.parsers.iso.char"

email

Parses email headers as defined in:

	RFC-0822	Standard for the Format of ARPA Internet Text Messages
	RFC-1036	Standard for Interchange of USENET Messages
	RFC-2045	Multipurpose Internet Mail Extensions I
	RFC-2046	Multipurpose Internet Mail Extensions II
	RFC-2047	Multipurpose Internet Mail Extensions III
	RFC-2048	Multipurpose Internet Mail Extensions IV
	RFC-2369	The Use of URLs as Meta-Syntax for Core Mail 
			List Commands and their Transport through 
			Message Header Fields
	RFC-2822	Internet Message Format	
	RFC-2919	A Structured Field and Namespace for the Identification of Mailing Lists
	RFC-5064	The Archived-At Message Header Field
	RFC-5322	Internet Message Format

Headers are returned in a Lua table.  For example, the following
headers:

	Return-Path: <[email protected]>
	Received: from brevard.conman.org (brevard.conman.org 
		[66.252.224.242])
		by mail.example.com (Postfix) 
		with ESMTP id 538562EA5D07
		for <[email protected]>; 
		Fri, 28 Dec 2012 21:40:11 -0500
	From: Sean Conner <[email protected]>
	To: Sherlock Holmes <[email protected]>,
		the-scooby-gang: Fred <[email protected]>,
			Daphne <[email protected]>,
			Velma <[email protected]>,
			Shaggy <[email protected]>,
			Scobby-Doo <[email protected]>;,
		The Batman <[email protected]>
	Subject: I know who did it!
	Date: Fri, 28 Dec 2012 21:40:11 -0500
	Message-ID: <[email protected]>

will return the following Lua table:

	{
	  received =
	  {
	    [1] =
	    {
	      with = "ESMTP",
	      from = "brevard.conman.org",
	      id = "538562EA5D07",
	      when =
	      {
	        min = 0.000000,
	        zone = -18000.000000,
	        day = 28.000000,
	        month = 12.000000,
	        year = 2012.000000,
	        sec = 1.000000,
	        hour = 1.000000,
	        weekday = "Fri",
	      },
	      for =
	      {
	        address = "[email protected]",
	      },
	      by = "mail.example.com",
	    },
	  },
	  to =
	  {
	    [1] =
	    {
	      name = "Sherlock Holmes",
	      address = "[email protected]",
	    },
	    [2] =
	    {
	      ['the-scooby-gang'] =
	      {
	        [1] =
	        {
	          name = "Fred",
	          address = "[email protected]",
	        },
	        [2] =
	        {
	          name = "Daphne",
	          address = "[email protected]",
	        },
	        [3] =
	        {
	          name = "Velma",
	          address = "[email protected]",
	        },
	        [4] =
	        {
	          name = "Shaggy",
	          address = "[email protected]",
	        },
	        [5] =
	        {
	          name = "Scobby-Doo",
	          address = "[email protected]",
	        },
	      },
	    },
	    [3] =
	    {
	      name = "The Batman",
	      address = "[email protected]",
	    },
	  },
	  from =
	  {
	    [1] =
	    {
	      name = "Sean Conner",
	      address = "[email protected]",
	    },
	  },
	  date =
	  {
	    min = 0.000000,
	    zone = -18000.000000,
	    day = 28.000000,
	    month = 12.000000,
	    year = 2012.000000,
	    sec = 1.000000,
	    hour = 1.000000,
	    weekday = "Fri",
	  },
	  return_path =
	  {
	    [1] =
	    {
	      address = "[email protected]",
	    },
	  },
	  message_id = "[email protected]",
	  subject = "I know who did it!",
	}

The only fields not supported are the Resent-* fields; they are
rarely used and their semantics are particularly hard to support via
parsing alone.  These fields, as well as any other fields not
otherwise understood or parsable, will end up in a field called
'generic', keyed by the raw header name.
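
Because a group in an address list shows up as a nested table keyed by
the group name, code that wants a flat list of recipients has to walk
the result.  The following is only a sketch, assuming the table shape
shown above; flatten_addresses() is a hypothetical helper (not part of
the email module) and the addresses are placeholders:

```lua
-- Hypothetical helper, not part of the email module: flatten an address
-- list (such as the 'to' field above) into a plain array of entries.
local function flatten_addresses(list, out)
  out = out or {}
  for _, entry in ipairs(list) do
    if entry.address then
      out[#out + 1] = entry
    else
      -- assumed to be a group: { ['group-name'] = { entries ... } }
      for _, members in pairs(entry) do
        flatten_addresses(members, out)
      end
    end
  end
  return out
end

-- Hand-written sample in the shape shown above (placeholder addresses).
local to = {
  { name = "Sherlock Holmes", address = "holmes@example.com" },
  { ['the-scooby-gang'] = {
      { name = "Fred",  address = "fred@example.com"  },
      { name = "Velma", address = "velma@example.com" },
  } },
  { name = "The Batman", address = "batman@example.com" },
}

local flat = flatten_addresses(to)
for _, entry in ipairs(flat) do
  print(entry.name, entry.address)
end
```

The group name itself is discarded here; keep it if the distinction
between individual and group recipients matters.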

json

Implements a JSON parser.  It requires some additional modules [2]
to run.  This will parse a JSON file into a Lua table.  The full
grammar is supported, but it expects the input to be valid UTF-8.

A JSON null value will be converted to nil.  If you don't want this
behavior, define a global variable called "null" to be the value
you want for a JSON null.

jsons

Another implementation of a JSON parser.  This one "streams" the
input, meaning it will handle large JSON files that the json module
can't, and it is a drop-in replacement.  You can also pass in a
function that will return more data so you can actually "stream" data
into the parser.

ip

Provides two LPeg patterns, IPv4 and IPv6, which parse and convert
such addresses directly into their network-order binary formats.

ip-text

Provides two LPeg patterns, IPv4 and IPv6, which parse and return such
addresses as text, unlike the ip module above.

ini

Provides an INI file parser that returns a Lua table from an INI
file.  A sample INI file such as:

	; we allow "default" values

	default = ok

	[section1]

	var1 = foo
	var2 = 12,23,34,54,44
	VAR3 = "var3=foo",33,44,55
	var2 = apple
	Var4 = 55

	[section2]
		# another comment
		; and so is this one
	
		var1 = foo bar baz ; this is a comment
		var2 = "foo bar baz ; this is not a comment"

	[section1]

		var4=this is a test
		var5= this is also a test
		var2 = pear
		var3 = 88,99


will result in a Lua table of:

	{
	  section1 =
	  {
	    var1 = "foo",
	    var5 = "this is also a test",
	    var4 =
	    {
	      [1] = "55",
	      [2] = "this is a test",
	    },
	    var3 =
	    {
	      [1] = "var3=foo",
	      [2] = "33",
	      [3] = "44",
	      [4] = "55",
	      [5] = "88",
	      [6] = "99",
	    },
	    var2 =
	    {
	      [1] = "12",
	      [2] = "23",
	      [3] = "34",
	      [4] = "54",
	      [5] = "44",
	      [6] = "apple",
	      [7] = "pear",
	    },
	  },
	  default = "ok",
	  section2 =
	  {
	    var1 = "foo bar baz ",
	    var2 = "foo bar baz ; this is not a comment",
	  },
	}
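
As the result above shows, a key may hold either a single string or an
array of strings, depending on how the value was written and how often
the key appeared.  Code reading the table can normalize this first.  A
minimal sketch; as_list() is a hypothetical helper, not something the
ini module provides:

```lua
-- Hypothetical helper: always hand back an array, whether the parsed
-- value is missing, a single string, or already an array of strings.
local function as_list(value)
  if value == nil then
    return {}
  elseif type(value) == "table" then
    return value
  else
    return { value }
  end
end

-- Hand-written excerpt of the result table shown above.
local section1 = {
  var1 = "foo",
  var3 = { "var3=foo", "33", "44", "55", "88", "99" },
}

print(#as_list(section1.var1))     -- 1
print(#as_list(section1.var3))     -- 6
print(#as_list(section1.missing))  -- 0
```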

strftime

Parses the format string for strftime() (or os.date() for Lua) and
returns an LPeg expression that can parse that format, with the
exceptions of "%c", "%x" and "%X" (all system specific formats).  For
example, the format "%A, %d %B %Y @ %H:%M:%S" will return the LPeg
expression to parse this:

	Monday, 02 July 2018 @ 16:02:48

into

	{
	  min = 2.000000,
	  wday = 2.000000,
	  day = 2.000000,
	  month = 7.000000,
	  sec = 48.000000,
	  hour = 16.000000,
	  year = 2018.000000,
	}

This will even work with other locales, such as "se_NO.UTF-8",
which will allow LPeg to parse:

	vuossárga, 02 suoidnemánu 2018 @ 16:02:48

into

	{
	  min = 2,000000,
	  wday = 2,000000,
	  day = 2,000000,
	  month = 7,000000,
	  sec = 48,000000,
	  hour = 16,000000,
	  year = 2018,000000,
	}
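
The field names in these tables (year, month, day, hour, min, sec) are
the same ones Lua's os.time() accepts, so a parsed result can be turned
into a timestamp.  A minimal sketch, assuming the table shape shown for
the first example; the table is written out by hand here rather than
produced by the parser:

```lua
-- Hand-written copy of the first result table above.
local parsed = {
  year  = 2018,
  month = 7,
  day   = 2,
  hour  = 16,
  min   = 2,
  sec   = 48,
  wday  = 2,  -- os.time() ignores this field
}

-- os.time() interprets the fields as local time.
local stamp = os.time(parsed)

-- Format the timestamp again to check the round trip.
local text = os.date("%Y-%m-%d %H:%M:%S", stamp)
print(text)
```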

url

Parses URLs per RFC-3986.  By default, it will handle the following
URL types:

	http:
	https:
	file:
	ftp:

Given the following URL:

	http://www.conman.org/people/spc/index.cgi?one=1%3F&two=2&three=3#target1

It will be broken down into a Lua table as follows:

	{
	  scheme   = "http",
	  host     = "www.conman.org",
	  port     = 80,
	  path     = "/people/spc/index.cgi",
	  query    = "one=1%3F&two=2&three=3",
	  fragment = "target1",
	}

Other URLs can be parsed, but a URL like:

	mailto:[email protected][email protected],[email protected]&subject=Current%20Mystery

will be broken down as:

	{
	  scheme = "mailto",
	  path   = "[email protected]",
	  query  = "[email protected],[email protected]&subject=Current%20Mystery",
	}

which may require more parsing than provided here.
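
One way to do that extra parsing is to split the query string into
fields and percent-decode each one.  This is only a sketch in plain
Lua; unescape() and parse_query() are hypothetical helpers, not part of
the url module, and the field name "cc" and the address are
placeholders:

```lua
-- Hypothetical helper: percent-decode a URL component.
local function unescape(s)
  return (s:gsub("%%(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end))
end

-- Hypothetical helper: split "a=1&b=2" into { a = "1", b = "2" }.
-- A repeated name overwrites the earlier value.
local function parse_query(query)
  local fields = {}
  for name, value in query:gmatch("([^&=]+)=([^&]*)") do
    fields[unescape(name)] = unescape(value)
  end
  return fields
end

-- Placeholder query string in the same form as the one above.
local fields = parse_query("cc=watson@example.com&subject=Current%20Mystery")
print(fields.cc)       -- watson@example.com
print(fields.subject)  -- Current Mystery
```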

url.gopher

Parses "gopher:" URLs per RFC-4266.  Given this URL:

	gopher://gopher.conman.org/0foobar%09search%20String%09plus

it will be broken down as:

	{
	  scheme   = "gopher",
	  host     = "gopher.conman.org",
	  port     = 70.000000,
	  type     = "file",
	  selector = "foobar",
	  search   = "search String",
	  plus     = "plus",
	}

If you need to parse other URLs in addition to "gopher:" types,
you can do:

	gopher = require "org.conman.parsers.url.gopher"
	url    = require "org.conman.parsers.url"
	
	url  = gopher + url
	info = url:match(my_url)	

url.siptel

Parses "sip:" and "sips:" URIs per RFC-3261.  
Parses "tel:" URIs per RFC-3966.

Examples:

	sip = require "org.conman.parsers.url.sip"
	u = sip:match [[sip:[email protected];play=file://fs.example.net//clips/my-intro.dvi;content-type=video/mpeg%3bencode%d3314M-25/625-50]]

results in:

	{
	  host       = "example.com",
	  port       = 5060.000000,
	  user       = "annc",
	  scheme     = "sip",
	  parameters =
	  {
	    play             = "file://fs.example.net//clips/my-intro.dvi",
	    ["content-type"] = "video/mpeg%3bencode%d3314M-25/625-50",
	  },
	}

and 

	u = sip:match [[sip:+1-(555)-555-1212;[email protected];user=phone]]

results in:

	{
	  host = "example.net",
	  port = 5060.000000,
	  user =
	  {
	    number     = "15555551212",
	    global     = true,
	    parameters =
	    {
	      ext = "1234",
	    },
	  },
	  scheme     = "sip",
	  parameters =
	  {
	    user = "phone",
	  },
	}

and

	tel = require "org.conman.parsers.url.tel"
	u = tel:match "tel:+1-(555)-555-1212;ext=1234"

results in:

	{
	  scheme = "tel",
	  number = "15555551212",
	  global = true,
	  parameters =
	  {
	    ext = "1234",
	  },
	}

If you need to parse other URLs in addition to these types,
you can do:

	siptel = require "org.conman.parsers.url.siptel"
	url    = require "org.conman.parsers.url"

	url  = siptel + url
	info = url:match(my_url)

soundex.lua

Implements the Soundex algorithm.

[1] http://www.inf.puc-rio.br/~roberto/lpeg/

[2] https://github.com/spc476/lua-conmanorg