More generic Input and Output format
I propose a refactoring of the input and output format to make it more generic, following what is done in other systems like MapReduce. The aim of this refactoring is to make Hawk suitable for any kind of input stream, not only ASCII text. In fact, after this refactoring, Hawk will be more than a text processor: it will be a stream processor. There are plenty of use cases, starting from working on semi-structured data that isn't encoded in ASCII. One example is Hadoop Sequence Files, a very simple yet powerful file type.
Disclaimer: This idea is a work-in-progress and I haven't defined it completely. I will describe it only for the input format, because the output format is already more powerful (we have `Row` and `Rows`, which are already generic) and it requires a separate discussion :-).
The input format can be divided into two components: the `Encoder` and the `RowFormat`. The first converts the raw input stream into a structured stream such as UTF-8 `Text`, while the second is in charge of extracting the next row from the stream and converting it into the data type on which the user expression will work.
To better explain what I want to do, I will use an example of an input encoder and of a tabular format. This example shows how to work with both ASCII and UTF-8 `Text` in a tabular way (a list of lists of fields).
Let's start with the `Encoder` example:
```haskell
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as LC8
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TextEncoding

class Encoder streamType where
  fromBS :: B.ByteString -> streamType

instance Encoder LC8.ByteString where
  fromBS = id

instance Encoder TL.Text where
  fromBS = TextEncoding.decodeUtf8
```
Then we can work on the `RowFormat`:
```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import Control.Arrow
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString.Lazy.Char8 as LC8
import Data.ByteString.Lazy.Search (breakOn, split)  -- from the stringsearch package
import qualified Data.List as L
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

class RowsFormat format streamType rowType where
  nextRow :: format -> streamType -> (rowType, streamType)

data TabularASCIIRowFormat = TabularI
  { wordsSep :: C8.ByteString
  , linesSep :: C8.ByteString
  }

data TabularTextRowFormat = TabularTI
  { textWordsSep :: TL.Text
  , textLinesSep :: TL.Text
  }

instance RowsFormat TabularASCIIRowFormat LC8.ByteString [C8.ByteString] where
  nextRow (TabularI ws ls) = first (L.map toStrict . split ws) . breakOn ls
    where toStrict = C8.concat . LC8.toChunks

instance RowsFormat TabularTextRowFormat TL.Text [T.Text] where
  nextRow (TabularTI ws ls) = first (L.map TL.toStrict . TL.splitOn ws) . TL.breakOn ls
```
(In theory, the two tabular formats could be unified under a single generic, parametric `Tabular` format.)
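For illustration, here is a rough sketch of what such a parametric format could look like, reusing the `RowsFormat` class, imports, and language extensions above (the `Tabular` name and fields are hypothetical):

```haskell
-- Hypothetical unification: a single parametric tabular format, with one
-- RowsFormat instance per stream type instead of one datatype per encoding.
data Tabular sep = Tabular
  { tabWordsSep :: sep
  , tabLinesSep :: sep
  }

-- Same logic as TabularTI above, expressed against the parametric type.
instance RowsFormat (Tabular TL.Text) TL.Text [T.Text] where
  nextRow (Tabular ws ls) =
    first (L.map TL.toStrict . TL.splitOn ws) . TL.breakOn ls
```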
Once we have this system in place, we can virtually work on any kind of stream. By default `LC8.ByteString` is used as the `Encoder` and `TabularASCIIRowFormat` as the `RowFormat`, but the user can change that from the configuration. The options `-d`/`-D` still configure the `RowFormat`, but in a more generic way (when possible). Two new options can be supplied to let the user define the `Encoder` and the `RowFormat` on the fly, for example:
```
> hawk --input-encoder 'UTF8' --row-format 'TabularTextRowFormat " " "\n"' --output-encoder 'SequenceFile' -a <...>
...
```
In this example Hawk decodes the input into UTF-8 `Text`, then splits the text into rows with format `[Text]`, and then applies the user expression to the input, so a function of type `[[Text]] -> Rows a` is expected as the user expression. The output of Hawk is then encoded into a binary format.
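To make the flow concrete, here is a minimal sketch, not part of the proposal itself, of how the core could combine the two pieces to produce the rows fed to the user expression (it reuses the `Encoder` and `RowsFormat` definitions above):

```haskell
-- Sketch: decode the raw input with the Encoder, then repeatedly pull rows
-- with the RowsFormat until the stream is exhausted.
runTabularText :: TabularTextRowFormat -> B.ByteString -> [[T.Text]]
runTabularText fmt = go . fromBS
  where
    go stream
      | TL.null stream = []
      | otherwise =
          let (row, rest) = nextRow fmt stream
          -- drop the line separator that breakOn leaves at the front of rest
          in row : go (TL.drop (TL.length (textLinesSep fmt)) rest)
```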
Note that at this point, setting `-d`/`-D` to an empty delimiter means changing the `RowFormat` (we can't use the tabular format with empty delimiters; we must switch to another format).
Sounds good! I've also been thinking about custom input and output formats, mostly in relation to Auto mode. Here is how I was imagining this system before reading about your version.
There would be a set of possible input formats, which the user could extend via a configuration file. Since this is Auto mode, there would be some code to detect that input format. I was thinking that the configuration file would be a Haskell module containing a bunch of functions of the form `ByteString -> Maybe a`, which would be tried one after the other until a match was found. For example, there would be a function trying to interpret the ByteString as XML, another as JSON, another as a PNG, and at the very end the default would be a raw ByteString.
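As a rough illustration of what two such detection functions could look like (hypothetical names; the JSON case uses `decode` from the aeson package):

```haskell
import qualified Data.Aeson as Aeson
import qualified Data.ByteString.Lazy as B

-- Try to interpret the input as JSON; Nothing means "not this format".
asJSON :: B.ByteString -> Maybe Aeson.Value
asJSON = Aeson.decode

-- Last-resort default: the raw bytes always match.
asRawByteString :: B.ByteString -> Maybe B.ByteString
asRawByteString = Just
```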
The concrete type for `a` and the inferred type of the user expression would then be used during the next Auto phase, to figure out how to apply the user expression to the input. There would be another configuration file listing a bunch of functions of the same form: `HawkRuntime -> a -> b -> IO ()`, where `a` is the input format and `b` is the type of the user expression. Once again, we would try all the functions in order until the first match, but this time the match would be determined by type unification. For example, if

- the input type has been determined to be `PNG`,
- the user expression is `id`,
- there is a typeclass `Tabular`, and
- there is an instance `Tabular PNG (Int, Int, Int)`,

then the following functions are all considered a match:
```haskell
applyExpr :: (Tabular t a, Rows r) => HawkRuntime -> t -> ([[a]] -> r) -> IO ()
applyExpr runtime t f = applyHawkExpr runtime f (toTable t)

mapExpr :: (Tabular t a, Row r) => HawkRuntime -> t -> ([a] -> r) -> IO ()
mapExpr runtime t f = mapHawkExpr runtime f (toTable t)

transformPNG :: HawkRuntime -> PNG -> (PNG -> PNG) -> IO ()
transformPNG runtime png f = evalHawkExpr runtime (pngToByteString (f png))
```
And then, of course, the first matching function would run, delegating most of the work to one of several Hawk builtins we would provide.
Do you think our ideas are compatible?
Yes, I think so. What is important for me is that both systems are able to work with generic input, making Hawk more interesting.

Before coding, I would like to use this issue as a discussion about this system, because it will be a heavy shift in how Hawk works today. In particular, I would like to preserve the non-auto mode as well as the auto mode. In my opinion they are both very interesting.
I would like to add a note about my idea that is also relevant to yours: we don't need to change the input and output format to achieve a generic input and output system. It depends on what we think the standard usage of Hawk is. The current input is very low level (`ByteString` with ASCII encoding), and when the delimiters are empty we are virtually giving the user the possibility to work directly on the input bytes. If, for example, one would like to work on UTF-8, they can already write something like:
```
> echo "ሴ" | hawk -d -D -a 'encodeUtf8 . id . decodeUtf8'
ሴ
```
In the example above Hawk is working with UTF-8. By changing the two functions `encodeUtf8` and `decodeUtf8` we can achieve the same generic system that I presented before, and that's thanks to `ByteString` (`encodeUtf8` and `decodeUtf8` are functions of `Data.Text.Lazy.Encoding`).
This system is very different from the first one because the encoding/decoding is now part of the user expression. This has drawbacks, but it doesn't require changes to the core of Hawk, and I think it would work very well with your `auto` mode and your configuration system (a file `<context_dir>/codec.hs`, or maybe a part of the configuration dedicated to codecs). It is also trivial to extend.
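For illustration, such a codec file could be nothing more than a module of small helpers that the user composes around their expression (a hypothetical sketch, not an existing Hawk feature):

```haskell
-- Hypothetical contents of a codec module: helpers the user can wrap around
-- an expression, while Hawk's core keeps seeing only ByteStrings.
module Codec where

import qualified Data.ByteString.Lazy as B
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Run a Text-to-Text expression over UTF-8 encoded input.
withUtf8 :: (TL.Text -> TL.Text) -> B.ByteString -> B.ByteString
withUtf8 f = TLE.encodeUtf8 . f . TLE.decodeUtf8
```

Assuming these helpers were brought into scope via the configuration, the example above could become `hawk -d -D -a 'withUtf8 id'`.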
Note that `Rows` allows the user to output `ByteString`, so one could create a special type that is serialized in binary while still being compatible with the current system. That's exactly why Hawk works on `ByteString` instead of `Text`. For example, the user can use `Data.Serialize.encode` to serialize values into something more efficient than text.
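A minimal sketch of that idea (the function name is hypothetical; `encode` is from the cereal package, and the output is plain strict `ByteString` rows as allowed today):

```haskell
import qualified Data.ByteString.Char8 as C8
import qualified Data.Serialize as Cereal  -- the cereal package

-- Emit each row's field count as a compact binary value instead of text.
binaryCounts :: [[C8.ByteString]] -> [C8.ByteString]
binaryCounts = map (Cereal.encode . length)
```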
In my opinion we should change the core only if we think the system presented here is too clunky to use. Otherwise we should keep the core as it is, make it as stable as possible, and maybe add a codec library to let the user work on different kinds of data. I don't know which I prefer; both systems have problems, and that's why I think we should discuss this, and the nature of Hawk, further.
I forgot to mention that if we modify the core of Hawk as I described in the first post, then the system can be easily configurable (e.g. by setting UTF-8 as the default codec), while in the second case that could be more difficult to do. For instance, in the second case the map mode would work only with `ByteString` unless we do some heavy hacking. I'm open to discussion about that too :-)
I'm lost, the word "system" is now referring to too many things :)

Let's give names to all those systems; at the same time, this will allow me to summarize my understanding of the many ideas we have come up with in this thread:
- With the "class-based encoding" system, ByteString would no longer be the core format into which the input chunks (fields, rows, or a single stream) are being parsed. Instead, there would be a type class `Encoder` and one instance for each supported format. I'm not sure if you intended to allow the user to define new instances of `Encoder`. Since your example used `--input-encoder UTF8` instead of `--input-encoder TL.Text`, I guess they would be builtin.
- With the "row producer" system, we would no longer use `-d` and `-D` to determine whether to split the input into rows, into fields, or to leave the input as a single stream. Instead, the user would specify a "row format" value, thereby specifying two things: which datatype to use and which value of that datatype to use. The datatype would be used to select an appropriate `RowsFormat` instance, which specifies how to consume the raw ByteString input and produce the next row. The value of that datatype would be used to customize this process: for example, it could specify delimiters.
- With the "first-match parser" system, a sequence of parsers would attempt to parse the input into one of several formats. If `-m` is used, then an error is thrown unless the detected input format `a` is an instance of `Tabular a ByteString`.
- With the "first-match mode" system, we would no longer use `-m` and `-a` to determine how to apply the user expression to the input. Instead, a sequence of handlers would attempt to massage the input and the user expression into a format accepted by one of Hawk's builtins. The selected "mode" would be determined by whichever handler is the first to typecheck when given the input and the user expression.
- The "manual conversion" system is already available: if we need an input format other than rows or fields, we can use the raw input stream format and prepend our user expression with a function from ByteString to whatever format we need.
Does that sound right?
- I don't know if the user wants to define their own encoder, but I think it should be allowed. In general it is very difficult to consider all the possibilities. As for the names, I'm sorry, but using `UTF8` instead of `TL.Text` was a mistake (I always associate UTF8 with Text). The encoder should be a valid Haskell type, like `TL.Text`.
- Yes.
- Yes, unless we find a way to use the map mode also with other kinds of rows. We could, for example, say that every kind of data produced by the `RowFormat` is a valid `Row`, so we can map over it (see the sketch after this list). The extreme case is when the `Row` is equivalent to the entire raw stream, and in that case the map mode and the apply mode should be the same. For instance, right now when the delimiters are empty we have only one record, which is the entire stream (this isn't working at the moment, of course).
- Yes.
- Yes. I did some tests and I can effectively work on XML and JSON without problems with the current system; `ByteString` allows that. The question is: do we want a more complex but modular system or not?
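A rough sketch of the idea in the third point (hypothetical and simplified; Hawk's actual `Row`/`Rows` classes look different), just to show how the constraint could be expressed:

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

import qualified Data.ByteString.Lazy as B

-- Simplified, hypothetical Row class: anything that can be rendered as output.
class Row rowType where
  repr :: rowType -> B.ByteString

-- Variant of the earlier RowsFormat class: every rowType a format produces
-- must itself be a valid Row, so map mode works for any format.
class Row rowType => RowsFormat format streamType rowType where
  nextRow :: format -> streamType -> (rowType, streamType)
```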
I think we should consider what we want Hawk to be able to do in the future. If we think that Hawk will be used for ASCII text 90% of the time, then changing the input/output system is overkill. If we think that Hawk will be used for different purposes, then we should consider a more modular yet configurable system. I'm open to discussion about this because I really don't have a firm opinion about it.
I suggest considering this after the `1.1` version, because we should take some time to think about it :-). Once we change the Hawk core, going back can be a pain :-).
> The question is: do we want a more complex but modular system or not?

I do, but maybe not for 1.1!