More generic Input and Output format
I propose a refactoring of the input and output format to make it more generic, following what is done in other systems like MapReduce. The aim of this refactoring is to make Hawk suitable for any kind of input stream, not only ASCII text. In fact, after this refactoring, Hawk will be more than a text processor: it will be a stream processor. There are plenty of use cases, starting from working on semi-structured data that isn't encoded in ASCII. One example is Hadoop Sequence Files, a very simple yet powerful file type.
Disclaimer: This idea is a work-in-progress and I haven't defined it completely. I will describe it only for the input format, because the output format is already more powerful (we have `Row` and `Rows`, which are already generic) and it requires a separate discussion :-).
The input format can be divided into two components: the `Encoder` and the `RowFormat`. The first converts the raw input stream into a structured stream such as UTF-8 `Text`, while the second is in charge of extracting the next row from the stream and converting it into the data type on which the user expression will work.
To better explain what I want to do, I will use an example of an input encoder and of a tabular format. This example shows how to work with both ASCII and UTF-8 `Text` in a tabular way (a list of lists of fields).
Let's start with the `Encoder` example:
```haskell
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as LC8
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TextEncoding

class Encoder streamType where
  fromBS :: B.ByteString -> streamType

instance Encoder LC8.ByteString where
  fromBS = id

instance Encoder TL.Text where
  fromBS = TextEncoding.decodeUtf8
```
Then we can work on the `RowFormat`:
```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

import Control.Arrow
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString.Lazy.Char8 as LC8
import Data.ByteString.Lazy.Search (breakOn, split)  -- from the stringsearch package
import qualified Data.List as L
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

class RowsFormat format streamType rowType where
  nextRow :: format -> streamType -> (rowType, streamType)

data TabularASCIIRowFormat = TabularI
  { wordsSep :: C8.ByteString
  , linesSep :: C8.ByteString
  }

data TabularTextRowFormat = TabularTI
  { textWordsSep :: TL.Text
  , textLinesSep :: TL.Text
  }

instance RowsFormat TabularASCIIRowFormat LC8.ByteString [C8.ByteString] where
  nextRow (TabularI ws ls) = first (L.map toStrict . split ws) . breakOn ls
    where toStrict = C8.concat . LC8.toChunks

instance RowsFormat TabularTextRowFormat TL.Text [T.Text] where
  nextRow (TabularTI ws ls) = first (L.map TL.toStrict . TL.splitOn ws) . TL.breakOn ls
```
(In theory, the two tabular formats could be unified under a single generic, parametric `Tabular` format.)
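For illustration, here is a rough sketch of what such a parametric format could look like, reusing the `RowsFormat` class, imports, and language extensions above (the `Tabular` name and fields are hypothetical):

```haskell
-- Hypothetical unification: a single parametric tabular format, with one
-- RowsFormat instance per stream type instead of one datatype per encoding.
data Tabular sep = Tabular
  { tabWordsSep :: sep
  , tabLinesSep :: sep
  }

-- Same logic as TabularTI above, expressed against the parametric type.
instance RowsFormat (Tabular TL.Text) TL.Text [T.Text] where
  nextRow (Tabular ws ls) =
    first (L.map TL.toStrict . TL.splitOn ws) . TL.breakOn ls
```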
Once we have this system in place, we can virtually work on any kind of stream. By default `LC8.ByteString` is used as the `Encoder` and `TabularASCIIRowFormat` as the `RowFormat`, but the user can change that from the configuration. The options `-d`/`-D` still configure the `RowFormat`, but in a more generic way (when possible). Two new options can be supplied to let the user define the `Encoder` and the `RowFormat` on the fly, for example:
```
> hawk --input-encoder 'UTF8' --row-format 'TabularTextRowFormat " " "\n"' --output-encoder 'SequenceFile' -a <...>
...
```
In this example Hawk decodes the input into UTF-8 `Text`, then splits the text into rows with format `[Text]`, and then applies the user expression to the input, so a function of type `[[Text]] -> Rows a` is expected as the user expression. The output of Hawk is then encoded into a binary format.
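To make the flow concrete, here is a minimal sketch, not part of the proposal itself, of how the core could combine the two pieces to produce the rows fed to the user expression (it reuses the `Encoder` and `RowsFormat` definitions above):

```haskell
-- Sketch: decode the raw input with the Encoder, then repeatedly pull rows
-- with the RowsFormat until the stream is exhausted.
runTabularText :: TabularTextRowFormat -> B.ByteString -> [[T.Text]]
runTabularText fmt = go . fromBS
  where
    go stream
      | TL.null stream = []
      | otherwise =
          let (row, rest) = nextRow fmt stream
          -- drop the line separator that breakOn leaves at the front of rest
          in row : go (TL.drop (TL.length (textLinesSep fmt)) rest)
```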
Note that at this point, setting `-d`/`-D` to an empty delimiter means changing the `RowFormat` (we can't use the tabular format with empty delimiters; we must switch to another format).
Sounds good! I've also been thinking about custom input and output formats, mostly in relation to Auto mode. Here is how I was imagining this system before reading about your version.
There would be a set of possible input formats, which the user could extend via a configuration file. Since this is Auto mode, there would be some code to detect that input format. I was thinking that the configuration file would be a Haskell module containing a bunch of functions of the form `ByteString -> Maybe a`, which would be tried one after the other until a match was found. For example, there would be a function trying to interpret the ByteString as XML, another as JSON, another as a PNG, and at the very end the default would be a raw ByteString.
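As a rough illustration of what two such detection functions could look like (hypothetical names; the JSON case uses `decode` from the aeson package):

```haskell
import qualified Data.Aeson as Aeson
import qualified Data.ByteString.Lazy as B

-- Try to interpret the input as JSON; Nothing means "not this format".
asJSON :: B.ByteString -> Maybe Aeson.Value
asJSON = Aeson.decode

-- Last-resort default: the raw bytes always match.
asRawByteString :: B.ByteString -> Maybe B.ByteString
asRawByteString = Just
```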
The concrete type for `a` and the inferred type of the user expression would then be used during the next Auto phase, to figure out how to apply the user expression to the input. There would be another configuration file listing a bunch of functions of the same form: `HawkRuntime -> a -> b -> IO ()`, where `a` is the input format and `b` is the type of the user expression. Once again, we would try all the functions in order until the first match, but this time the match would be determined by type unification. For example, if

- the input type has been determined to be `PNG`,
- the user expression is `id`,
- there is a typeclass `Tabular`, and
- there is an instance `Tabular PNG (Int, Int, Int)`,

then the following functions are all considered a match:
```haskell
applyExpr :: (Tabular t a, Rows r) => HawkRuntime -> t -> ([[a]] -> r) -> IO ()
applyExpr runtime t f = applyHawkExpr runtime f (toTable t)

mapExpr :: (Tabular t a, Row r) => HawkRuntime -> t -> ([a] -> r) -> IO ()
mapExpr runtime t f = mapHawkExpr runtime f (toTable t)

transformPNG :: HawkRuntime -> PNG -> (PNG -> PNG) -> IO ()
transformPNG runtime png f = evalHawkExpr runtime (pngToByteString (f png))
```
And then, of course, the first matching function would run, delegating most of the work to one of several Hawk builtins we would provide.
Do you think our ideas are compatible?
Yes, I think so. What is important for me is that both systems are able to work with generic input, making Hawk more interesting.

Before coding, I would like to use this issue as a discussion about this system, because it will be a heavy shift in how Hawk works today. In particular, I would like to preserve the non-auto mode as well as the auto mode. In my opinion they are both very interesting.
I would like to add a note about my idea that is also relevant to yours: we don't need to change the input and output format to achieve a generic input and output system. It depends on what we think the standard usage of Hawk is. The current input is very low level (`ByteString` with ASCII encoding), and when the delimiters are empty we are virtually giving the user the possibility to work directly on the input bytes. If, for example, one would like to work on UTF-8, they can already write something like:
```
> echo "ሴ" | hawk -d -D -a 'encodeUtf8 . id . decodeUtf8'
ሴ
```
In the example above Hawk is working with UTF-8. By changing the two functions `encodeUtf8` and `decodeUtf8` we can achieve the same generic system that I presented before, and that's thanks to `ByteString` (`encodeUtf8` and `decodeUtf8` are functions of `Data.Text.Lazy.Encoding`).
This system is very different from the first one because the encoding/decoding is now part of the user expression. This has drawbacks, but it doesn't require changes to the core of Hawk, and I think it would work very well with your `auto` mode and your configuration system (a file `<context_dir>/codec.hs`, or maybe a part of the configuration dedicated to codecs). It is also trivial to extend.
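For illustration, such a codec file could be nothing more than a module of small helpers that the user composes around their expression (a hypothetical sketch, not an existing Hawk feature):

```haskell
-- Hypothetical contents of a codec module: helpers the user can wrap around
-- an expression, while Hawk's core keeps seeing only ByteStrings.
module Codec where

import qualified Data.ByteString.Lazy as B
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Run a Text-to-Text expression over UTF-8 encoded input.
withUtf8 :: (TL.Text -> TL.Text) -> B.ByteString -> B.ByteString
withUtf8 f = TLE.encodeUtf8 . f . TLE.decodeUtf8
```

Assuming these helpers were brought into scope via the configuration, the example above could become `hawk -d -D -a 'withUtf8 id'`.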
Note that `Rows` allows the user to output `ByteString`, so one could create a special type that is serialized in binary while still being compatible with the current system. That's exactly why Hawk works on `ByteString` instead of `Text`. For example, the user can use `Data.Serialize.encode` to serialize values into something more efficient than text.
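A minimal sketch of that idea (the function name is hypothetical; `encode` is from the cereal package, and the output is plain strict `ByteString` rows as allowed today):

```haskell
import qualified Data.ByteString.Char8 as C8
import qualified Data.Serialize as Cereal  -- the cereal package

-- Emit each row's field count as a compact binary value instead of text.
binaryCounts :: [[C8.ByteString]] -> [C8.ByteString]
binaryCounts = map (Cereal.encode . length)
```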
In my opinion we should change the core only if we think the system presented here is too clunky to use. Otherwise we should keep the core as it is, make it as stable as possible, and maybe add a codec library to let the user work on different kinds of data. I don't know which I prefer; both systems have problems, and that's why I think we should discuss this, and the nature of Hawk, further.
I forgot to mention that if we modify the core of Hawk as I described in the first post, then the system can be easily configurable (e.g. by setting UTF-8 as the default codec), while in the second case that could be more difficult to do. For instance, in the second case the map mode would work only with `ByteString` unless we do some heavy hacking. I'm open to discussion about that too :-)
I'm lost, the word "system" is now referring to too many things :)

Let's give names to all those systems; at the same time, this will allow me to summarize my understanding of the many ideas we have come up with in this thread:
- With the "class-based encoding" system, ByteString would no longer be the core format into which the input chunks (fields, rows, or a single stream) are being parsed. Instead, there would be a type class `Encoder` and one instance for each supported format. I'm not sure if you intended to allow the user to define new instances of `Encoder`. Since your example used `--input-encoder UTF8` instead of `--input-encoder TL.Text`, I guess they would be builtin.
- With the "row producer" system, we would no longer use `-d` and `-D` to determine whether to split the input into rows, into fields, or to leave the input as a single stream. Instead, the user would specify a "row format" value, thereby specifying two things: which datatype to use and which value of that datatype to use. The datatype would be used to select an appropriate `RowsFormat` instance, which specifies how to consume the raw ByteString input and produce the next row. The value of that datatype would be used to customize this process: for example, it could specify delimiters.
- With the "first-match parser" system, a sequence of parsers would attempt to parse the input into one of several formats. If `-m` is used, then an error is thrown unless the detected input format `a` is an instance of `Tabular a ByteString`.
- With the "first-match mode" system, we would no longer use `-m` and `-a` to determine how to apply the user expression to the input. Instead, a sequence of handlers would attempt to massage the input and the user expression into a format accepted by one of Hawk's builtins. The selected "mode" would be determined by whichever handler is the first to typecheck when given the input and the user expression.
- The "manual conversion" system is already available: if we need an input format other than rows or fields, we can use the raw input stream format and prepend our user expression with a function from ByteString to whatever format we need.
Does that sound right?
- I don't know if the user wants to define their own encoder, but I think it should be allowed. In general it is very difficult to consider all the possibilities. As for the names, I'm sorry, but using `UTF8` instead of `TL.Text` was a mistake (I always associate UTF8 with Text). The encoder should be a valid Haskell type, like `TL.Text`.
- Yes.
- Yes, unless we find a way to use the map mode also with other kinds of rows. We could, for example, say that every kind of data produced by the `RowFormat` is a valid `Row`, so we can map over it (see the sketch after this list). The extreme case is when the `Row` is equivalent to the entire raw stream, and in that case the map mode and the apply mode should be the same. For instance, right now when the delimiters are empty we have only one record, which is the entire stream (this isn't working at the moment, of course).
- Yes.
- Yes. I did some tests and I can effectively work on XML and JSON without problems with the current system; `ByteString` allows that. The question is: do we want a more complex but modular system or not?
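A rough sketch of the idea in the third point (hypothetical and simplified; Hawk's actual `Row`/`Rows` classes look different), just to show how the constraint could be expressed:

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

import qualified Data.ByteString.Lazy as B

-- Simplified, hypothetical Row class: anything that can be rendered as output.
class Row rowType where
  repr :: rowType -> B.ByteString

-- Variant of the earlier RowsFormat class: every rowType a format produces
-- must itself be a valid Row, so map mode works for any format.
class Row rowType => RowsFormat format streamType rowType where
  nextRow :: format -> streamType -> (rowType, streamType)
```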
I think we should consider what we want Hawk to be able to do in the future. If we think that Hawk will be used for ASCII text 90% of the time, then changing the input/output system is overkill. If we think that Hawk will be used for different purposes, then we should consider a more modular yet configurable system. I'm open to discussion about this because I really don't have a firm opinion about it.
I suggest considering this after the `1.1` version, because we should take some time to think about it :-). Once we change the Hawk core, going back can be a pain :-).
> The question is: do we want a more complex but modular system or not?

I do, but maybe not for 1.1!