rdf4h
Help parsing large file
Hi there, I am working on parsing a large Turtle file; ideally I would like to turn it into an equivalent Haskell program. I have been profiling the read function and see memory usage growing over time, among other things.
For 30k lines of the file I got these stats from the rdf4h-3.0.1 release from Stack:
total alloc = 29,235,026,136 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
>>= Text.Parsec.Prim Text/Parsec/Prim.hs:202:5-29 17.4 7.1
satisfy Text.Parsec.Char Text/Parsec/Char.hs:(140,1)-(142,71) 16.2 32.7
noneOf.\ Text.Parsec.Char Text/Parsec/Char.hs:40:38-52 14.3 0.0
We can see that a large amount of time and memory is spent in Parsec. I am wondering the following:
- can we parse this data incrementally? Would it make sense to read it in line by line and feed each line to the parser? (A rough sketch of this idea follows below.)
- can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?
- will attoparsec help?
Examples of the files are here: https://gist.github.com/h4ck3rm1k3/e1b4cfa58c4dcdcfc18cecab013cc6c9
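For N-Triples input, where every line is a self-contained statement, I imagine something like the following naive sketch (hypothetical, just to illustrate: it parses each line as a one-triple graph, which is wasteful but keeps memory flat; the file name is made up):

import Data.RDF
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

main :: IO ()
main = do
  -- lazy read, so lines are produced as they are consumed
  contents <- TLIO.readFile "myfile.nt"
  mapM_ (parseLine . TL.toStrict) (TL.lines contents)
  where
    parseLine line =
      case parseString NTriplesParser line :: Either ParseFailure (RDF TList) of
        Left err  -> print err
        Right rdf -> mapM_ print (triplesOf rdf)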
Hi @h4ck3rm1k3 ,
Thanks for the report!
- will attoparsec help?
If you use the git repository for this library, then you can try the experimental attoparsec support that @axman6 added in November.
Try something like:
parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
Does that improve the memory performance?
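A minimal complete test program might look like this sketch (TList is one of the library's in-memory graph representations; the file name is a placeholder):

import Data.RDF

main :: IO ()
main = do
  -- TurtleParserCustom with the Attoparsec backend is the experimental
  -- parser from the git repository
  result <- parseFile (TurtleParserCustom Nothing Nothing Attoparsec) "myfile.ttl"
              :: IO (Either ParseFailure (RDF TList))
  case result of
    Left err  -> print err
    Right rdf -> print (length (triplesOf rdf))  -- force the parse, report the triple count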
- can we convert the RDF into an equivalent Haskell source program that would be compiled and strongly typed?
Interesting idea. What exactly would you want to convert to Haskell types? You might mean:
1. The schema for each ontology used in a Turtle file? E.g. if the friend-of-a-friend (FOAF) ontology is used, then the foaf:homepage predicate would be turned into a Haskell type? For this, have you looked at type providers? It's that sort of thing, i.e. turning a closed-world schema into types. F# has them, Haskell doesn't.
2. Turning Turtle data into types? I'm not sure how that'd work, or why turning ontological instances (data as triples) into Haskell types would be a useful thing to do, or what it'd look like.
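To make option 1 concrete, a generated (or, today, hand-written) Haskell type for a fragment of the FOAF schema might look something like this sketch; all the names here are hypothetical, no existing tool produces this:

import Data.Text (Text)

-- One possible encoding of a fragment of foaf:Person: each predicate
-- becomes a typed field, with cardinality reflected in the field type
-- (Maybe for optional, a list for repeatable).
data FoafPerson = FoafPerson
  { foafName     :: Text          -- foaf:name
  , foafHomepage :: Maybe Text    -- foaf:homepage (an IRI, kept as Text here)
  , foafKnows    :: [FoafPerson]  -- foaf:knows
  }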
I'm interested to know if attoparsec (above) gives you better results.
Yes, I have downloaded the git repo and am looking at it. I am interested in converting the data into types based on a schema or ontology I provide; for now I will create a custom one. Basically, I want to call constructors of different forms based on the data in the RDF.
Thinking about this, what I would really like is some mechanism to register a function that is applied to each statement as it is read, like the SAX model in XML parsing; then I could do my processing before the file has finished parsing.
Testing normal vs attoparsec parsing of 30k lines: we are still hovering around 0.5 seconds per 1k lines. The memory usage has gone down, but it is still not very fast. I think next I want to look into some callback mechanism. Both runs use NTriplesParserCustom.
Thu Sep 21 06:51 2017 Time and Allocation Profiling Report (Final)
gcc-haskell-exe +RTS -N -p -h -RTS
total time = 14.89 secs (14886 ticks @ 1000 us, 1 processor)
total alloc = 28,746,934,240 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
satisfy Text.Parsec.Char Text/Parsec/Char.hs:(140,1)-(142,71) 13.1 21.4
>>= Text.Parsec.Prim Text/Parsec/Prim.hs:202:5-29 11.4 13.5
mplus Text.Parsec.Prim Text/Parsec/Prim.hs:289:5-34 6.5 9.7
parsecMap.\ Text.Parsec.Prim Text/Parsec/Prim.hs:190:7-48 6.5 11.4
isSubDelims Network.URI Network/URI.hs:355:1-38 4.4 0.0
fmap.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(171,7)-(172,42) 4.1 3.1
isGenDelims Network.URI Network/URI.hs:352:1-34 3.7 0.0
>>=.\.succ' Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76 3.5 1.1
encodeChar Codec.Binary.UTF8.String Codec/Binary/UTF8/String.hs:(50,1)-(67,25) 3.1 4.6
encodeString Codec.Binary.UTF8.String Codec/Binary/UTF8/String.hs:37:1-53 2.3 4.0
concat.ts' Data.Text Data/Text.hs:902:5-34 2.0 2.6
Testing with the latest version of rdf4h:
Thu Sep 21 06:34 2017 Time and Allocation Profiling Report (Final)
gcc-haskell-exe +RTS -N -p -h -RTS
total time = 15.28 secs (15282 ticks @ 1000 us, 1 processor)
total alloc = 33,815,423,648 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
satisfy Text.Parsec.Char Text/Parsec/Char.hs:(140,1)-(142,71) 17.2 27.6
>>= Text.Parsec.Prim Text/Parsec/Prim.hs:202:5-29 16.5 22.8
parsecMap.\ Text.Parsec.Prim Text/Parsec/Prim.hs:190:7-48 9.2 8.4
mplus Text.Parsec.Prim Text/Parsec/Prim.hs:289:5-34 7.7 9.5
isSubDelims Network.URI Network/URI.hs:355:1-38 3.9 0.0
isGenDelims Network.URI Network/URI.hs:352:1-34 3.4 0.0
encodeChar Codec.Binary.UTF8.String Codec/Binary/UTF8/String.hs:(50,1)-(67,25) 2.9 3.9
encodeString Codec.Binary.UTF8.String Codec/Binary/UTF8/String.hs:37:1-53 2.2 3.4
parserReturn.\ Text.Parsec.Prim Text/Parsec/Prim.hs:234:7-30 2.0 3.1
Thinking about this, what I would really like is some mechanism to register a function that is applied to each statement as it is read, before the file has finished parsing.
Agreed, this would be a good feature: moving towards generating on-the-fly streams of RDF triples whilst parsing, rather than parsing a file/string in its entirety.
For example, building on the API of the io-streams library, I can imagine that to read an RDF source we'd have a new type class:
class RdfParserStream p where
  parseStringStream
    :: (Rdf a)
    => p
    -> Text
    -> Either ParseFailure (InputStream (RDF a))
  parseFileStream
    :: (Rdf a)
    => p
    -> String
    -> IO (Either ParseFailure (InputStream (RDF a)))
  parseURLStream
    :: (Rdf a)
    => p
    -> String
    -> IO (Either ParseFailure (InputStream (RDF a)))
Then these triple streams could be connected to an output stream, e.g. a file output stream, using the io-streams API:
connect :: InputStream a -> OutputStream a -> IO ()
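A consumer could then attach a per-statement callback using existing io-streams primitives. A sketch, assuming we streamed individual Triples rather than whole graphs (makeOutputStream and connect are real io-streams functions; forEachTriple is hypothetical):

import Data.RDF (Triple)
import System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- Run an IO callback on every triple in the stream; Nothing marks
-- end-of-stream, at which point we simply stop.
forEachTriple :: InputStream Triple -> (Triple -> IO ()) -> IO ()
forEachTriple input callback = do
  out <- Streams.makeOutputStream (maybe (return ()) callback)
  Streams.connect input out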
The big question I have for RDF and Haskell is how to create instances of types from RDF data. Is there any easy way to map RDF data, via some ontology, into Haskell types?
@h4ck3rm1k3 sadly not, although that would be very cool.
There is some work in this area for other languages, including F# and Idris:
- F#: https://docs.microsoft.com/en-us/dotnet/fsharp/tutorials/type-providers/
- Idris: http://www.davidchristiansen.dk/pubs/dependent-type-providers.pdf
And also in Scala, where they have support for type providers from RDF data: https://github.com/travisbrown/type-provider-examples
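In the absence of type providers, the mapping can of course be written by hand against rdf4h's query API. A minimal sketch (Person and personFromGraph are made-up names; query, unode and the Triple/LNode patterns are the library's):

{-# LANGUAGE OverloadedStrings #-}

import Data.RDF
import Data.Text (Text)

data Person = Person { personName :: Maybe Text }
  deriving Show

-- Look up the foaf:name of a subject node and build a Person from it.
personFromGraph :: Rdf a => RDF a -> Node -> Person
personFromGraph rdf subj =
  case query rdf (Just subj) (Just (unode "http://xmlns.com/foaf/0.1/name")) Nothing of
    (Triple _ _ (LNode (PlainL name)) : _) -> Person (Just name)
    _                                      -> Person Nothing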