dh-core

datasets : harmonize Netflix parsers with the rest

Open ocramz opened this issue 7 years ago • 0 comments

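The Netflix Prize dataset uses a custom parser because a single data example does not fit into one dataset row (as it would in CSV data). Instead, the files use a custom "stanza-based" format: a header line holding an id followed by a colon, then any number of comma-separated rows that belong to that id. For example, these are two stanzas of the "qualifying.txt" data file: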

1:
1046323,2005-12-19
1080030,2005-12-23
2127527,2005-12-04
1944918,2005-10-05
1057066,2005-11-07
954049,2005-12-20
10:
12868,2004-10-19
627923,2005-12-16
690763,2005-12-13
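
To make the format concrete, here is a minimal attoparsec sketch of a parser for stanzas like the ones above. The `Stanza` type and the names `header`, `entry`, `stanza` and `stanzas` are made up for illustration and are not taken from the library. The sketch also assumes that the number before the colon is a movie id and that the first field of each row is a user id; dates are kept as raw bytes rather than parsed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Control.Applicative              (many, (<|>))
import           Data.Attoparsec.ByteString.Char8 (Parser, char, decimal,
                                                   endOfInput, endOfLine,
                                                   parseOnly, takeTill)
import           Data.ByteString                  (ByteString)

-- | Hypothetical record: one header id plus the rows that follow it.
data Stanza = Stanza
  { movieId :: Int                  -- assumed meaning of the "1:" header
  , entries :: [(Int, ByteString)]  -- (assumed user id, raw date bytes)
  } deriving (Eq, Show)

-- | A line ends with a newline or with the end of the input.
eol :: Parser ()
eol = endOfLine <|> endOfInput

-- | Header line such as "1:".
header :: Parser Int
header = decimal <* char ':' <* eol

-- | Row such as "1046323,2005-12-19".
entry :: Parser (Int, ByteString)
entry = (,) <$> decimal <* char ',' <*> takeTill isLineBreak <* eol
  where isLineBreak c = c == '\n' || c == '\r'

-- | One header followed by zero or more rows. When the next header is
-- reached, 'entry' fails on the ':' and attoparsec backtracks, which
-- ends the 'many' loop.
stanza :: Parser Stanza
stanza = Stanza <$> header <*> many entry

-- | A whole file: a sequence of stanzas and nothing else.
stanzas :: Parser [Stanza]
stanzas = many stanza <* endOfInput

main :: IO ()
main = print (parseOnly stanzas sample)
  where
    sample = "1:\n1046323,2005-12-19\n1080030,2005-12-23\n10:\n12868,2004-10-19\n"
```

A row and a header both start with a number, so the parser relies on attoparsec backtracking on failure to tell them apart; no lookahead is needed.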

It would be nice to extend the library so that it can handle stanza-based formats like this one in the same way as the other datasets.

Solution sketch:

  • Add a constructor to ReadAs that accepts an attoparsec parser as a parameter
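
The proposal above could look roughly like the following sketch. The `ReadAs` type shown here is a simplified stand-in written for this example, not the library's actual definition. The constructor names `Lines` and `Parsable` and the function `readDataset` are likewise hypothetical.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.Attoparsec.ByteString.Char8 (Parser, decimal, parseOnly)
import           Data.ByteString                  (ByteString)
import qualified Data.ByteString.Char8            as BS8

-- | Simplified, hypothetical stand-in for the library's ReadAs type.
data ReadAs a where
  -- Placeholder for the existing row-oriented formats: one value per line.
  Lines    :: (ByteString -> Either String a) -> ReadAs [a]
  -- Proposed addition: decode the whole input with an attoparsec parser.
  Parsable :: Parser a -> ReadAs a

-- | Hypothetical decoder that dispatches on the chosen format.
readDataset :: ReadAs a -> ByteString -> Either String a
readDataset (Lines f)    bs = traverse f (BS8.lines bs)
readDataset (Parsable p) bs = parseOnly p bs

main :: IO ()
main = print (readDataset (Parsable (decimal :: Parser Int)) "42")
```

With a constructor of this shape, a stanza-based dataset would supply its own parser and go through the same decoding path as the other formats instead of needing a separate code path.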

ocramz Dec 30 '18 10:12