
Importing from weird text formats

Open leeper opened this issue 10 years ago • 6 comments

A substantial proportion of questions on StackOverflow are about how to read in data from weird text formats that aren't covered by the usual functionality. Sometimes these are just fixed-width files that users aren't familiar with or a slightly malformed TSV, but other times they're things like one of various flavors of markdown table, MediaWiki tables, or something else.

Lots of data is stored in these kinds of formats (e.g., on Wikipedia) but is locked up by the difficult-to-parse format.

Can we invent some functionality for parsing these formats and turning them into a data.frame?

leeper, Mar 06 '16 13:03

Most of these formats are handled by pandoc. A relatively easy way to handle this task may be to use a pandoc wrapper to convert the document to pandoc's native JSON format, and then write a function to import and convert that JSON. It may be even less work to convert to HTML and use rvest or something similar to import the HTML tables.

noamross, Mar 08 '16 15:03
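
A minimal sketch of the "convert to HTML, then read the HTML tables" route suggested above. It is written in Python with only the standard library, purely for illustration; it is not the rvest-based approach itself. It assumes the document has already been converted to HTML (for example with `pandoc -f markdown -t html input.md`, where `input.md` is a hypothetical file name). `TableExtractor`, `read_html_tables`, and `SAMPLE_HTML` are names made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> in an HTML string as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def read_html_tables(html):
    """Return every table in `html` as a list of rows (lists of strings)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical sample input, shaped like a converted markdown table.
SAMPLE_HTML = """
<table>
<thead><tr><th>name</th><th>n</th></tr></thead>
<tbody>
<tr><td>a</td><td>1</td></tr>
<tr><td>b</td><td>2</td></tr>
</tbody>
</table>
"""

tables = read_html_tables(SAMPLE_HTML)
header, *body = tables[0]  # first row is treated as the header
```

Every cell comes back as a string, so type conversion (numbers, dates) is left to whatever builds the final data frame.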

Jeroen has wrapped a commonmark parser, which would avoid the (potentially lossy) html transition for md tables: https://github.com/jeroenooms/commonmark

Possibly one can get something like a parse tree out of pandoc too, I don't know.

richfitz, Mar 08 '16 17:03

@leeper does it make sense, since there are a variety of different formats, to have a suite of recipes (scripts) for parsing weird/odd formats, some of which may use pkg X and others pkg Y + Z, rather than a single pkg, which may be spread very thin because of its many different dependencies (possibly heavy ones like pandoc)?

sckott, Mar 08 '16 17:03

@sckott I think that's a great idea! If we had one go-to place to show strategies for reading in data, that could be really useful. Maybe it's even just creating some StackOverflow r-faqs with clear and somewhat general tutorials.

leeper, Mar 08 '16 17:03
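
As one illustration of what a single "recipe" in such a collection might look like, here is a minimal sketch that reads a simple pipe-delimited markdown table without any heavy dependency. It is written in Python for illustration only. `read_pipe_table` and `SAMPLE_TABLE` are hypothetical names, and the sketch assumes a header row followed by a dash delimiter row, with no escaped `|` characters inside cells.

```python
import re


def split_row(line):
    """Split one pipe-table line into stripped cell strings."""
    line = line.strip()
    # Drop at most one leading and one trailing pipe so empty edge cells survive.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def read_pipe_table(text):
    """Parse a simple pipe table into a list of dicts keyed by header name."""
    rows = [split_row(line) for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected a header row and a delimiter row")
    header, delimiter, body = rows[0], rows[1], rows[2:]
    # The second row must look like |---|:--:|, i.e. dashes with optional colons.
    if not all(re.fullmatch(r":?-+:?", cell) for cell in delimiter):
        raise ValueError("second row is not a delimiter row")
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample input.
SAMPLE_TABLE = """
| name | n |
|------|---|
| a    | 1 |
| b    | 2 |
"""

records = read_pipe_table(SAMPLE_TABLE)
```

All values are returned as strings. Other markdown table flavors (grid tables, multiline tables) would need their own recipes.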

I needed this, so I tried my suggestion above and built a package that wraps pandoc to convert formats to HTML and then imports the tables via rvest::html_table: https://github.com/noamross/texttable

Unfortunately, there are a lot of formats for which pandoc's table readers are wonky, so you don't get good HTML tables in the output at the moment. But it works for markdown, docx, org-mode, textile, and some others.

@daattali You might be interested in this, too.

noamross, Apr 18 '16 21:04

Thanks @noamross

daattali, Apr 18 '16 21:04