Feature request: output <table> elements as tables

Open Lucas-C opened this issue 4 years ago • 8 comments

What do you think of the idea? Maybe this could be enabled through a CLI flag.

It could be done relatively easily using tabulate, PrettyTable or Pylsy.

Lucas-C avatar Oct 14 '19 11:10 Lucas-C

That's an interesting idea. I have been using a bookmarklet to extract tables from webpages as CSV files. A command line tool for this has crossed my mind a few times.

The default output format for tq is still plain text, but I am pretty sure that most people who script tq use the JSON options, as they provide the ability to properly consume the extracted data, for example with jq.

We could define a flag to enable table parsing and throw an error if the selected element is not a table. The output could be JSON lines by default: one JSON array of strings per line. This would be the bare minimum functionality, as one can always pipe to another command for formatting. jq can do this, for example: https://stackoverflow.com/questions/39139107/how-to-format-a-json-string-as-a-table-using-jq

Although that might be too intricate. An extra switch to format the output could also be added.
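
To make the "one JSON array of strings per line" idea concrete, here is a minimal sketch. It is not tq code: it uses the standard library's html.parser as a stand-in for the parser tq uses, the names TableRows and table_to_json_lines are hypothetical, the sample table is made up, and nested tables are not handled.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_json_lines(html):
    """Return one JSON-encoded array of cell strings per table row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [json.dumps(row) for row in parser.rows]


# Made-up sample data, only for illustration.
sample = """
<table>
  <tr><th>Team</th><th>GP</th><th>P</th></tr>
  <tr><td>Team A</td><td>34</td><td>80</td></tr>
  <tr><td>Team B</td><td>33</td><td>67</td></tr>
</table>
"""

for line in table_to_json_lines(sample):
    print(line)
```

Each printed line is a standalone JSON array, so a downstream tool such as jq can consume the rows one at a time.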

plainas avatar Oct 14 '19 13:10 plainas

I checked prettytable. What a nice little library! No dependencies. To the point.

Posting the link in here for future me :) https://github.com/jazzband/prettytable

plainas avatar Oct 14 '19 14:10 plainas

Ahaha, maybe we have been using the same bookmarklet ^^ Mine is here: https://github.com/Lucas-C/dotfiles_and_notes/blob/master/languages/web-d3/bookmarklets.md#table2csv

Glad you like the idea :)

Lucas-C avatar Oct 14 '19 14:10 Lucas-C

Ok, let's get the ball rolling on this one.

UI

Activate this feature with a CLI flag. Perhaps ''-T'' or ''--table''. I'm tempted to support only the latter, as ''-t'' is already taken for text, and having ''-T'' and ''-t'' doing two very different things may be difficult to remember. Although there's already ''-J'' and ''-j''.

It might be useful to include an extra flag to omit the table headers.

Behavior

Is there any use in being able to select the html inside table cells? I am not sure many people, if any, would have a use for this. Supporting innerText only feels like the way to go. Which is to say that, if we do choose ''-T'', it would imply ''-Tt''.

Select just one table element. If more than one matches, pick the first. This is not how tq behaves otherwise, but I don't see much of a use case for extracting many tables at once. Now that there is support for fancier css selectors, it's possible to use for example ''nth-match'' to get the desired table.

How strict should we be with selection? It might introduce a bit of confusion, but I've been thinking about relaxing selection a little bit and looking for the first table element inside the selected node. For example, one could pass ''body'' as a quick effortless way to retrieve the data from the first table in the page.
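
A rough sketch of the relaxed selection rule, just to pin down the semantics: if the selected node is a table, use it, otherwise take the first table found inside it. This assumes well-formed markup and uses xml.etree.ElementTree as a stand-in for the real parser; first_table is a hypothetical helper name, not something that exists in tq.

```python
import xml.etree.ElementTree as ET


def first_table(node):
    """Return node itself if it is a <table>, else its first <table> descendant."""
    # Element.iter() yields the element itself first and then its descendants
    # in document order, so one call covers both the strict and relaxed cases.
    return next(node.iter("table"), None)


page = ET.fromstring(
    "<body>"
    "<p>intro</p>"
    "<div><table id='first'><tr><td>1</td></tr></table></div>"
    "<table id='second'><tr><td>2</td></tr></table>"
    "</body>"
)

# Selecting 'body' falls through to the first table in document order.
print(first_table(page).get("id"))
# Selecting a table directly returns that same table.
print(first_table(page.find("table")).get("id"))
# A node with no table inside yields None, which the CLI could report as an error.
print(first_table(page.find("p")))
```

Here page.find("table") only looks at direct children of body, so it picks the table with id 'second', and first_table returns it unchanged.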

Output

This is the trickiest. Pipeline composability is an important goal. This is a command line tool in the tradition of classic unix principles. ASCII art formats such as the one from prettytable are suboptimal for this purpose, so I think that they should not be the default. I'm leaning towards JSON lines for the default. Composability is retained to an extent through other tools like jq.

Perhaps a format that outputs the text of each row on a single line could be useful for simple tables that contain numerical data, in the sense that they can be easily processed with awk or similar. But JSON lines feels like the least brittle in my opinion.
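
A small illustration of why the one-row-per-line text format is more brittle than JSON lines. The rows are made up, and splitting on whitespace is used here to mimic awk's default field splitting.

```python
import json

# Made-up rows; note that the first cell of the second row contains a space.
rows = [
    ["Team", "GP", "P"],
    ["Team A", "34", "80"],
]

# One row per line, cells joined by a single space (the awk-friendly format).
plain_lines = [" ".join(row) for row in rows]
# One JSON array of strings per line (JSON lines).
json_lines = [json.dumps(row) for row in rows]

# Whitespace splitting sees four fields where the table has three columns.
print(plain_lines[1].split())
# JSON decoding recovers the original three cells exactly.
print(json.loads(json_lines[1]))
```

For purely numeric tables both formats round-trip fine; the difference only shows up once a cell contains the separator.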

What do you think @Lucas-C ?

plainas avatar Oct 31 '19 17:10 plainas

IMHO:

  1. --table sounds great!
  2. I'm totally OK with the behaviour you described :)
  3. I'd vote for a dedicated option for JSON output, maybe --to-json, but have the default behaviour be a nicely formatted ASCII table. It would be more user-friendly, especially for newcomers to the tool, while allowing the to-json "composable" output to also be enabled in other modes (e.g. extracting <ul> / <ol> elements one day maybe?)
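
To make the flag proposal above easier to discuss, here is a hedged sketch of how the options could fit together. It is not tq's actual argument parser: --table and --to-json come from this thread, --no-headers is a hypothetical name for the "omit the table headers" flag, and the ASCII layout is a plain-Python placeholder rather than prettytable or tabulate output. It assumes every row has the same number of cells.

```python
import argparse
import json


def build_parser():
    parser = argparse.ArgumentParser(prog="tq-sketch")
    parser.add_argument("selector", help="CSS selector for the table")
    parser.add_argument("--table", action="store_true",
                        help="treat the selected element as a table")
    parser.add_argument("--to-json", action="store_true",
                        help="emit one JSON array per row instead of ASCII")
    # Hypothetical flag name; the thread only says such a flag might be useful.
    parser.add_argument("--no-headers", action="store_true",
                        help="omit the first (header) row")
    return parser


def render(rows, to_json=False, no_headers=False):
    """Format rows either as JSON lines or as a simple aligned ASCII table."""
    if no_headers:
        rows = rows[1:]
    if to_json:
        return "\n".join(json.dumps(row) for row in rows)
    if not rows:
        return ""
    # Pad every column to the width of its widest cell.
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


# Made-up rows standing in for an already extracted table.
rows = [["Team", "GP"], ["Team A", "34"]]

args = build_parser().parse_args(["--table", "--to-json", "table"])
print(render(rows, to_json=args.to_json, no_headers=args.no_headers))
print(render(rows))
```

Keeping the formatting step separate from extraction means the same row data can feed either output, whichever ends up being the default.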

Lucas-C avatar Oct 31 '19 18:10 Lucas-C

Examples of possible test data:

Premier League standings: https://www.espn.com/soccer/table/_/league/eng.1

Dow Jones top movers: https://money.cnn.com/data/dow30/

Plenty of per-country GDP data on Wikipedia, with plenty of clean tables: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(PPP)

plainas avatar Nov 25 '19 01:11 plainas

example of using lynx to render html

$ curl -s "https://www.espn.com/soccer/table/_/league/eng.1" | tq ".Table__Scroller > table:nth-child(1)" | lynx -stdin -dump
    [1]GP [2]W [3]D [4]L [5]F [6]A [7]GD [8]P
    34    25   5    4    71   24   +47   80
    33    19   10   4    64   35   +29   67
    34    19   6    9    61   39   +22   63
    34    17   10   7    53   31   +22   61
    34    17   7    10   55   44   +11   58
    34    16   8    10   60   38   +22   56
    33    15   9    9    55   39   +16   54
    33    15   7    11   45   42   +3    52
    34    14   7    13   46   37   +9    49
    33    14   6    13   48   38   +10   48
    34    14   5    15   50   52   -2    47
    34    11   9    14   33   46   -13   42
    33    10   8    15   34   56   -22   38
    34    8    13   13   35   39   -4    37
    33    10   7    16   41   59   -18   37
    34    9    9    16   31   47   -16   36
    34    9    9    16   36   56   -20   36
    34    5    12   17   25   45   -20   27
    34    5    11   18   31   65   -34   26
    34    5    2    27   18   60   -42   17

References

   1. file:///soccer/standings/_/league/ENG.1/sort/gamesplayed/dir/desc
   2. file:///soccer/standings/_/league/ENG.1/sort/wins/dir/desc
   3. file:///soccer/standings/_/league/ENG.1/sort/ties/dir/desc
   4. file:///soccer/standings/_/league/ENG.1/sort/losses/dir/asc
   5. file:///soccer/standings/_/league/ENG.1/sort/pointsfor/dir/desc
   6. file:///soccer/standings/_/league/ENG.1/sort/pointsagainst/dir/asc
   7. file:///soccer/standings/_/league/ENG.1/sort/pointdifferential/dir/desc
   8. file:///soccer/standings/_/league/ENG.1/sort/points/dir/desc

i think this is good enough for my use case

rachmadaniHaryono avatar May 04 '21 16:05 rachmadaniHaryono

Yes, such usage has always worked, but it is brittle if the data contains spaces, or generally speaking whatever is used as a separator.

What we are discussing is treating it as a special case so it can be output in a reliably parseable format. Specifically json.

The html is already parsed with beautifulsoup, so we have access to the data element by element.

I forgot about this ticket. I guess I haven't needed this lately.

plainas avatar May 05 '21 15:05 plainas