V Corpus
Similar to this issue, I decided to run Lynn's method on V answers.
Query used. Code:
import csv
import collections
digraphs = collections.Counter()
trigraphs = collections.Counter()
quadgraphs = collections.Counter()
cp1252 = "ǝʒαβγδεζηθ\nвимнтΓΔΘιΣΩ≠∊∍∞₁₂₃₄₅₆ !\"#$%&'()*+,-./0123456789" + \
":;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrst" + \
"uvwxyz{|}~Ƶ€Λ‚ƒ„…†‡ˆ‰Š‹ŒĆŽƶĀ‘’“”•–—˜™š›œćžŸā¡¢£¤¥¦§¨©ª«¬λ®¯°" + \
"±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëì" + \
"íîïðñòóôõö÷øùúûüýþÿ"
with open("QueryResults(2).csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        # Skip the CSV header row
        if row[0] == "Post Link":
            continue
        code = row[1]
        if "<pre><code>" not in code:
            continue
        # Extract the first bit of code
        vyxal = (
            code.partition("<pre><code>")[2]
            .partition("</code></pre>")[0]
            .strip()
        )
        # Undo HTML escaping (named entities first, then numeric ones)
        vyxal = vyxal.replace("&quot;", '"')
        vyxal = vyxal.replace("&gt;", ">").replace("&lt;", "<")
        vyxal = vyxal.replace("&amp;", "&")
        for i in range(0, 256):
            vyxal = vyxal.replace("&#"+str(i)+";", cp1252[i])
        # Collapse "readable mode" key notation into single characters
        vyxal = vyxal.replace("<esc>", cp1252[0x1b])
        alpha = "abcdefghijklmnopqrstuvwxyz"
        for idx, i in enumerate(alpha):
            vyxal = vyxal.replace("<C-"+i+">", cp1252[idx+1])
        vyxal = vyxal.replace("<M-x>", "ø")
        # Skip answers that are very repetitive or very long
        if any(vyxal.count(c) >= 10 for c in vyxal):
            continue
        if len(vyxal) > 100:
            continue
        for line in vyxal.split("\n"):
            for (a, b) in zip(line, line[1:]):
                digraphs[a, b] += 1
            for (a, b, c) in zip(line, line[1:], line[2:]):
                trigraphs[a, b, c] += 1
            for (a, b, c, d) in zip(line, line[1:], line[2:], line[3:]):
                quadgraphs[a, b, c, d] += 1

with open("most-common.txt", "w", encoding="utf-8") as f:
    f.write("2-graphs:\n")
    for d, n in digraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))
    f.write("\n3-graphs:\n")
    for d, n in trigraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))
    f.write("\n4-graphs:\n")
    for d, n in quadgraphs.most_common(30):
        f.write("%4d %s\n" % (n, "".join(d)))
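As an aside, the three near-identical counting loops could probably be folded into one helper. This is only a sketch: `count_ngrams` is a made-up name, and unlike the script above it keys the counter by substring rather than by tuple of characters.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count every run of n consecutive characters, line by line.

    Hypothetical helper: n-grams never span a newline, matching the
    per-line counting in the script above.
    """
    counts = Counter()
    for line in text.split("\n"):
        # A line shorter than n yields an empty range and adds nothing.
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts
```

Calling `count_ngrams(code, 2)`, `count_ngrams(code, 3)` and `count_ngrams(code, 4)` on each answer and adding the results to running counters would give the same three tallies.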
Results (displayed in the 05AB1E codepage):
2-graphs:
24 Àñ
24 ./
21 Àé
21 @"
19 $x
17 xx
17 /
17
15 /&
15 dd
15 «©
14 Íî
14 òÍ
14 12
13 2i
13 lD
13 Gp
13 ll
13 é
12 Θ"
12 /d
12 /
12 Yp
11 r
11 lx
11 e
11 ₂
11 Ó.
11 òd
10 kl
3-graphs:
11 ./&
10 Ó./
8 [ae
8 aei
8 eio
8 iou
8 "qp
7 YGp
7 /&ò
7 lxx
7 $xh
7 D@"
6 Í./
6 xx>
6 Ä$x
6 qpx
6 Àé
5 Àé*
5 òͨ
5 Àñ
5 ou]
5 ¨ä«
4 ©î±
4 ¨[a
4 ]«©
4 «©¨
4 «©/
4 òÍî
4 /
4 /12
4-graphs:
8 [aei
8 aeio
8 eiou
8 Ó./&
6 ./&ò
6 "qpx
5 iou]
4 ¨[ae
4 òÄ$x
4 Ä$xh
4 ~"qp
4 :se
4 2i2i
4 ¨ä«©
3 Í./&
3 ¨.«©
3 lxx>
3 iouy
3 ouy]
3 uy]«
3 À|lD
3 Ñ~"q
3 ./&
3 òhYp
3 hYpX
3 :sor
3 éiD@
3 iD@"
3 ₂"qp
3 gÓul
I think there's something in your parsing that is a major oversight. &# is not really meaningful V code (it's valid, but not exactly 'useful'), so there's absolutely no way that's the most common 2-byte sequence in V code. Look at this answer for example: https://codegolf.stackexchange.com/a/124772/31716
The markdown for that answer is
<pre><code>Í.“op
</code></pre>
which renders on SE as
Í.op
Also, <C-x> means "ctrl-x", but it gets treated like it's 5 distinct bytes instead of 1 by this parser, which is why <C, C-, and esc all score so high. It seems like this parser isn't sophisticated enough to handle the way V answers tend to be formatted.
Thanks, I thought V was an SBCS. I'll try to parse <...> and &#...; in my analyzer.
Apparently, SE uses CP-1252, so I'll use the 05AB1E codepage to display it, replacing these sequences with the respective characters.
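For the &#...; part, a regex might be simpler than trying all 256 numbers one by one. A rough sketch, assuming each reference is a decimal byte value from 0 to 255 and that everything else in the snippet is already a Latin-1 character (`decode_numeric_entities` is a made-up name, not something from the analyzer):

```python
import re

def decode_numeric_entities(text):
    """Turn text containing &#N; references into raw bytes.

    Assumptions: N is decimal and in 0..255, and every other character
    is a Latin-1 code point that fits in a single byte.
    """
    def to_char(match):
        value = int(match.group(1))
        # Leave out-of-range references untouched rather than guess.
        return chr(value) if value < 256 else match.group(0)

    return re.sub(r"&#(\d+);", to_char, text).encode("latin-1")
```

With this, a hypothetical snippet like `Í.&#147;op` comes out as five bytes, and the bytes can then be mapped to whichever codepage is used for display.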
It is an SBCS; it's just that the answers are frequently formatted in "readable mode" with things like <C-a>, <esc>, <M-D>, etc. That's one additional thing that would need to be parsed: <M-x> means "alt-x", which would mean 'x' with the high bit set in latin9, or ø.
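The readable-mode rules described here could be sketched roughly as below. This is an illustration only: `expand_keys` and `NAMED_KEYS` are made-up names, the table lists only the one named key mentioned in this thread, and treating <C-a> and <C-A> alike is an assumption.

```python
import re

# Hypothetical table of named keys; only <esc> comes up in this thread.
NAMED_KEYS = {"esc": 0x1B}

def key_to_char(match):
    """Map one <...> token to the single character it stands for."""
    name = match.group(1)
    if name.lower() in NAMED_KEYS:
        return chr(NAMED_KEYS[name.lower()])
    if len(name) == 3 and name[1] == "-":
        prefix, key = name[0], name[2]
        if prefix == "C" and "a" <= key.lower() <= "z":
            # ctrl-a is byte 1, ctrl-b is byte 2, and so on.
            return chr(ord(key.lower()) - ord("a") + 1)
        if prefix == "M":
            # alt-x is 'x' with the high bit set.
            return chr(ord(key) | 0x80)
    # Unknown tokens are left as written.
    return match.group(0)

def expand_keys(text):
    """Replace every recognised <...> token in text with one character."""
    return re.sub(r"<([^<>]+)>", key_to_char, text)
```

Handling every <M-…> combination this way, instead of only <M-x>, would also cover cases like <M-D>.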
Done