
#count reviewers by lang and sectors by lang of the reviewers

Open sylvaticus opened this issue 5 years ago • 2 comments

In case you were ever curious: some statistics on JOSS reviewers, computed from the public reviewer list.

*** The 20 most "best known" languages...
- python         ( 68.74 %)
- r              ( 27.52 %)
- c++            ( 18.85 %)
- c              ( 13.91 %)
- matlab         ( 8.3 %)
- java           ( 7.26 %)
- fortran        ( 5.76 %)
- javascript     ( 4.79 %)
- julia          ( 4.71 %)
- bash           ( 3.07 %)
- go             ( 2.02 %)
- perl           ( 1.65 %)
- c#             ( 1.57 %)
- rust           ( 1.5 %)
- php            ( 1.5 %)
- ruby           ( 1.27 %)
- sql            ( 1.12 %)
- scala          ( 0.9 %)
- haskell        ( 0.82 %)
- cuda           ( 0.75 %)
*** The 20 most "known" languages...
- python         ( 79.43 %)
- r              ( 33.88 %)
- c++            ( 31.41 %)
- c              ( 27.3 %)
- matlab         ( 17.88 %)
- java           ( 16.45 %)
- javascript     ( 12.86 %)
- fortran        ( 10.62 %)
- julia          ( 8.45 %)
- bash           ( 6.36 %)
- perl           ( 4.49 %)
- php            ( 3.89 %)
- c#             ( 3.66 %)
- go             ( 3.14 %)
- rust           ( 2.99 %)
- ruby           ( 2.84 %)
- sql            ( 2.24 %)
- scala          ( 2.09 %)
- html           ( 1.72 %)
- haskell        ( 1.5 %)
*** The 4 most common sectors for the 10 most "known" languages...
python      :   machine learning, bioinformatics, physics, statistics, 
r           :   bioinformatics, machine learning, statistics, genomics, 
c++         :   machine learning, bioinformatics, physics, statistics, 
c           :   machine learning, bioinformatics, astrophysics, statistics, 
matlab      :   machine learning, image processing, statistics, physics, 
java        :   machine learning, bioinformatics, software engineering, data science, 
javascript  :   machine learning, bioinformatics, data science, statistics, 
fortran     :   physics, astrophysics, computational fluid dynamics, computational chemistry, 
julia       :   machine learning, statistics, physics, data science, 
bash        :   bioinformatics, genomics, machine learning, computational biology, 
Generated with the code below (Julia):
# Source: reviewer database of JOSS at https://docs.google.com/spreadsheets/d/1PAPRJ63yq9aPC1COLjaQp8mHmEq3rZUzwUYxTulyu78/edit#gid=856801822

using OdsIO

# Loading data..
dataFile = "joss_reviewers_20200724.ods"
db = ods_read(dataFile,range=((4,2),(1340,9)))

# removing email
db = hcat(db[:,1:2],db[:,5:end])

# replacing "nothing"....
# ..with empty string in the first three columns...
for r in eachrow(db)
    for cidx in 1:3
        r[cidx] = isnothing(r[cidx]) ? "" : r[cidx]
    end
end
# ..and with zero in the number of reviews...
for r in eachrow(db)
    for cidx in 4:6
        r[cidx] = isnothing(r[cidx]) ? 0 : r[cidx]
    end
end

# Narrowing the element type: the first 3 columns hold strings, the last 3 hold integers
db = convert(Array{Union{String,Int64},2},db)

# Cleaning..
for r in eachrow(db)
    for cidx in 1:3
        # ugly...
        r[cidx] = replace(replace(replace(replace(replace(r[cidx], '/'=>','), '('=>','), ')'=> ','), '\n'=> ',') , "and"=> ',') |> strip |> lowercase
        r[cidx] = replace(r[cidx],", " => ',') # to avoid empty data
        r[cidx] = replace(r[cidx]," ," => ',') # to avoid empty data
        r[cidx] = replace(r[cidx], r",$" => "") # remove ending comma

    end
end

# Establishing vocabularies
vocLangs = Set{String}()
vocActivities = Set{String}()
for (ridx,r) in enumerate(eachrow(db))
    ##if ridx > 20 break end
    for cidx in 1:2
        #=
        debug = strip.(split(r[cidx],','))
        for l in debug
            if l == ""
                println(l)
                println(ridx)
                println(cidx)
            end
        end
        =#
      if r[cidx] == "" continue end
      push!(vocLangs,strip.(split(r[cidx],','))...)
    end
    for cidx in 3:3
      if r[cidx] == "" continue end
      push!(vocActivities,strip.(split(r[cidx],','))...)
    end
end
vocLangs      = collect(vocLangs)
vocActivities = collect(vocActivities)
langIdx       = Dict{String,Int64}(l => id for (id,l) in enumerate(vocLangs))
actIdx        = Dict{String,Int64}(a => id for (id,a) in enumerate(vocActivities))

nLangs             = length(vocLangs)
nActs              = length(vocActivities)
nRecords           = size(db,1)
preferredLangCount = zeros(Int64,nLangs)
competentLangCount = zeros(Int64,nLangs)
actCountByLang     = zeros(Int64,nLangs,nActs)

# Let's count!
for r in eachrow(db)
    plangs = strip.(split(r[1],','))
    olangs = strip.(split(r[2],','))
    langs  = union(Set(plangs),Set(olangs))
    acts   = strip.(split(r[3],','))
    [preferredLangCount[langIdx[l]]       += 1 for l in plangs if l != ""]
    [competentLangCount[langIdx[l]]       += 1 for l in langs if l != ""]
    [actCountByLang[langIdx[l],actIdx[a]] += 1 for l in langs, a in acts if l != "" && a != ""]
end

# Let's report:
n = 20
println("*** The $n most \"best known\" languages...")
sortIdx = reverse(sortperm(preferredLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*preferredLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]
n = 20
println("*** The $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*competentLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]

n = 10
n2 = 4
println("*** The $n2 most common sectors for the $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
for i in sortIdx
    lang = vocLangs[i]
    sortIdxActs = reverse(sortperm(actCountByLang[i,:]))[1:n2]
    print("$(rpad(lang,12)): \t")
    [print("$(vocActivities[j]), ") for j in sortIdxActs]
    print("\n")
end
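
For anyone without the spreadsheet at hand, the cleaning and counting steps above can be sketched on a few in-memory rows. This is only a minimal sketch, not the code that produced the numbers above: the rows are made up, and `rows`, `tokenize`, `count_rows` and `top_share` are hypothetical names. Unlike the script above, it splits on "and" only as a whole word (so "pandas" stays intact) and counts a language at most once per reviewer.

```julia
# Toy stand-in for the reviewer sheet (made-up rows, NOT JOSS data):
# first entry = preferred languages, second entry = other languages.
rows = [
    ("Python, R",          "C++/Fortran"),
    ("python",             "julia"),
    ("R",                  ""),
    ("Julia (and Python)", ""),
]

# Split a free-text cell into lowercase, trimmed, non-empty tokens.
# Separators: comma, slash, parentheses, newline, and the whole word "and".
function tokenize(cell::AbstractString)
    parts = split(lowercase(cell), r"[,/()\n]|\band\b")
    return String[strip(p) for p in parts if !isempty(strip(p))]
end

# For each token, count how many rows mention it at least once.
function count_rows(tokens_per_row)
    counts = Dict{String,Int}()
    for toks in tokens_per_row, t in unique(toks)
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

# The n most frequent tokens with their share of all records, in percent.
# Ties are broken alphabetically so the result is deterministic.
function top_share(counts, nrecords, n)
    ranked = sort(collect(counts); by = p -> (-p.second, p.first))
    top = ranked[1:min(n, length(ranked))]
    return [(name, round(100 * c / nrecords, digits = 2)) for (name, c) in top]
end

# "Best known" uses the preferred column only; "known" uses both columns.
preferred = count_rows(tokenize(r[1]) for r in rows)
known     = count_rows(union(tokenize(r[1]), tokenize(r[2])) for r in rows)

for (name, pct) in top_share(known, length(rows), 3)
    println("- ", rpad(name, 12), " ( ", pct, " %)")
end
```

Since each reviewer can list several languages, the shares are fractions of reviewers and need not sum to 100 %, which is also why the percentages reported above add up to more than 100.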

sylvaticus, Jul 24 '20 12:07

✨ thanks @sylvaticus! / cc @diehlpk who has been looking at the breakdown of languages of papers we've reviewed too.

arfon, Aug 05 '20 10:08

OK, it would be interesting to add these to the paper and compare them with the programming languages used in the reviewed repositories.

diehlpk, Aug 05 '20 16:08