
deal with phonetic text

Open junpei-n opened this issue 1 year ago • 0 comments

When reading an Excel file with phonetic hints (ruby; see sample), each cell.value contains both the base text and its phonetic text. You can check this with the following:

# Employment Status Survey / Statistical Tables(Time Series) / Statistical Tables(Time Series) from e-Stat (a portal site for Japanese Government Statistics)
# https://www.e-stat.go.jp/en/stat-search/files?page=1&layout=datalist&toukei=00200532&tstat=000001116777&cycle=0&tclass1=000001116800&stat_infid=000031732265&tclass2val=0
using XLSX, Downloads
f = tempname()
Downloads.download("https://www.e-stat.go.jp/en/stat-search/file-download?statInfId=000031732265&fileKind=0", f)
wb = XLSX.readxlsx(f)
wb[1][:]

This behavior is uncommon: openpyxl and pandas do not include phonetic text in cell values. It could be fixed by changing the gather_strings! function inside the unformatted_text function, like this:

function unformatted_text(el::EzXML.Node) :: String

    function gather_strings!(v::Vector{String}, e::EzXML.Node)
        if EzXML.nodename(e) == "t" 
            push!(v, EzXML.nodecontent(e))
        end
        
        for ch in EzXML.eachelement(e)
            # Skip phonetic runs: test the child, so gather_strings! is never called on "rPh".
            if EzXML.nodename(ch) != "rPh"  ## !!HERE!!
                gather_strings!(v, ch)
            end
        end

        nothing
    end
    ...

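This change seems reasonable: phonetic text is stored in "rPh" elements, each of which has its own "t" child element. Because gather_strings! collects the content of every "t" element it reaches, descending into an "rPh" node appends the phonetic reading to the cell text, so skipping "rPh" nodes is enough to drop it.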
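To make the effect concrete, here is a minimal, self-contained sketch that does not use XLSX.jl itself. It parses a hand-written shared-string item with EzXML and collects the text with and without "rPh" runs. The sample XML, the helper names gather_text and collect_text!, and the skip_phonetic keyword are all assumptions made for illustration; they are not part of XLSX.jl, and real files also carry the spreadsheetml namespace on these elements.

```julia
using EzXML

# Hypothetical shared-string item: base text plus one phonetic run (rPh).
const SAMPLE_SI = """
<si>
  <t>東京</t>
  <rPh sb="0" eb="2"><t>トウキョウ</t></rPh>
  <phoneticPr fontId="1"/>
</si>
"""

# Recursively push the content of every "t" element into v.
# When skip_phonetic is true, "rPh" children are not visited at all.
function collect_text!(v::Vector{String}, e::EzXML.Node, skip_phonetic::Bool)
    if EzXML.nodename(e) == "t"
        push!(v, EzXML.nodecontent(e))
    end
    for ch in EzXML.eachelement(e)
        if skip_phonetic && EzXML.nodename(ch) == "rPh"
            continue
        end
        collect_text!(v, ch, skip_phonetic)
    end
    return nothing
end

# Illustrative wrapper (not the XLSX.jl API): join all collected strings.
function gather_text(e::EzXML.Node; skip_phonetic::Bool=true)
    out = String[]
    collect_text!(out, e, skip_phonetic)
    return join(out)
end

si = EzXML.root(EzXML.parsexml(SAMPLE_SI))
with_ruby = gather_text(si; skip_phonetic=false)   # base text followed by the reading
without_ruby = gather_text(si)                      # base text only
```

With skip_phonetic=false the result concatenates the base text and the reading, which mirrors what the issue describes; with the default, only the base text remains.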

junpei-n • Jul 21 '22 17:07