XLSX.jl icon indicating copy to clipboard operation
XLSX.jl copied to clipboard

File handles are not closed in readtable

Open daniel-thom opened this issue 3 years ago • 3 comments

I'm hitting a file handle leak very consistently when calling XLSX.readtable.

Steps to reproduce:

julia> using XLSX
       for i in 1:1000
           XLSX.readtable("data.xlsx", 1)
       end
ERROR: SystemError: opening file "data.xlsx": Too many open files
Stacktrace:
  [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
    @ Base ./error.jl:168
  [2] #systemerror#62
    @ ./error.jl:167 [inlined]
  [3] systemerror
    @ ./error.jl:167 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Nothing, append::Nothing)
    @ Base ./iostream.jl:293
  [5] open
    @ ./iostream.jl:282 [inlined]
  [6] ZipFile.Reader(filename::String)
    @ ZipFile ~/.julia/packages/ZipFile/fdYkP/src/ZipFile.jl:118
  [7] open_internal_file_stream
    @ ~/.julia/packages/XLSX/cRAtd/src/stream.jl:48 [inlined]
  [8] iterate(itr::XLSX.SheetRowStreamIterator, state::Nothing)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/stream.jl:80
  [9] iterate
    @ ~/.julia/packages/XLSX/cRAtd/src/stream.jl:75 [inlined]
 [10] find_row(itr::XLSX.SheetRowStreamIterator, row::Int64)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/stream.jl:256
 [11] eachtablerow(sheet::XLSX.Worksheet, cols::XLSX.ColumnRange; first_row::Int64, column_labels::Nothing, header::Bool, stop_in_empty_row::Bool, stop_in_row_function::Nothing)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/table.jl:107
 [12] eachtablerow(sheet::XLSX.Worksheet; first_row::Nothing, column_labels::Nothing, header::Bool, stop_in_empty_row::Bool, stop_in_row_function::Nothing)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/table.jl:160
 [13] gettable(sheet::XLSX.Worksheet; first_row::Nothing, column_labels::Nothing, header::Bool, infer_eltypes::Bool, stop_in_empty_row::Bool, stop_in_row_function::Nothing)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/table.jl:520
 [14] #23
    @ ~/.julia/packages/XLSX/cRAtd/src/read.jl:574 [inlined]
 [15] openxlsx(f::XLSX.var"#23#24"{Nothing, Nothing, Bool, Bool, Bool, Nothing, Int64}, filepath::String; mode::String, enable_cache::Bool)
    @ XLSX ~/.julia/packages/XLSX/cRAtd/src/read.jl:129
 [16] #readtable#22
    @ ~/.julia/packages/XLSX/cRAtd/src/read.jl:573 [inlined]
 [17] readtable
    @ ~/.julia/packages/XLSX/cRAtd/src/read.jl:573 [inlined]
 [18] top-level scope
    @ ./REPL[1]:3

The file data.xls has two rows and one column which contain 1 (A1) and 2 (A2). For whatever reason, I was not able to reproduce with 1 row.

I'm using XLSX v0.7.6 with Julia v1.6 on a Mac. This probably doesn't happen on Windows because of this line: https://github.com/felipenoris/XLSX.jl/blob/master/src/read.jl#L140

I made that call apply to all systems and the problem stopped happening.

daniel-thom avatar Jul 27 '21 21:07 daniel-thom

@daniel-thom please check if the fix on the master branch fixes your issue.

felipenoris avatar Aug 07 '21 21:08 felipenoris

Yes, the issue no longer occurs.

daniel-thom avatar Aug 07 '21 22:08 daniel-thom

@felipenoris I traced through the code and understand why this issue occurs. It's tied to Base.iterate(itr::SheetRowStreamIterator, state::Union{Nothing, SheetRowStreamIteratorState}=nothing). The first time the function is called it opens a file handle for a ZipFile.Reader. The reader is only closed when the iterator finds the end. In my opinion, this is not a great idea. Someone using the iterator may exit early, leaving the file handle open. The ZipFile.Reader does implement a finalizer to close the handle, but you have no control over when that runs (GC).

The function readtable(filepath::AbstractString, sheet::Union{AbstractString, Int}, ...) calls gettable(getsheet(xf, sheet), ...) which calls eachtablerow(sheet::Worksheet, ...). This results in three independent iterations over a SheetRowStreamIterator. The first two exit early. Only the last is closed. So, one call to readtable will leak two open file handles (unless there is only one row) until GC runs.

  1. https://github.com/felipenoris/XLSX.jl/blob/master/src/table.jl#L140
  2. https://github.com/felipenoris/XLSX.jl/blob/master/src/table.jl#L99
  3. https://github.com/felipenoris/XLSX.jl/blob/master/src/table.jl#L128

daniel-thom avatar Aug 16 '21 01:08 daniel-thom