CSV.jl

How can I delete a file after reading its contents with CSV.Rows?

Open • guilhermebodin opened this issue 4 years ago • 9 comments

I am trying to delete a file after reading its content but I am getting permission denied. This code

using CSV
using Random
Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        Base.write(f, join([randstring('a':'z') for _ in 1:8], ","))
        Base.write(f, "\n")
    end
end
for r in CSV.Rows("test.csv")
    #something
end
rm("test.csv")

produces

ERROR: IOError: unlink("test.csv"): permission denied (EACCES)

guilhermebodin • Sep 09 '21 00:09

Probably the most reliable way is to do something like:

for r in CSV.Rows("test.csv")
    # something
end
GC.gc(); GC.gc()
rm("test.csv")

The problem is that the file isn't technically "released" until the CSV.Rows object gets garbage-collected. This is somewhat of an open problem in Julia. We could make this a tad more formal by defining a finalizer function for CSV.Rows, but you'd still have to call finalize(rows) when you wanted the file "released".
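
As a rough illustration of the finalizer idea, here is a minimal sketch that uses only Base Julia. ToyReader is a hypothetical stand-in for CSV.Rows, and it holds an open IOStream rather than whatever buffer CSV.Rows really keeps, so this shows the mechanism only, not CSV.jl's internals. The file name toy.csv is also made up.

```julia
# ToyReader is a hypothetical type for illustration; it is not part of CSV.jl.
mutable struct ToyReader
    io::IOStream
    function ToyReader(path)
        reader = new(open(path, "r"))
        # Registered finalizers run when the object is garbage-collected,
        # or immediately when finalize(reader) is called explicitly.
        finalizer(r -> close(r.io), reader)
        return reader
    end
end

path = joinpath(mktempdir(), "toy.csv")   # scratch file in a temp directory
write(path, "a,b\n1,2\n")

reader = ToyReader(path)
header = readline(reader.io)              # use the resource while it is live

finalize(reader)                          # release the handle now, without waiting for GC
rm(path)                                  # nothing holds the file open any more
```

After finalize(reader) the object must not be used for reading again, which is why a real implementation would also need validity checks.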

quinnj • Sep 09 '21 02:09

Well, I can totally call a finalizer; that does not bother me as a user. Currently, my attempt consists of doing something like this:

rows_iterator = CSV.Rows("test.csv")
for r in rows_iterator
    # something
end
rows_iterator = nothing
GC.gc()

guilhermebodin • Sep 09 '21 02:09

Yes, that should work, though sometimes you have to call GC.gc() twice in order to fully collect an object.
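
Since the number of GC.gc() calls needed is not fixed, one option is to retry the delete after each collection. The helper below is a sketch only: rm_after_gc and its passes keyword are hypothetical names, not part of Base or CSV.jl. It assumes every reference to the CSV.Rows object has already been dropped (for example with rows_iterator = nothing), because the garbage collector cannot free an object that is still reachable.

```julia
# Hypothetical helper: run a full GC pass, try to delete, and repeat if needed.
function rm_after_gc(path::AbstractString; passes::Integer = 3)
    for attempt in 1:passes
        GC.gc()                  # full collection; also runs pending finalizers
        try
            rm(path)
            return attempt       # number of GC passes that were needed
        catch err
            # Swallow only I/O errors (such as EACCES), and only while retries remain.
            (err isa Base.IOError && attempt < passes) || rethrow()
        end
    end
end

# Usage with a scratch file that nothing holds open:
path = joinpath(mktempdir(), "scratch.csv")
write(path, "a,b\n1,2\n")
passes_used = rm_after_gc(path)
```

If the file is still locked after the last pass, the original IOError propagates to the caller instead of being hidden.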

quinnj • Sep 09 '21 02:09

If we go the finalizer route, we'll just have to add some checks in other places, like iterate, to ensure a CSV.Rows is still "valid" and hasn't been finalized, since using a finalized object would lead to really bad scenarios.

quinnj • Sep 09 '21 02:09

If you give me a little guidance or a small sketch I will be happy to open a PR. :)

guilhermebodin • Sep 09 '21 04:09

Alright, sorry for the slow response here. It might be a little hairy, but here's some guidance/sketch, though I'll admit I haven't thought this through all the way to the end (hence a sketch!):

  • Add a finalized::Base.RefValue{Bool} field to the Rows struct (alternatively we could make this finalized::Threads.Atomic{Bool} if we're worried about thread safety)
  • Add an official CSV.releaseinput user-facing API function; this would set rows.finalized[] = true, and call finalize(rows.ctx.buf), which should release the mmapping of the input file
  • Update the Rows iterate method to check if it's been finalized and if so, return nothing or throw an error
  • Probably need to pass the finalized field to the Row2 struct as well, and check in getcolumn whether the input has been finalized. Maybe not for every column, since we're saving the values in the values field; but PosLen values just point into the original input, so they would be invalid once it is released. So I do think we'd need to check that the original buf is still valid in getcolumn, for PosLen values at least. This is probably the hairiest part, where there could be corner cases. The thing we'd want to avoid is someone holding a "row" (CSV.Row2), then using that row after CSV.releaseinput has been called on its CSV.Rows object, with getcolumn doing something invalid by interacting with a finalized buf.
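
The steps above can be modeled with a small toy, shown here under heavy assumptions: ToyRows, ToyRow, releaseinput!, value, and checkvalid are all hypothetical names, the buffer is a plain Vector{UInt8} instead of the real rows.ctx.buf, and rows are split on newlines only. It is meant to show how a shared finalized flag can guard both iteration and later access from a row that points into the buffer, not how CSV.jl would actually implement it.

```julia
# Toy model of the proposal; none of these names are CSV.jl API.
struct ToyRows
    buf::Vector{UInt8}                # stands in for rows.ctx.buf
    finalized::Base.RefValue{Bool}    # shared "input released" flag
end
ToyRows(text::String) = ToyRows(Vector{UInt8}(text), Ref(false))

struct ToyRow
    buf::Vector{UInt8}
    range::UnitRange{Int}             # position range into buf, loosely like a PosLen value
    finalized::Base.RefValue{Bool}    # the same Ref as the parent ToyRows
end

# Throw if the input was released; otherwise do nothing.
checkvalid(flag) = flag[] && throw(ArgumentError("input has been released"))

function releaseinput!(rows::ToyRows)
    rows.finalized[] = true
    # No-op for a plain Vector; in the proposal, this is the point where
    # finalizing the real buffer would release the file.
    finalize(rows.buf)
    return nothing
end

function Base.iterate(rows::ToyRows, pos::Int = 1)
    checkvalid(rows.finalized)
    pos > length(rows.buf) && return nothing
    newline = findnext(==(UInt8('\n')), rows.buf, pos)
    stop = something(newline, length(rows.buf) + 1)
    return ToyRow(rows.buf, pos:stop-1, rows.finalized), stop + 1
end

# Analogue of getcolumn for a value that only points into the input buffer.
function value(row::ToyRow)
    checkvalid(row.finalized)
    return String(row.buf[row.range])
end

rows = ToyRows("a,b\n1,2\n")
row, _ = iterate(rows)
first_value = value(row)    # safe: the input has not been released yet
releaseinput!(rows)
# From here on, both iterate(rows) and value(row) throw an ArgumentError
# instead of reading from a released buffer.
```

Because the row shares the same Ref as its parent, releasing the input invalidates rows that were handed out earlier. A Threads.Atomic{Bool} could take the place of the Ref if thread safety matters, as the first bullet suggests.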

quinnj • Sep 15 '21 05:09