
Unzipping is too slow

Open aegoroff opened this issue 6 years ago • 12 comments

When I tried to unzip a big file (about 3 GiB as xz and about 18 GiB unpacked), the process was too slow: only 3 GiB of the 18 were unpacked in about 40 minutes on my machine. The same file was unpacked in about 5 minutes using the 7-Zip tool.

aegoroff avatar Nov 06 '18 19:11 aegoroff

Thank you for reporting. This is expected, and I have the following language in README.md:

At this time the package cannot compete with the xz tool regarding compression speed and size.

I haven't found the time so far to work on code optimization. On the plus side, there is a lot of potential for improving the situation. Unfortunately, I cannot promise when I will work on it.

ulikunitz avatar Nov 08 '18 21:11 ulikunitz

There is work ahead. I left the issue open.

ulikunitz avatar Feb 11 '21 21:02 ulikunitz

I just ran into slow decompression and the (partial) solution is to wrap your reader in bufio.NewReader(). It turns out this library uses ReadByte() a great deal and on unbuffered input this is incredibly slow.

I say "partial" as unfortunately this fails on some inputs with

writeMatch: distance out of range

Very weird that it fails when buffered but works when unbuffered.

alecthomas avatar Feb 19 '21 08:02 alecthomas

Yes, the library doesn't implement its own buffering and because it uses ReadByte it benefits from buffered readers. I should have documented it.

The rationale at the time was that I wanted to use a buffered reader only if there is a need for it. For instance, I didn't want to use a buffered reader for a bytes.Buffer.

A buffered reader shouldn't make a difference for the reading process. The gxz tool is using a buffered reader and I have run extensive tests for it.

Can you provide the file that you want to decompress?

ulikunitz avatar Feb 19 '21 20:02 ulikunitz

Sure, I was decompressing the Zig tarballs from here.

alecthomas avatar Feb 19 '21 22:02 alecthomas

Fixed!

alecthomas avatar Feb 19 '21 22:02 alecthomas

I have now downloaded all 0.8.0 files and decompressed them with the gxz tool, which uses bufio.Reader, and there were no problems decompressing any of them.

Please provide:

  • name of the actual file generating issues
  • version of the xz module
  • the code you are using to decompress the file
  • output of go env

ulikunitz avatar Feb 19 '21 23:02 ulikunitz

Oh you're asking for the failing one, sorry, that wasn't clear - I thought you were asking for one of the slow ones.

alecthomas avatar Feb 19 '21 23:02 alecthomas

This is the one that fails. Interestingly it also fails with github.com/xi2/xz

alecthomas avatar Feb 19 '21 23:02 alecthomas

Hi, this is a deb file, which is an ar file. You must do the following:

$ ar xv bzip2_1.0.6-9.2_deb10u1_amd64.deb 
x - debian-binary
x - control.tar.xz
x - data.tar.xz

The two xz files can easily be uncompressed and generate no issues for me. The debian-binary is a plain-text file. Information about the deb format can be found in the manual page for deb.

ulikunitz avatar Feb 20 '21 08:02 ulikunitz

I reran my test using @alecthomas's suggestion. It is still slower than xi2/xz, but it was a huge speedup:

package test

import (
   "archive/tar"
   "bufio"
   "io"
   "os"
   "path"
   "testing"
   ulikunitz "github.com/ulikunitz/xz"
   xi2 "github.com/xi2/xz"
)

const cargo = "cargo-1.54.0-x86_64-pc-windows-gnu.tar.xz"

func readFrom(r io.Reader) error {
   tr := tar.NewReader(r)
   for {
      n, err := tr.Next()
      if err == io.EOF {
         break
      } else if err != nil {
         return err
      } else if n.Typeflag != tar.TypeReg {
         continue
      }
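      // os.ModeDir has no permission bits set, so pass an explicit mode.
      if err := os.MkdirAll(path.Dir(n.Name), 0o755); err != nil {
         return err
      }
      f, err := os.Create(n.Name)
      if err != nil {
         return err
      }
      // Close inside the loop; a defer here would keep every file open
      // until readFrom returns.
      _, err = f.ReadFrom(tr)
      if cerr := f.Close(); err == nil {
         err = cerr
      }
      if err != nil {
         return err
      }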
   }
   return nil
}

// 0.905s
func TestUlikunitz(t *testing.T) {
   f, err := os.Open(cargo)
   if err != nil {
      t.Fatal(err)
   }
   defer f.Close()
   r, err := ulikunitz.NewReader(bufio.NewReader(f))
   if err != nil {
      t.Fatal(err)
   }
   if err := readFrom(r); err != nil {
      t.Fatal(err)
   }
}

// 0.614s
func TestXi2(t *testing.T) {
   f, err := os.Open(cargo)
   if err != nil {
      t.Fatal(err)
   }
   defer f.Close()
   r, err := xi2.NewReader(f, 0)
   if err != nil {
      t.Fatal(err)
   }
   if err := readFrom(r); err != nil {
      t.Fatal(err)
   }
}

89z avatar Aug 19 '21 00:08 89z

I used xz to unpack Python-3.11.4.xz. Using Python 3.10 it took 4sec; using Go it took 1m55sec. So I do think Go xz has a speed issue.

I just tried github.com/therootcompany/xz and it took 5sec.

mark-summerfield avatar Aug 19 '23 07:08 mark-summerfield

I posted this two years ago but it got deleted. Here it is again. It should help with the speed:

package test

import (
   "archive/tar"
   "bufio"
   "github.com/ulikunitz/xz"
   "io"
   "os"
   "path"
   "testing"
)

const cargo = "cargo-1.54.0-x86_64-pc-windows-gnu.tar.xz"

func readFrom(r io.Reader) error {
   tr := tar.NewReader(r)
   for {
      n, err := tr.Next()
      if err == io.EOF {
         break
      } else if err != nil {
         return err
      } else if n.Typeflag != tar.TypeReg {
         continue
      }
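      // os.ModeDir has no permission bits set, so pass an explicit mode.
      if err := os.MkdirAll(path.Dir(n.Name), 0o755); err != nil {
         return err
      }
      f, err := os.Create(n.Name)
      if err != nil {
         return err
      }
      // Close inside the loop; a defer here would keep every file open
      // until readFrom returns.
      _, err = f.ReadFrom(tr)
      if cerr := f.Close(); err == nil {
         err = cerr
      }
      if err != nil {
         return err
      }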
   }
   return nil
}

func TestUlikunitz(t *testing.T) {
   f, err := os.Open(cargo)
   if err != nil {
      t.Fatal(err)
   }
   defer f.Close()
   r, err := xz.NewReader(bufio.NewReader(f))
   if err != nil {
      t.Fatal(err)
   }
   if err := readFrom(r); err != nil {
      t.Fatal(err)
   }
}

ghost avatar Aug 19 '23 14:08 ghost