dat-node icon indicating copy to clipboard operation
dat-node copied to clipboard

dat.archive.readFile is very slow and freezes main thread

Open k2s opened this issue 7 years ago • 11 comments

I host public archive of 19333 files and 3.1GB in total. I try to dowload single file with example inside my nodejs server project.

The dat.archive.readFile will call callback after few seconds:

  • which is quite long if I host it on the same PC as running the client part
  • much bigger problem is, that it freezes the whole thread of my server project for the whole time

I found some interesting comment when I was searching investigating in this issue:

maybe we could use async methods writeFile and readFile just in case someone is using dat-node in an application since the sync methods will stop the thread

k2s avatar May 12 '17 06:05 k2s

Hey @k2s, a few seconds between connection and the callback sounds about right to me. Before the read file callback happens, a few things need to go first:

  1. connect to the peer
  2. load the metadata
  3. buffer the file into memory

You can separate each step out a bit to see what happens when, and that may help us see what is taking so long:

Dat('/my-dir', {key: '<key>', sparse: true}, function (err, dat) {
  var network = dat.joinNetwork()
  network.on('connection', function () {
    console.log('connected')
  })

  var archive = dat.archive
  archive.update(function () {
    console.log('metadata received')
  })
  
  archive.readFile('/cat-locations.txt', function (err, content) {
    console.log(content) // prints cat-locations.txt file!
  })

  console.log('archive.readFIle should not block this') // check blocking
})

The connection and metadata loading should be fairly quick. Depending on the file size, it may take a few second to buffer the whole file into memory.

much bigger problem is, that it freezes the whole thread of my server project for the whole time

Regarding blocking, see the last console.log statement. That should print before the file, otherwise we have some bug in hyperdrive. The archive.readFile should be async and not block the thread. However, the content still has to be downloaded all into memory before anything prints.

Hyperdrive also has a file read stream function. Depending on your usage, the stream may be better.

joehand avatar May 12 '17 15:05 joehand

Thank you. Based on your example I tested the download in standalone module

function runme () {
  const hashKey = 'e4be150d67b89b840153f1ef63beba393e2574e7011691cb7d9d4f95c77fe6be'
  const fn = '/Dokumenty/PIL100193.doc'

  const Dat = require('dat-node')

  let time = process.hrtime()
  console.log('START')
  setInterval(() => {
    console.log(`${process.hrtime(time).join('.')}`)
  }, 500)

  Dat('/tmp/dat-test', {
    key: hashKey,
    sparse: true
  }, function (err, dat) {
    if (err) {
      console.error(err)
      process.exit(1)
    }

    // Join the network
    let network = dat.joinNetwork(/*opts*/)
    network.on('connection', function () {
      console.log('connected')
    })

    let archive = dat.archive
    archive.metadata.update(function () {
      console.log('metadata received')
    })

    // archive.readFile(fn, function (err, content) {
    //   console.log(LOG, 'file loaded')
    // })
    let stream = archive.createReadStream(fn)
    stream.on('data', (chunk) => {
      console.log(`Received ${chunk.length} bytes of data.`)
    })

    console.log('archive.readFIle should not block this') // check blocking
  })
}

module.exports = runme()

and it works fast and without freezing (not sure why metadata received is never printed):

START
archive.readFIle should not block this
connected
0.732282505
1.379859997
connected
Received 51712 bytes of data.
1.880138862
2.380918008

Then I required it into my loopback.io server application and it freezes 3 sec (up to 10 sec with --debug-brk):

START
archive.readFIle should not block this
connected
connected
0.938162042
8.984342279
10.572152384
11.777712378
connected
Received 51712 bytes of data.
12.278943600
12.780083554

I never had main thread freezeing issues with my server, but probably the problem is not with Dat itself.

k2s avatar May 14 '17 21:05 k2s

Seems like maybe the serve is just slower to download the data. Is there something that makes you think its freezing, rather than just going slow?

It may be worth seeing what type of connections you're getting (utp is a lot slower right now, still optimizing that):

network.on('connection', function (conn, info) {
  console.log('connected, type:', info.type)
})

joehand avatar May 15 '17 16:05 joehand

There is setIntertval in my code which should print cca. 2 messages per second. When I run the code standalone then you see on the output that it is true.

There are no printouts between 0.938162042 to 8.984342279 when I run the same code as part of my loopback.io based server. That is why I believe that the main thread si frozen and setTimer has no chance to log the time.

Of course I tested that printout with my server app without to start Dat and it is not freezing. So it is problem of combining Dat and my server code.

k2s avatar May 15 '17 18:05 k2s

the connection is TCP:

START
archive.readFIle should not block this
connected, type: tcp
0.950404102
9.65151944
10.691169550
11.976427803
connected, type: tcp
Received 51712 bytes of data.
12.476159894
12.977583605
connected, type: tcp
13.478110319
13.978256003
14.478456696
14.978896758

k2s avatar May 15 '17 18:05 k2s

There is setIntertval in my code which should print cca. 2 messages per second. When I run the code standalone then you see on the output that it is true.

Ah yes, missed that!

Just to check, if you do everything the same but remove the file read stream, does it print out as expected?

joehand avatar May 15 '17 18:05 joehand

@joehand good point. it works fine if I don't request the stream:

START
archive.readFIle should not block this
connected, type: tcp
connected, type: tcp
0.500103554
1.885718
1.503967649
2.4725613
2.506437519

it freezes also with archive.readFile alternative :-(

k2s avatar May 15 '17 18:05 k2s

Ok, one thing down =). What if you try other functions, like archive.stat or archive.readdir?

I'm wondering if it is the downloading the metadata that is causing it to freeze, since its a large archive, instead of the readFile commands.

You could listen to all the download events to see what is going on a bit more:

archive.metadata.on('download', function (block) {
 console.log('block downloaded of size', block.length)
})
// may need to do archive.on('content') first
archive.content.on('download', function (block) {
 console.log('block downloaded of size', block.length)
})

Check out this issue too, seems like similar problems with large metadata perhaps.

joehand avatar May 15 '17 18:05 joehand

this whole code block doesn't do anything for me:

    archive.metadata.update(function () {
      console.log('metadata received')
    })
    archive.metadata.on('download', function (block) {
      console.log('block downloaded of size', block.length)
    })
    // may need to do archive.on('content') first
    archive.content.on('download', function (block) {
      console.log('block downloaded of size', block.length)
    })

I don't see any of this documented in https://github.com/mafintosh/hyperdrive#var-stream--archivecreatereadstreamname-options.

  • archive.stat() on the same file freeezes main thread
  • archive.readdir('/') is fast - only 4 items
  • archive.readdir('/Dokumenty') freezes main thread - 19333 items

The main issue for me is that it does works acceptable as standalone program and the freezing of main thread happens only together with the server code.

k2s avatar May 15 '17 19:05 k2s

The main issue for me is that it does works acceptable as standalone program and the freezing of main thread happens only together with the server code.

Ya this is strange. Is the data already downloaded/cached when you run it as a standalone program?

I don't see any of this documented in https://github.com/mafintosh/hyperdrive#var-stream--archivecreatereadstreamname-options.

Still missing some docs in hyperdrive, sorry. Some of those are documented in hypercore. archive.content and archive.metadata are each a hypercore feed, so you can use anything in the hypercore API on those two objects.

Definitely surprised that those download events aren't firing. The data isn't already downloaded, right?

It looks like the thread freezing is something in accessing the metadata. But it doesn't make sense why it is different for the server (unless the files already downloaded are different).

Thanks for going through my questions, super helpfuL! If you are motivated to, it'd be great to have a test case in hyperdrive or easily reproducible bug. I also have a big dat archive (though its structured to have small directories) at home I can start testing this with later today.

joehand avatar May 15 '17 19:05 joehand

I encountered a similar issue with readDir - and noticed that it was the syncing of the metadata that was stopping readDir’s callback from firing. I opened this as an issue here - https://github.com/mafintosh/hypercore/issues/98 - and @mafintosh suggested the use of cashed option which enables you to read info from the metadata.data without requesting a sync. I also opened a small pr adding the documentation to hyperdrive - https://github.com/mafintosh/hyperdrive/pull/163.

It would be nice to expose some sort of progress on the syncing (perhaps an event emitter), as at the moment for large dat's it does look like it has frozen.

e-e-e avatar May 15 '17 22:05 e-e-e