
Index not auto-updating

drscotthawley opened this issue 3 years ago

Hi, sorry if I missed something basic in the documentation. I thought the point of duc was that, unlike du which has to be run each time you want to know disk usage, duc maintains an index of usage and runs fast, so that I run duc index once (which takes about as long as du) and then can get usage info really fast, much faster than du.

But I'm noticing that when my directories grow, duc ls keeps showing the same old size from when it was initially indexed, i.e. it is not updating to track changes.

How do we enable this?

(If I have to re-run duc index every time I want to see a valid usage list, I might as well just run du. Not interested in graphs, etc., just fast usage info.)

drscotthawley avatar Aug 07 '22 00:08 drscotthawley

"Scott" == Scott H Hawley @.***> writes:

Scott> Hi, sorry if I missed something basic in the documentation. I
Scott> thought the point of duc was that, unlike du which has to be
Scott> run each time you want to know disk usage, duc maintains an
Scott> index of usage and runs fast, so that I run duc index once
Scott> (which takes about as long as du) and then can get usage info
Scott> really fast, much faster than du.

duc needs to re-index if you want to see any changes made since the last index was run. All duc does is query the index when you ask it questions; it doesn't re-index the disk(s).
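(One way to check how stale your data is: duc's info subcommand, which to my recollection lists each indexed path with the date and time it was indexed. A sketch:

    duc info    # shows when each indexed path was last scanned
)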

Scott> But I'm noticing that when my directories grow, duc ls keeps
Scott> showing the same old size from when it was initially indexed,
Scott> i.e. it is not updating to track changes.

Correct.

Scott> How do we enable this?

You really don't want this to happen that often. Think how painful running 'du' all the time can be. What most people do is have duc run nightly (or weekly, or whatever schedule) to update the index.
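For example, a crontab entry along these lines keeps the index fresh (the schedule, database path, and filesystem path here are just placeholders):

    # hypothetical crontab entry: re-index /srv/data every night at 02:30
    30 2 * * *  duc index -d /var/cache/duc/data.db /srv/data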

Scott> (If I have to re-run duc index every time I want to see a valid
Scott> usage list, I might as well just run du. Not interested in
Scott> graphs, etc., just fast usage info.)

Sure, that's what you can do.

Duc really isn't designed for a single user to use on their laptop; it's more for large filesystems on large systems where a 'du' scan would take hours or days to complete, and would put an unacceptable load on the server and/or storage subsystem.

By building the index, you can investigate the system and drill down into the details (hey, this directory is now 2tb in size, why?) without having to rerun lots of 'du' commands.
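In practice the workflow is something like this (paths are hypothetical; duc ui assumes duc was built with the ncurses front end):

    duc index -d /var/cache/duc/data.db /srv/data   # slow: run nightly from cron
    duc ls -d /var/cache/duc/data.db /srv/data      # instant: answers come from the index
    duc ui -d /var/cache/duc/data.db /srv/data      # interactive drill-down into the tree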

John

l8gravely avatar Aug 09 '22 20:08 l8gravely

Hi, did I understand correctly that I cannot update the index incrementally with duc, but must scan all files and folders again and again, even though they have not changed?

privnote42 avatar Sep 29 '22 02:09 privnote42

> Did I understand correctly that I cannot update the index incrementally with duc, but must scan all files and folders again and again, even though they have not changed?

Yes, you have to re-scan all the files, because otherwise how will duc know when there have been changes? But yes, you can rebuild the index in place; I've found it simpler to just have a cronjob which does:

    # index each filesystem into a temporary DB, then swap it in only on success
    for f in home data scratch; do    # hypothetical filesystem names
        if duc index -d "/tmp/$f.db" "/$f"; then
            mv "/tmp/$f.db" "/real/path/to/dbs/$f.db"
        else
            echo "error indexing $f, db not updated" >&2
        fi
    done

This is just off the top of my head, and is probably wrong, but the idea is there. If the index builds properly, then move it over the old index. Otherwise bail out.

The idea behind duc is to amortise the cost of a single index run across many accesses to the DB, which is just so much faster. I have some 10TB filesystems with 30 million files. Not having to run 'du' all the time to see what changed is fantastic.

Cheers, John

l8gravely avatar Oct 11 '22 07:10 l8gravely

duc could check the timestamps of the subfolders, couldn't it?

luckycloud-GmbH avatar Oct 12 '22 13:10 luckycloud-GmbH

duc could check the timestamps of the subfolders, couldn't it?

Only if it can also confirm that there are no sub-directories; cf. the -noleaf option to GNU find.

stuartthebruce avatar Oct 12 '22 15:10 stuartthebruce

ok, that's a valid point. But wouldn't it still be faster to recursively search for new files in sub-directories instead of indexing everything again and again?

luckycloud-GmbH avatar Oct 12 '22 15:10 luckycloud-GmbH

"stuartthebruce" == stuartthebruce @.***> writes:

> duc could check the timestamps of the subfolders, couldn't it?

> Only if it can also confirm that there are no sub-directories; cf. the -noleaf option to GNU find.

As Stuart says, there's no way to look at the directory timestamp to know if files/directories have changed more than one level below. Which is why you have to rescan.
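A quick illustration of why the top-level timestamp isn't enough (hypothetical paths; stat -c is GNU coreutils syntax):

    mkdir -p top/sub
    stat -c '%y %n' top        # note top's mtime
    sleep 1
    touch top/sub/newfile      # change something two levels down
    stat -c '%y %n' top        # unchanged: only top/sub's mtime was updated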

And when you have 10TB of data with 3 million files or more, you don't want to scan very often; you just want to be able to target the low-hanging fruit.

Now what might be interesting is a way to find the top N largest files, since they give the most bang for the buck in terms of reducing filesystem usage. I've got a perl script I've used in the past for this, which would email my users. And another script which looked at Netapp quota reports and emailed users as well.
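For what it's worth, GNU find can produce such a list directly (the path and N here are placeholders):

    # top 20 largest files under /srv/data, biggest first, sizes in bytes
    find /srv/data -type f -printf '%s\t%p\n' | sort -rn | head -n 20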

It's a multi-faceted problem space: collecting the data is expensive, so it really cannot be done in real time.

John

l8gravely avatar Oct 12 '22 15:10 l8gravely

"luckycloud-GmbH" == luckycloud-GmbH @.***> writes:

> ok, that's a valid point.

> But wouldn't it still be faster to recursively search for new files in sub-directories instead of indexing everything again and again?

Nope, because looking for new files (using find, say) is just like duc indexing. It's the same

    func findit(dir) {
        opendir(dir)
        while (entry = readdir()) {
            if (entry is dir)  findit(entry)
            if (entry is file) add_to_index(entry)
        }
        closedir(dir)
    }

loop, with recursion. Scanning the filesystem is slow when you get to large filesystems, which is why duc only does it once, unless you ask it to re-index to find changes.
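Even an "incremental" search pays that full cost. For instance, GNU find can report files changed since the last index run (paths hypothetical), but it still has to visit every directory in the tree to find them:

    find /srv/data -newer /var/cache/duc/last-index.stamp -print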

Now maybe there's an idea to somehow keep two copies so you can show the change between runs in the DB, but I've spent no more than two seconds thinking about the issues there.

l8gravely avatar Oct 12 '22 15:10 l8gravely

I feel like my question has been answered, given that I misunderstood the basic principles of duc's operation.
I might file a feature request asking for this clarification to be added to the documentation.

Presumably I could set up a cron job that re-indexes, say, once daily. A related precedent is disk-usage utilities for Windows & Mac, where you need to manually re-scan for the information to be up to date.

So, I understand others may still have issues and questions, but since I'm the one who opened this issue and I consider it to be resolved, I'm closing it.

drscotthawley avatar Oct 12 '22 16:10 drscotthawley

> Cron job that reindexes, say once daily.

Good question. The manual is not very explicit about this.

michaelfresco avatar Jan 31 '23 02:01 michaelfresco

> Yes you have to re-scan all the files, because otherwise how will duc know when there have been changes?

Couldn't you get file change events with inotify to identify when a file needs updating in the index?

dantheperson avatar May 17 '23 09:05 dantheperson

"dantheperson" == dantheperson @.***> writes:

> Yes you have to re-scan all the files, because otherwise how will duc know when there have been changes?

> Couldn't you get file change events with inotify to identify when a file needs updating in the index?

That would imply that the duc indexer is running all the time, and that we can efficiently insert changes into the middle of the index. I'd want to batch changes as well.
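For reference, such a watcher built on inotify-tools would look roughly like this (a sketch; /srv/data is a placeholder, and recursive watches over tens of millions of files will hit the fs.inotify.max_user_watches limit):

    # stream create/delete/modify/move events for the whole tree
    inotifywait -m -r -e create,delete,modify,move --format '%e %w%f' /srv/data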

But remember, duc isn't for real-time stats; it's for large volumes with lots and lots of files that take forever to search by hand. duc does all that work for you and lets you visually mine it for the problem spots.

l8gravely avatar May 23 '23 16:05 l8gravely