Quote:
Originally Posted by TerryP
... brought a smile to this programmer's heart.
|
heheh, thanks, glad you enjoyed it TerryP.
Quote:
It only adds two dilemmas: [*]First, that although doing that algorithm is not likely to be hard, it is more naturally done using hashes (as in ksh) than the usual portable sh trick of treating a scalar $variable as a list of words: which can be manipulated using filters and variable=`assignments` (or $() if a modern sh is guaranteed: older sh's only supported ``). Lisp is quite a bit better at list processing than generic shell scripting.
|
I'm not too familiar with that aspect of ksh, nor am I awake enough to absorb all of this comment at the moment; I also haven't thought much about how to do it. That said, one vague thought was to put the output of the size step into a file, suitably formatted for easy use by the second md5 step. The file is probably cached by the OS anyway, and this is probably a case where the algorithm matters more than the hardware. I guess you might also do something recursively, which may be part of what you have in mind?
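To make the vague thought above concrete, here is a rough sketch in portable sh (no ksh hashes): pass one writes each file's size to a temporary file, and pass two checksums only files whose size collides with another's. The demo directory and file names are made up for illustration, and the tool names (wc, md5sum) are assumptions; a BSD box might want md5 instead of md5sum.

```shell
# Demo tree: two files with identical contents, one with a unique size.
demo=$(mktemp -d)
printf 'hello\n'   > "$demo/a"
printf 'hello\n'   > "$demo/b"   # duplicate of a: same size, same sum
printf 'goodbye\n' > "$demo/c"   # unique size: skipped, never checksummed

# Pass one: record "size filename" for every file, sorted by size.
sizes=$(mktemp)
find "$demo" -type f -exec wc -c {} \; | sort -n > "$sizes"

# Only sizes seen more than once are duplicate candidates.
candidates=$(awk '{print $1}' "$sizes" | uniq -d | tr '\n' ' ')

# Pass two: checksum just the candidate files.
result=$(while read size name; do
    case " $candidates" in
        *" $size "*) md5sum "$name" ;;
    esac
done < "$sizes")

echo "$result"

rm -rf "$demo" "$sizes"
```

The skip list pays off because the unique-size file never reaches md5sum at all; only the size-colliding pair gets hashed, and matching checksums then confirm the actual duplicates.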
Quote:
[*]Second is an obvious race condition that can cause files not to be deleted. I.e. if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at the right point in time, it can be done in such a way that it won't be checksummed along with its older peers.
|
Wouldn't that kind of problem be there anyway? Someone could create or delete a file while find was walking the tree, say? Just a thought; I don't know nearly enough about such things to be sure. Of course, adding a second step might widen the window, yet if the whole thing is faster than a lengthy md5sum over the whole tree ... but, yes, if such problems exist then it's caveat emptor for the script user.
Quote:
For Vermden's purposes, I reckon such concerns are likely of esoteric value only: but the file size driven skip list idea is a great idea.
|
Thanks again, I'm glad if it seems like a good idea.