Posted by TerryP, 26th April 2010
Quote:
Originally Posted by IdOp
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified.)
That is actually a refreshing perspective, IdOp: you've brought a smile to this programmer's heart.

When the data set is as large as what Vermaden has hinted at, the speed-up would definitely be worthwhile. Since it is a reasonable postulate that two files with different sizes must have different checksums, the potential speed-up over a large data set is tremendous. Any file without another file of equal size can be ruled out as a duplicate immediately; only the files that share a size with at least one other file need to be checksummed for the final decision (equal size and equal sum = duplicate).
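A rough sketch of what that size-first pass might look like in plain sh, assuming a BSD userland (stat -f, md5 -q); the temp-file names and overall layout are made up for illustration and this is not Vermaden's script:

Code:
#!/bin/sh
# Size-as-a-quick-hash sketch: only files whose size occurs more than
# once ever get checksummed. Assumes BSD stat/md5; on Linux you would
# use stat -c '%s %n' and md5sum instead.
dir=${1:-.}
list=/tmp/sizes.$$
dupes=/tmp/dupsizes.$$

# 1. Record "size path" for every regular file, sorted by size.
find "$dir" -type f -exec stat -f '%z %N' {} + | sort -n > "$list"

# 2. Keep only the sizes that occur more than once; everything else
#    cannot possibly have a duplicate.
awk '{ print $1 }' "$list" | uniq -d > "$dupes"

# 3. Checksum the candidates only; same size + same md5 => duplicate.
#    (A real script would need to be more careful with odd file names.)
while read -r size path; do
    grep -qx "$size" "$dupes" || continue
    printf '%s %s\n' "$(md5 -q "$path")" "$path"
done < "$list" | sort | awk 'seen[$1]++ { print "duplicate:", $2 }'

rm -f "$list" "$dupes"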

It only adds two dilemmas:
  • First, although that algorithm is not likely to be hard to implement, it is more naturally done with hashes (as in ksh) than with the usual portable sh trick of treating a scalar $variable as a list of words, manipulated with filters and variable=`assignments` (or $() if a modern sh is guaranteed; older sh's only supported ``). Lisp is quite a bit better at list processing than generic shell scripting (see the sketch after this list).
  • Second, there is an obvious race condition that can cause files not to be deleted. I.e. if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at just the right point in time, it can be done in such a way that the new file won't be checksummed along with its older peers.
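To make the first point concrete, here is a hypothetical side-by-side of the two styles; the variable names are made up and the BSD stat invocation is an assumption:

Code:
#!/bin/sh
# Portable-sh style: a scalar $sizes variable abused as a word list,
# with membership tested via case. (Hypothetical illustration only.)
sizes=""
for f in ./*; do
    [ -f "$f" ] || continue
    s=`stat -f '%z' "$f"`            # backticks keep older sh's happy
    case " $sizes " in
        *" $s "*) echo "$f shares a size with an earlier file" ;;
        *)        sizes="$sizes $s" ;;
    esac
done

Code:
#!/bin/ksh93
# ksh93 style: an associative array keyed on file size.
typeset -A count
for f in ./*; do
    [ -f "$f" ] || continue
    s=$(stat -f '%z' "$f")
    count[$s]=$(( ${count[$s]:-0} + 1 ))
done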

Then again, sequentially processing the contents of a directory without first exclusively locking the entire directory (and having the OS enforce those locks) is a similar problem in its own right, as is obtaining the locks suitably xD.

For Vermaden's purposes, I reckon such concerns are of esoteric value only; but filtering on file size first is a great idea.