DaemonForums - View Single Post - HOWTO: Find Duplicated Files in Directory Tree

vermaden · #20 · 27th April 2010

Quote:

Originally Posted by IdOp

Perhaps I didn't explain it well. The problem I (think I) see is if a filename contains an md5sum -- note: the file's name, not the content of the file! So for example, if you had a file called XYZ_4a26b9aa1ba28b018d5b427a16c0e1f8_.html, where the name of that file contains the md5sum 4a26b9aa1ba28b018d5b427a16c0e1f8 for another file, then the grep would pick it up by mistake; unlikely, perhaps, but not what you want. Maybe you could put a ^ on the grep pattern, or use awk?

Now I understand. Yes, the ^ should be there; todo++
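To illustrate why the ^ matters, here is a minimal Python sketch (it is not the script from this thread). It assumes a listing with the checksum as the first field, followed by whitespace and the path, as `md5 -r` or GNU `md5sum` print it. The function name `lines_with_checksum` and the sample paths are made up for illustration.

```python
import re


def lines_with_checksum(lines, checksum):
    """Return only the lines whose FIRST field is exactly `checksum`."""
    # "^" anchors the match at the start of the line, so a checksum that
    # merely appears inside a file name further right is ignored.
    pattern = re.compile("^" + re.escape(checksum) + r"\s")
    return [line for line in lines if pattern.search(line)]


CHECKSUM = "4a26b9aa1ba28b018d5b427a16c0e1f8"

# Hypothetical listing: the second file only carries the checksum in its name.
LISTING = [
    CHECKSUM + "  ./docs/original.html",
    "0123456789abcdef0123456789abcdef  ./docs/XYZ_" + CHECKSUM + "_.html",
]

# Unanchored substring search: also picks up the file name by mistake.
unanchored = [line for line in LISTING if CHECKSUM in line]

# Anchored search: only the line that really has this checksum.
anchored = lines_with_checksum(LISTING, CHECKSUM)

print(len(unanchored), len(anchored))  # prints: 2 1
```

The same effect in the shell would come from putting ^ in front of the grep pattern, as IdOp suggested.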

Quote:

Originally Posted by IdOp

Oh, one other thought came to mind. For your temporary file, it might be a good idea to use mktemp(1).

Thanks, maybe I will add that 'feature' as well.
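For comparison, a rough Python counterpart of what mktemp(1) gives a shell script. This swaps in Python's `tempfile.mkstemp` rather than calling mktemp(1) itself. The helper name `write_checksum_list` and the `duplicates.` prefix are assumptions made for the example only.

```python
import os
import tempfile


def write_checksum_list(lines):
    """Write `lines` to a newly created temporary file and return its path."""
    # mkstemp() creates the file itself, with an unpredictable name and
    # owner-only permissions, so another process cannot guess the name
    # in advance and plant a file or symlink there.
    fd, path = tempfile.mkstemp(prefix="duplicates.", suffix=".md5")
    with os.fdopen(fd, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path  # the caller is responsible for removing the file
```

Usage would be something like `path = write_checksum_list(listing)` followed by `os.remove(path)` once the file is no longer needed.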

Quote:

Originally Posted by TerryP

When the data set is as large as what Vermaden has hinted at, the speed-up would definitely be worthwhile. Since it's a reasonable postulate that two files will have differing checksums if they have different file sizes, the potential speed-up is tremendous over a large data set. Any recorded file name without another file of equal size can be ruled out as a duplicate; then every file with other files having the same size can be checksummed for the final decision ((sum=sum)=duplicate).

Maybe I will use the idea, since it will really speed things up.
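TerryP's size-first idea can be sketched roughly like this in Python. This is only an illustration of the technique under my own assumptions, not the script discussed in the thread: the names `md5_of` and `find_duplicates` are hypothetical, symlinks are skipped, and MD5 is used only because the thread uses it.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return a list of groups (sorted lists of paths) with equal content."""
    # Pass 1: group regular files by size; this needs no file reads at all.
    by_size = defaultdict(list)
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: checksum only the files that share their size with another file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, so it cannot be a duplicate
        by_sum = defaultdict(list)
        for path in paths:
            by_sum[md5_of(path)].append(path)
        duplicates.extend(
            sorted(group) for group in by_sum.values() if len(group) > 1
        )
    return duplicates
```

Calling `find_duplicates(".")` would return one list per set of identical files. Files with a unique size are never read, which is where the speed-up on a large tree comes from.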

Quote:

Second is an obvious race condition that can cause files not to be deleted. I.e. if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at the right point in time, it can be done in such a way that it won't be checksummed along with its older peers.

It's for an offline backup, so it's static and nothing changes there, but that is only my case. In a production environment we can temporarily make the directory tree that we are checking READ-ONLY.

Quote:

Originally Posted by IdOp

heheh, thanks, glad you enjoyed it TerryP.

Quote:

Originally Posted by Carpetsmoker

... Or you can just use Python/Ruby/Perl/etc ...
Makes things a whole lot easier IMHO. Basic shell scripting is simply too limited for any serious programming, and simple tasks are often easier/faster.

I need to learn Python some day, but currently the Oracle 11g database is on my todo list (since I am attending the Workshop I and II trainings through my job).

Quote:

Originally Posted by Carpetsmoker

If you compare Vermaden's script with mine for example you'll see that mine is actually fewer lines while it does the same, is more readable, more portable, and easier to modify should you want to.

... but mine provides 3 methods to compare files. If you strip out the comparisons by name and by size and remove the comments from both scripts, mine will end up even a little shorter.

I also do not intend to write scripts that will work on any kind of OS other than UNIX, but this one would also work on Windows. I would only have to install CYGWIN there (it does not matter to me whether it is python.exe or cygwin.exe that I must install anyway), but that is probably a matter of preference.