I wrote this script some time ago to find duplicated files; they can be compared by file name, size, or MD5 checksum. Feel free to point out any issues with it.
It works on FreeBSD and Linux and is partially ported to Solaris, but if md5(1) and stat(1) have the same syntax as on FreeBSD (or Linux), it can also be used on other BSDs or any other UNIX system. Example output/usage: Code:
% duplicated_files.sh
usage: duplicated_files.sh OPTION DIRECTORY
  OPTIONS: -n   check by name (fast)
           -s   check by size (medium)
           -m   check by md5  (slow)
           -N   same as '-n' but with delete instructions printed
           -S   same as '-s' but with delete instructions printed
           -M   same as '-m' but with delete instructions printed
  EXAMPLE: duplicated_files.sh -s /mnt
Code:
% duplicated_files.sh -m tmp
count: 2 | md5: eb36b88619424b05288a0a8918b822f0
  tmp/segoeuib.ttf
  tmp/test/segoeuib.ttf

count: 3 | md5: 4e1e3521a4396110e59229bed85b0cf9
  tmp/cam/fdd/file.htm
  tmp/cam/gf/file.htm
  tmp/cam/nf7/file.htm
Code:
% duplicated_files.sh -N tmp
count: 2 | file: segoeuil.ttf
  sudo rm -rf "tmp/segoeuil.ttf"
  sudo rm -rf "tmp/test/segoeuil.ttf"

count: 3 | file: file.htm
  sudo rm -rf "tmp/cam/nf7/file.htm"
  sudo rm -rf "tmp/cam/gf/file.htm"
  sudo rm -rf "tmp/cam/fdd/file.htm"
Code:
#! /bin/sh
# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm

__usage() {
  echo "usage: $( basename ${0} ) OPTION DIRECTORY"
  echo "  OPTIONS: -n   check by name (fast)"
  echo "           -s   check by size (medium)"
  echo "           -m   check by md5  (slow)"
  echo "           -N   same as '-n' but with delete instructions printed"
  echo "           -S   same as '-s' but with delete instructions printed"
  echo "           -M   same as '-m' but with delete instructions printed"
  echo "  EXAMPLE: $( basename ${0} ) -s /mnt"
  exit 1
}

__prefix() {
  case $( id -u ) in
    (0) PREFIX="rm -rf" ;;
    (*) case $( uname ) in
          (SunOS) PREFIX="pfexec rm -rf" ;;
          (*)     PREFIX="sudo rm -rf"   ;;
        esac ;;
  esac
}

__crossplatform() {
  case $( uname ) in
    (FreeBSD)
      MD5="md5 -r"
      STAT="stat -f %z"
      ;;
    (Linux)
      MD5="md5sum"
      STAT="stat -c %s"
      ;;
    (SunOS)
      echo "INFO: supported systems: FreeBSD Linux"
      echo
      echo "Porting to Solaris/OpenSolaris"
      echo "  -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
      echo "  -- use digest(1) instead for md5 sum calculation"
      echo "     $ digest -a md5 file"
      echo "  -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
      echo
      exit 1
      ;;                      # this ';;' was missing in the original
    (*)
      echo "INFO: supported systems: FreeBSD Linux"
      exit 1
      ;;
  esac
}

__md5() {
  __crossplatform
  :> ${DUPLICATES_FILE}
  DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
  echo "${DATA}" \
    | awk '{print $1}' \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
      done
  echo "${DATA}" \
    | awk '{print $1}' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "count: ${COUNT} | md5: ${SUM}"
        grep ${SUM} ${DUPLICATES_FILE} \
          | cut -d ' ' -f 2-10000 2> /dev/null \
          | while read LINE
            do
              if [ -n "${PREFIX}" ]
              then
                echo "  ${PREFIX} \"${LINE}\""
              else
                echo "  ${LINE}"
              fi
            done
        echo
      done
  rm -rf ${DUPLICATES_FILE}
}

__size() {
  __crossplatform
  find "${1}" -type f -exec ${STAT} {} ';' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SIZE=$( echo ${LINE} | awk '{print $2}' )
        SIZE_KB=$( echo ${SIZE} / 1024 | bc )
        echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
        if [ -n "${PREFIX}" ]
        then
          find ${1} -type f -size ${SIZE}c -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -type f -size ${SIZE}c -exec echo "  {}" ';'
        fi
        echo
      done
}

__file() {
  __crossplatform
  find "${1}" -type f \
    | xargs -n 1 basename 2> /dev/null \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && break
        FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
        echo "count: ${COUNT} | file: ${FILE}"
        FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
        if [ -n "${PREFIX}" ]
        then
          find ${1} -iname "${FILE}" -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -iname "${FILE}" -exec echo "  {}" ';'
        fi
        echo
      done
}

# main()
[ ${#} -ne 2 ] && __usage
[ ! -d "${2}" ] && __usage

DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"

case ${1} in
  (-n) __file "${2}" ;;
  (-m) __md5  "${2}" ;;
  (-s) __size "${2}" ;;
  (-N) __prefix; __file "${2}" ;;
  (-M) __prefix; __md5  "${2}" ;;
  (-S) __prefix; __size "${2}" ;;
  (*)  __usage ;;
esac
__________________
religions, worst damnation of mankind "If 386BSD had been available when I started on Linux, Linux would probably never had happened." Linus Torvalds Linux is not UNIX! Face it! It is not an insult. It is fact: GNU is a recursive acronym for “GNU's Not UNIX”. vermaden's: links resources deviantart spreadbsd
Being lazy, I tend to just use `diff -ru dir1 dir2`, and only worry about the brevity that checksums would offer when I need to think of a grep string or awk program to filter the results from diff.
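The diff approach can be made to report duplicates directly with the -s flag, which makes diff name identical files explicitly. A minimal sketch, assuming GNU or BSD diff and using a throwaway demo tree under /tmp:

```shell
# Two small trees that share one identical file, then a report of
# only the identical pairs.
# -r recurse, -s report identical files, -q keep differing files brief
mkdir -p /tmp/diffdemo/dir1 /tmp/diffdemo/dir2
echo "same content" > /tmp/diffdemo/dir1/a.txt
echo "same content" > /tmp/diffdemo/dir2/a.txt
echo "only here"    > /tmp/diffdemo/dir1/b.txt

# keep only the "... are identical" lines
diff -rsq /tmp/diffdemo/dir1 /tmp/diffdemo/dir2 | grep 'identical'
```

This only compares two trees against each other, of course, so it complements rather than replaces the script's whole-tree scan.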
lol. EDIT: By the way CS, code like this: Code:
# Automagic slash/backslash conversion doesn't work with pythonpath.
if os.path.isdir('../aragorn'):
    if sys.platform[:3] == 'win':
        sys.path.append('..\\aragorn')
    else:
        sys.path.append('../aragorn')
could just use os.path.join('..', 'aragorn') instead of branching on the platform. At least, assuming you like ease of porting between OSes and easier-to-maintain scripts. (My only real gripe with os.path is the inconsistency of os.path.expandvars(), where the NT module has the best implementation and every other one sucks.)
__________________
My Journal Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest ``foo'' someone someday shall type ``supercalifragilisticexpialidocious''. Last edited by TerryP; 24th April 2010 at 03:35 PM.
Because I forgot to change that one to NOT advertise a fscked-up shitty Polish eBay replacement.
I will try it
thanks
Quote:
Maybe I should add some filter in the middle, before cut, to make it print such names, but the name would be at most only 'close' to the real one, and the rm instructions would not work. Another solution may be to first find all files with incorrect names and print them on the screen, telling the user that we will omit them as long as they have 'bad' characters. It would also be great to have this simplified for directories, reporting that this and that dir are identical, but such a comparison would take ages to compute, and even more if there are 3 or more duplicates. Maybe I will find some nice way to compare directories, but I have absolutely no idea how to exclude nested directories inside already-matched dirs, to avoid useless output like this: Code:
count: 2
  /home/dir1
  /home/backup/dir1

count: 2
  /home/dir1/include
  /home/backup/dir1/include
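One possible post-filter for that nested-directory problem, sketched over the hypothetical output above: keep a duplicate-directory pair only if its first path is not inside an already-kept pair. This assumes one pair per line and that parent directories sort before their children, as find | sort output normally guarantees:

```shell
# Drop pairs whose first path lives inside a pair we already kept.
# The kept[] array holds first paths of surviving pairs.
printf '%s\n' \
  '/home/dir1 /home/backup/dir1' \
  '/home/dir1/include /home/backup/dir1/include' \
  '/home/other /home/backup/other' |
awk '{
  for (i = 1; i <= n; i++)
    if (index($1, kept[i] "/") == 1) next   # nested in a kept pair: skip
  kept[++n] = $1
  print
}'
```

On the demo input this keeps the /home/dir1 and /home/other pairs and drops the nested /home/dir1/include one.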
You could always pipe it into a pager or text editor.
BTW, the md5 variant (-m/-M) takes about 6 hours on 1 TB of data (RAID5 on 3 regular 7200 rpm disks) to check all duplicates. md5 is the slowest mode, so the other modes will be a lot faster.
Hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.
Although I don't understand every last obscure corner of it, I do have some comments/questions. They all really concern __md5(). Code:
| while read LINE
  do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && continue
    SUM=$( echo ${LINE} | awk '{print $2}' )

It seems this could be simplified by letting read split the fields instead of calling awk twice: Code:

| while read COUNT SUM
  do
    [ ${COUNT} -eq 1 ] && continue

The first loop then collects the duplicate lines with: Code:

echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}

The final loop of __md5() again starts with Code:

echo "${DATA}" \

but since the duplicate lines were already collected, couldn't it instead start from: Code:

cat ${DUPLICATES_FILE}

Next, Code:

rm -rf ${DUPLICATES_FILE}

removes a regular file, so plain rm -f would do; the -r isn't needed. Finally, concerning speed: it seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all duplicate files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified.) Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
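The size-first idea can be sketched in portable sh. This is only an illustration, not the script's actual code, and it uses the Linux spellings of stat(1)/md5sum(1); on FreeBSD substitute "stat -f %z" and "md5 -r" as in the original:

```shell
#! /bin/sh
# Sketch: checksum only files whose size is shared with another file.
DIR="${1:-.}"
SIZES="/tmp/sizes.$$"
DUPS="/tmp/dup_sizes.$$"

# "size path" for every regular file
find "${DIR}" -type f -exec stat -c '%s %n' {} ';' > "${SIZES}"

# sizes that occur more than once (uniq -d prints each once)
cut -d ' ' -f 1 "${SIZES}" | sort -n | uniq -d > "${DUPS}"

# md5 only the candidates; identical hashes then sort together
while read SIZE FILE
do
    grep -qx "${SIZE}" "${DUPS}" && md5sum "${FILE}"
done < "${SIZES}" | sort

rm -f "${SIZES}" "${DUPS}"
```

On a tree where most files have unique sizes, this skips the vast majority of the checksum work.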
Quote:
Quote:
Quote:
Quote:
Yes, habits ... but at least it does not do any harm. Quote:
Thanks mate.
Quote:
Quote:
Oh, one other thought came to mind. For your temporary file, it might be a good idea to use mktemp(1).
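A minimal sketch of the mktemp(1) suggestion, with a hypothetical template name; the trailing XXXXXX is replaced by random characters, so the name is unpredictable and the file is created atomically rather than clobbering a fixed /tmp path:

```shell
# Safer temp file than a fixed, predictable name in /tmp.
DUPLICATES_FILE=$( mktemp "/tmp/duplicated_files.XXXXXX" ) || exit 1

# ... use ${DUPLICATES_FILE} exactly as before ...

rm -f "${DUPLICATES_FILE}"
```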
Quote:
When the data set is as large as what Vermaden has hinted at, the speed-up would definitely be worthwhile. Since it's a reasonable postulate that two files will have differing checksums if they have different file sizes, the potential speed-up is tremendous over a large data set. Any file without another file of equal size could be ruled out as a duplicate; then every file that shares its size with other files can be checksummed for the final decision ((sum = sum) = duplicate). It only adds two dilemmas:
Then again, sequentially processing the contents of a directory without first exclusively locking its entire contents (and having the OS enforce those locks) is a similar problem of its own, as is obtaining the locks suitably xD. For Vermaden's purposes, I reckon such concerns are likely of esoteric value only, but the file-size-driven skip list is a great idea.
heheh, thanks, glad you enjoyed it TerryP.
Quote:
Quote:
Quote:
Quote:
There are several ways of implementing the algorithm, but associative-array-style data structures that can map things like sizes to filenames are how most people would likely first engage the problem (e.g. awk/perl thinking). One could actually get away with a simple list, and that can be accomplished in portable sh (if you know what you're doing), but less naturally than most scripters are accustomed to reading. Using an external file could solve it, but unless the data set is large enough to consume several megs of precious server memory, it's probably not worth the extra effort to process it that way (nor the appropriate increase in security consciousness needed when using temp files), although one upside is that logging actions becomes easier. Even if the memory used without resorting to temp files were a real issue, it would probably be better to tune it in other ways (e.g. from sh to C, or Apache to Nginx if it's run on a webserver). Quote:
Yes, there's no complete way around it: most operations can't be guaranteed to be atomic. At best, you can only minimise the probability that external users/daemons step on your toes. If the directory being cleaned isn't, for example, a cache of files downloaded by a web spider, then it isn't too big a problem. If it were such a cache, it might be considered a feature rather than a bug. The race issue is more of an opportunity to enjoy calculating the intellectual implications than a serious impact on the expected problem domain. I'm also paranoid.
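The associative-array approach TerryP describes can be sketched in one awk pass, using the file size as the key. A hypothetical demo (Linux "stat -c %s" spelling) that creates its own test files:

```shell
# One-pass sketch: an associative array keyed on file size collects
# paths; groups with more than one member are printed at the end.
mkdir -p /tmp/awkdemo
printf 'aaaa'  > /tmp/awkdemo/x       # 4 bytes
printf 'bbbb'  > /tmp/awkdemo/y       # 4 bytes -- same size as x
printf 'ccccc' > /tmp/awkdemo/z       # 5 bytes -- unique size

find /tmp/awkdemo -type f -exec stat -c '%s %n' {} ';' |
awk '{
    size = $1
    sub(/^[0-9]+ /, "")               # strip the size; rest is the path
    files[size] = files[size] "  " $0 "\n"
    count[size]++
}
END {
    for (size in count)
        if (count[size] > 1)
            printf "count: %d | size: %s bytes\n%s", count[size], size, files[size]
}'
```

Note this groups by size only (candidate duplicates); a real run would still confirm each group with a checksum, since equal size does not imply equal content.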
... Or you can just use Python/Ruby/Perl/etc ...
Makes things a whole lot easier IMHO. Basic shell scripting is simply too limited for any serious programming, and simple tasks are often easier/faster. If you compare Vermaden's script with mine, for example, you'll see that mine is actually fewer lines while it does the same, is more readable, more portable, and easier to modify should you want to. ... Just my 2c ... And thank you for the tip on os.path, Terry. I actually knew about those functions, and back when I first started with Python I always used them, but I found that in most cases they're not really needed, so I stopped using them and sort of forgot about it ...
< once had the same problem with File::Spec.
TerryP, thanks for your added interesting comments. I think we're
on the same page about the kind of structures that could be used to implement this. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
heheh, thanks, glad you enjoyed it TerryP.
Quote:
Quote:
I also do not intend to write scripts that will work on any OS kind other than UNIX, but this one would also work on Windows; I would only have to install Cygwin (since it does not matter to me whether it's python.exe or cygwin.exe that I must install anyway), but that is probably a matter of preference.