DaemonForums  

  #1
Old 24th April 2010
vermaden
Administrator
 
Join Date: Apr 2008
Location: pl_PL.lodz
Posts: 1,052
Thanked 118 Times in 93 Posts
HOWTO: Find Duplicated Files in Directory Tree

I wrote this script some time ago to find duplicated files; they can be compared by file name, size, or MD5 checksum. Feel free to point out any issues with it.

It works on FreeBSD and Linux and has been partially ported to Solaris, but if md5(1) and stat(1) have the same syntax as on FreeBSD (or Linux), it may also be used on other BSDs or any other UNIX system.

Example output/usage:

Code:
% duplicated_files.sh
usage: duplicated_files.sh OPTION DIRECTORY
  OPTIONS: -n   check by name (fast)
           -s   check by size (medium)
           -m   check by md5  (slow)
           -N   same as '-n' but with delete instructions printed
           -S   same as '-s' but with delete instructions printed
           -M   same as '-m' but with delete instructions printed
  EXAMPLE: duplicated_files.sh -s /mnt
Code:
% duplicated_files.sh -m tmp
count: 2 | md5: eb36b88619424b05288a0a8918b822f0
  tmp/segoeuib.ttf
  tmp/test/segoeuib.ttf

count: 3 | md5: 4e1e3521a4396110e59229bed85b0cf9
  tmp/cam/fdd/file.htm
  tmp/cam/gf/file.htm
  tmp/cam/nf7/file.htm
Code:
% duplicated_files.sh -N tmp
count: 2 | file: segoeuil.ttf
  sudo rm -rf "tmp/segoeuil.ttf"
  sudo rm -rf "tmp/test/segoeuil.ttf"

count: 3 | file: file.htm
  sudo rm -rf "tmp/cam/nf7/file.htm"
  sudo rm -rf "tmp/cam/gf/file.htm"
  sudo rm -rf "tmp/cam/fdd/file.htm"
duplicated_files.sh
Code:
#! /bin/sh

# find duplicated files in directory tree
# comparing by file NAME, SIZE or MD5 checksum
# --------------------------------------------
# LICENSE(s): BSD / CDDL
# --------------------------------------------
# vermaden [AT] interia [DOT] pl
# http://strony.toya.net.pl/~vermaden/links.htm

__usage() {
  echo "usage: $( basename ${0} ) OPTION DIRECTORY"
  echo "  OPTIONS: -n   check by name (fast)"
  echo "           -s   check by size (medium)"
  echo "           -m   check by md5  (slow)"
  echo "           -N   same as '-n' but with delete instructions printed"
  echo "           -S   same as '-s' but with delete instructions printed"
  echo "           -M   same as '-m' but with delete instructions printed"
  echo "  EXAMPLE: $( basename ${0} ) -s /mnt"
  exit 1
  }

__prefix() {
  case $( id -u ) in
    (0) PREFIX="rm -rf" ;;
    (*) case $( uname ) in
          (SunOS) PREFIX="pfexec rm -rf" ;;
          (*)     PREFIX="sudo rm -rf"   ;;
        esac
        ;;
  esac
  }

__crossplatform() {
  case $( uname ) in
    (FreeBSD)
      MD5="md5 -r"
      STAT="stat -f %z"
      ;;
    (Linux)
      MD5="md5sum"
      STAT="stat -c %s"
      ;;
    (SunOS)
      echo "INFO: supported systems: FreeBSD Linux"
      echo
      echo "Porting to Solaris/OpenSolaris"
      echo "  -- provide values for MD5/STAT in '$( basename ${0} ):__crossplatform()'"
      echo "  -- use digest(1) instead for md5 sum calculation"
      echo "       $ digest -a md5 file"
      echo "  -- pfexec(1) is already used in '$( basename ${0} ):__prefix()'"
      echo
      exit 1
      ;;
    (*)
      echo "INFO: supported systems: FreeBSD Linux"
      exit 1
      ;;
  esac
  }

__md5() {
  __crossplatform
  :> ${DUPLICATES_FILE}
  DATA=$( find "${1}" -type f -exec ${MD5} {} ';' | sort -n )
  echo "${DATA}" \
    | awk '{print $1}' \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
      done

  echo "${DATA}" \
    | awk '{print $1}' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
        echo "count: ${COUNT} | md5: ${SUM}"
        grep ${SUM} ${DUPLICATES_FILE} \
          | cut -d ' ' -f 2-10000 2> /dev/null \
          | while read LINE
            do
              if [ -n "${PREFIX}" ]
              then
                echo "  ${PREFIX} \"${LINE}\""
              else
                echo "  ${LINE}"
              fi
            done
        echo
      done
  rm -rf ${DUPLICATES_FILE}
  }

__size() {
  __crossplatform
  find "${1}" -type f -exec ${STAT} {} ';' \
    | sort -n \
    | uniq -c \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SIZE=$( echo ${LINE} | awk '{print $2}' )
        SIZE_KB=$( echo ${SIZE} / 1024 | bc )
        echo "count: ${COUNT} | size: ${SIZE_KB}KB (${SIZE} bytes)"
        if [ -n "${PREFIX}" ]
        then
          find ${1} -type f -size ${SIZE}c -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -type f -size ${SIZE}c -exec echo "  {}" ';'
        fi
        echo
      done
  }

__file() {
  __crossplatform
  find "${1}" -type f \
    | xargs -n 1 basename 2> /dev/null \
    | tr '[A-Z]' '[a-z]' \
    | sort -n \
    | uniq -c \
    | sort -n -r \
    | while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && break
        FILE=$( echo ${LINE} | cut -d ' ' -f 2-10000 2> /dev/null )
        echo "count: ${COUNT} | file: ${FILE}"
        FILE=$( echo ${FILE} | sed -e s/'\['/'\\\['/g -e s/'\]'/'\\\]'/g )
        if [ -n "${PREFIX}" ]
        then
          find ${1} -iname "${FILE}" -exec echo "  ${PREFIX} \"{}\"" ';'
        else
          find ${1} -iname "${FILE}" -exec echo "  {}" ';'
        fi
        echo
      done 
  }

# main()

[ ${#} -ne 2  ] && __usage
[ ! -d "${2}" ] && __usage

DUPLICATES_FILE="/tmp/$( basename ${0} )_DUPLICATES_FILE.tmp"

case ${1} in
  (-n)           __file "${2}" ;;
  (-m)           __md5  "${2}" ;;
  (-s)           __size "${2}" ;;
  (-N) __prefix; __file "${2}" ;;
  (-M) __prefix; __md5  "${2}" ;;
  (-S) __prefix; __size "${2}" ;;
  (*)  __usage ;;
esac
__________________
religions, worst damnation of mankind
"If 386BSD had been available when I started on Linux, Linux would probably never had happened." Linus Torvalds

Linux is not UNIX! Face it! It is not an insult. It is fact: GNU is a recursive acronym for “GNU's Not UNIX”.
vermaden's: links resources deviantart spreadbsd
  #2
Old 24th April 2010
Carpetsmoker
Real Name: Martin
Old man from scene 24
 
Join Date: Apr 2008
Location: Eindhoven, Netherlands
Posts: 2,068
Thanked 198 Times in 156 Posts

Here's a python script I wrote some time ago that does the same.

I had a customer with a digital camera who kept copying the same files over and over again to her hard disk; she didn't quite seem to understand that you need to delete the pictures from the camera after you've copied them ...
Communicating with her was difficult, not just because of her complete cluelessness when it came to computers, but also because she was far from intelligible, owing to a heavy Flemish accent and a lack of dentures.

Anyway, I needed to clean up the mess, which is how this script was created

I have only used this on Windows, by the way; I don't think I've tried it on BSD or the like, but it should work.

Code:
#!/usr/bin/env python
#
# Copyright (c) 2009, Martin Tournoij <mtournoij@aragorn.nl>
#
# Aragorn Computers & Automatisering
# http://www.aragorn.nl/
#
# Check for duplicate entries and remove them based on SHA256 hash.
#

import getopt
import hashlib
import os
import pprint
import sys

# Automagic slash/backslash conversion doesn't work with pythonpath.
if os.path.isdir('../aragorn'):
	if sys.platform[:3] == 'win':
		sys.path.append('..\\aragorn')
	else:
		sys.path.append('../aragorn')

import aragorn

def Usage():
	print "%s [-hpt]" % sys.argv[0]
	print ""
	print "\t-h\tHelp"
	print "\t-p\tPath to dir to check for duplicates."
	print "\t-t\tPath to use as 'trash bin' do not use a subdir of -p"
	print ""

def GetTree(dir, prev, dlist, flist, error, size, verbose=None):
	"""
	Get list of files/dirs recursively
	"""
	try:
		for f in os.listdir(os.path.join(dir, prev)):
			path = os.path.join(prev, f)
			if os.path.isdir(os.path.join(dir, path)):
				if verbose:
					print "Adding directory `%s'" % path
				dlist.append(path)
				GetTree(dir, path, dlist, flist, error, size, verbose)
			else:
				try:
					size[0] += os.path.getsize(os.path.join(dir, path))
					if verbose:
						print "Adding file `%s'" % path
					flist.append(path)
				except:
					if verbose:
						print "Error adding file `%s'" % path
					error.append([path, sys.exc_info()[1]])
	except:
		error.append([path, sys.exc_info()[1]])
		print "Error adding directory `%s'" % path

	return dlist, flist, error, size

if __name__ == '__main__':
	try:
		options, arguments = getopt.getopt(sys.argv[1:], 'hp:t:')
	except getopt.GetoptError:
		msg, opt = sys.exc_info()[1]
		print msg
		print ""
		Usage()
		aragorn.MyExit(1)

	optDict = {
		'path': 'c:/images/',
		'trash': 'c:/trash/'
	}

	for opt, arg in options:
		if opt == '-h':
			Usage()
			aragorn.MyExit(0)
		if opt == '-p':
			optDict['path'] = arg
		if opt == '-t':
			optDict['trash'] = arg

	if not os.path.exists(optDict['path']):
		print "Dir to check `%s' does not exist." % optDict['path']
		aragorn.MyExit(1)

	aragorn.MakeDir(optDict['trash'])

	dlist, flist, error, size = GetTree(optDict['path'], '', [], [], [], [0])

	hashdict = {}
	for fname in flist:
		sha = aragorn.SHA256("%s/%s" % (optDict['path'], fname))
		if sha not in hashdict:
			hashdict[sha] = [fname]
		else:
			hashdict[sha].append(fname)

	pprint.pprint(hashdict)


for (hash, samefiles) in hashdict.iteritems():
	# Keep first item, move rest to trash
	samefiles.pop(0)

	for f in samefiles:
		src = "%s/%s" % (optDict['path'], f)
		dst = "%s/%s" % (optDict['trash'], f)

		try:
			os.rename(src, dst)
		except OSError:
			print "Error renaming `%s' to `%s'" % (src, dst)
__________________
UNIX was not designed to stop you from doing stupid things, because that would also stop you from doing clever things.
  #3
Old 24th April 2010
TerryP
Arp Constable
 
Join Date: May 2008
Location: USofA
Posts: 1,547
Thanked 112 Times in 104 Posts

Being lazy, I tend to just use `diff -ru dir1 dir2`, and worry about the brevity that using checksums would offer only when I need to think up a grep string or awk program to filter the results from diff.

lol.



EDIT: by the way CS, code like this:

Code:
# Automagic slash/backslash conversion doesn't work with pythonpath.
if os.path.isdir('../aragorn'):
	if sys.platform[:3] == 'win':
		sys.path.append('..\\aragorn')
	else:
		sys.path.append('../aragorn')
Should never be necessary; just use the os.path module properly! Toggles over / and \ are utterly pointless; generally you should never use either in a file path directly. Both from the perspective of writing a program and from that of writing os.path, it is very bad design. This is why things like os.path.sep and numerous path manipulation functions exist: so we don't have to write our own.

At least, assuming you like ease of porting between OSes, and easier to maintain scripts.
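For example, a minimal sketch of the portable version of that snippet (the '../aragorn' path comes from the script above; the rest is illustrative):

```python
import os
import sys

# os.path.join picks the right separator for the platform ('\\' on
# Windows, '/' elsewhere), so no sys.platform branch is needed.
module_dir = os.path.join(os.pardir, 'aragorn')

if os.path.isdir(module_dir):
    sys.path.append(module_dir)
```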


(My only real gripe with os.path is the inconsistency of os.path.expandvars(), where the NT module has the best implementation and every other one sucks.)
__________________
My Journal

Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest ``foo'' someone someday shall type ``supercalifragilisticexpialidocious''.

Last edited by TerryP; 24th April 2010 at 03:35 PM.
  #4
Old 24th April 2010
IdOp
Too dumb for a smartphone
 
Join Date: May 2008
Location: twisting on the daemon's fork(2)
Posts: 563
Thanked 14 Times in 13 Posts

Quote:
Code:
count: 3 | file: allegro.htm
  sudo rm -rf "tmp/cam/nf7/file.htm"
  sudo rm -rf "tmp/cam/gf/file.htm"
  sudo rm -rf "tmp/cam/fdd/file.htm"
Why does it say allegro.htm ?
  #5
Old 25th April 2010
vermaden

Quote:
Originally Posted by IdOp View Post
Why does it say allegro.htm ?
Because I forgot to change that one to NOT advertise the fscked-up, shitty Polish eBay replacement.
  #6
Old 25th April 2010
ahosam
New User
 
Join Date: Apr 2010
Posts: 1
Thanked 0 Times in 0 Posts

I will try it
thanks
  #7
Old 26th April 2010
IdOp

Quote:
Originally Posted by vermaden View Post
Because I forgot to change that one to NOT advertise fscked up shitty polish eBay replacement.
Ah, ok. Glad it's not a bug. (That kind of thing can be hard to see after staring at it long enough.)
  #8
Old 26th April 2010
vermaden

Quote:
Originally Posted by IdOp View Post
Ah, ok. Glad it's not a bug. (That kind of thing can be hard to see after staring at it long enough.)
After using this script for a while, I have encountered only one 'unwanted' behaviour: when a file name uses some strange characters (after encoding problems, etc.), the name of that file is not printed, and cut(1) yells about an incorrect byte sequence.

Maybe I should add some filter in the middle, before cut, to make it print the name, but the name would be at most 'close' to the real one, and the printed delete instructions would not work.

Another solution may be to first find all files with incorrect names and print them on the screen, telling the user that we will omit them as long as they have 'bad' characters.

It would also be great to have it simplified for directories (reporting that this and that dir are identical), but such a comparison would take ages to compute, and even more if there are 3 or more duplicates. Maybe I will find some nice way to compare directories ... but I have absolutely no idea how to exclude deeper directories that live inside already-matched dirs, to avoid producing useless output like this:

Code:
count: 2
  /home/dir1
  /home/backup/dir1

count: 2
  /home/dir1/include
  /home/backup/dir1/include
Generally it's a little PITA reading the output of that script for about 1TB of data, but I do not currently have any ideas on how to improve it
  #9
Old 26th April 2010
TerryP

You could always pipe it into a pager or text editor.
Old 26th April 2010
vermaden

BTW, the md5 sum variant (-m/-M) takes about 6 hours on 1TB of data (RAID5 on 3 regular 7200 rpm disks) to check all duplicates (md5 is the slowest mode, so the other modes will be a lot faster).
Old 26th April 2010
IdOp

Hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.

Although I don't understand every last obscure corner of it, I do have some comments/questions. They all concern __md5().

Code:
| while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
      do
        [ ${COUNT} -eq 1 ] && continue
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename contains the md5sum of another file. (Perhaps not likely to happen, but you never know what people will do.) It would seem safer to check that the first field is right.
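To illustrate (with fabricated checksums and a hypothetical file name that embeds another file's sum), anchoring the pattern to the first field avoids the false hit:

```shell
# Two fabricated lines in "checksum  path" format; the second file's
# NAME contains the first file's md5 sum.
DATA='4a26b9aa1ba28b018d5b427a16c0e1f8  tmp/a.bin
deadbeefdeadbeefdeadbeefdeadbeef  tmp/XYZ_4a26b9aa1ba28b018d5b427a16c0e1f8_.html'
SUM='4a26b9aa1ba28b018d5b427a16c0e1f8'

# Unanchored: 2 hits, including the false one inside the file name.
UNANCHORED=$( echo "${DATA}" | grep -c "${SUM}" )

# Anchored to the first field: only the real checksum line matches.
ANCHORED=$( echo "${DATA}" | grep -c "^${SUM} " )

echo "unanchored=${UNANCHORED} anchored=${ANCHORED}"
```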

The final loop of __md5() again starts with
Code:
echo "${DATA}" \
Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid the need to re-filter out all the non-dup's which you've already done? Of course, the loop content must be adapted for this.

Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here.

Finally, concerning speed: it seems you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to find potential duplicates (all duplicate files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified. )
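A rough sketch of that two-pass idea, separate from vermaden's script; it assumes GNU stat(1)/md5sum(1) syntax, so on FreeBSD you would substitute `stat -f` and `md5 -r` invocations as in __crossplatform(). File names containing '|' or newlines would break the field splitting:

```shell
# Pass 1: emit "size|name" records and keep only files whose size
# occurs more than once in the tree.
__candidates_by_size() {
  find "${1}" -type f -exec stat -c '%s|%n' {} ';' \
    | awk -F '|' '
        { size[NR] = $1; name[NR] = $2; count[$1]++ }
        END { for (i = 1; i <= NR; i++)
                if (count[size[i]] > 1) print name[i] }
      '
  }

# Pass 2: md5 only the candidates and print the sums that repeat.
__dup_sums() {
  __candidates_by_size "${1}" \
    | while IFS= read -r FILE
      do
        md5sum "${FILE}"
      done \
    | awk '{ count[$1]++ }
           END { for (s in count) if (count[s] > 1) print s }'
  }
```

Feeding only same-size files to md5sum skips hashing every unique-size file, which is most of a typical tree.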

Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
Old 26th April 2010
vermaden

Quote:
Originally Posted by IdOp View Post
hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.

Although I don't understand every last obscure corner of it, I do have some comments/questions.
Thanks, comments are welcome of course

Quote:
Originally Posted by IdOp View Post
They all really concern __md5().

Code:
| while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
      do
        [ ${COUNT} -eq 1 ] && continue
You are right; I forgot that read is able to read multiple variables

Quote:
Originally Posted by IdOp View Post
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename contains the md5sum of another file. (Perhaps not likely to happen, but you never know what people will do.) It would seem safer to check that the first field is right.
Mhmm, I do not get it; what is the problem if a file contains an MD5 sum?

Quote:
Originally Posted by IdOp View Post
The final loop of __md5() again starts with
Code:
echo "${DATA}" \
Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid the need to re-filter out all the non-dup's which you've already done? Of course, the loop content must be adapted for this.
Mhmm, good tip; I must look at it. I generally try to use things that are already in memory or in other variables instead of reading a file from disk, but in this case it may be better. I must take a deeper look at that.

Quote:
Originally Posted by IdOp View Post
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here.
Yes, habits

... but at least it does not do any harm

Quote:
Originally Posted by IdOp View Post
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitessimally to reliability since files with the same md5sum but different sizes would not be identified. )
After messing with my gf's backups, I mostly found a lot of files that had different names but the same content: often pictures, movies, mp3 files, PDFs, other documents, almost everything. IMHO the separate -s/-S option is enough for checking by size, and the md5 sum does its job, but it's always nice to know how others see it, thanks.

Quote:
Originally Posted by IdOp View Post
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
Thanks mate.
Old 26th April 2010
IdOp

Quote:
Originally Posted by vermaden View Post
Mhmm, I do not get it, what is the problem if a file contains an MD5 sum, what problems?
Perhaps I didn't explain it well. The problem I (think I) see is if a filename contains an md5sum -- note: the file's name, not the content of the file! So for example, if you had a file called XYZ_4a26b9aa1ba28b018d5b427a16c0e1f8_.html, where the name of that file contains the md5sum 4a26b9aa1ba28b018d5b427a16c0e1f8 for another file, then the grep would pick it up by mistake; unlikely, perhaps, but not what you want. Maybe you could put a ^ on the grep pattern, or use awk?

Quote:
Thanks mate.
You're welcome.

Oh, one other thought came to mind. For your temporary file, it might be a good idea to use mktemp(1).
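For instance (a sketch; the template form below works with GNU and FreeBSD mktemp at least):

```shell
# mktemp(1) creates the file atomically with an unpredictable name,
# avoiding clobbering and symlink races on a fixed /tmp path.
DUPLICATES_FILE=$( mktemp -t duplicated_files.XXXXXX ) || exit 1

# Clean up even if the script is interrupted.
trap 'rm -f "${DUPLICATES_FILE}"' EXIT

echo "test" > "${DUPLICATES_FILE}"
```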
Old 26th April 2010
TerryP

Quote:
Originally Posted by IdOp View Post
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitessimally to reliability since files with the same md5sum but different sizes would not be identified. )
That is actually a refreshing perspective, IdOp: you've brought a smile to this programmer's heart.

When the data set is as large as what vermaden has hinted at, the speed-up would definitely be worthwhile. Since it is a reasonable postulate that two files will have differing checksums if they have different file sizes, the potential speed-up is tremendous over a large data set. Any file without another file of equal size can be ruled out as a duplicate; then every file that shares a size with other files can be checksummed for the final decision ((sum=sum)=duplicate).

It only adds two dilemmas:
  • First, although implementing that algorithm is not likely to be hard, it is more naturally done using hashes (as in ksh) than with the usual portable sh trick of treating a scalar $variable as a list of words, which can be manipulated using filters and variable=`assignments` (or $() if a modern sh is guaranteed; older sh's only supported ``). Lisp is quite a bit better at list processing than generic shell scripting.
  • Second, there is an obvious race condition that can cause files not to be deleted. I.e., if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at the right point in time, it can be done in such a way that it won't be checksummed along with its older peers.

Then again, sequentially processing the contents of a directory without first exclusively locking the entire contents of the directory (and having the OS enforce those locks) is a similar problem of its own, as is obtaining the locks suitably xD.

For vermaden's purposes, I reckon such concerns are likely of esoteric value only; but the file-size-driven skip list is a great idea.
Old 27th April 2010
IdOp

Quote:
Originally Posted by TerryP View Post
... brought a smile to this programmers heart.
heheh, thanks, glad you enjoyed it TerryP.

Quote:
It only adds two dilemmas: [*]First, although implementing that algorithm is not likely to be hard, it is more naturally done using hashes (as in ksh) than with the usual portable sh trick of treating a scalar $variable as a list of words, which can be manipulated using filters and variable=`assignments` (or $() if a modern sh is guaranteed; older sh's only supported ``). Lisp is quite a bit better at list processing than generic shell scripting.
I'm not too familiar with that aspect of ksh, nor am I awake enough to absorb all of this comment at the moment; I also didn't think much about how to do it. That said, one vague thought was to put the output of the size step into a file, suitably formatted for easy use by the second md5 step. The file is probably cached by the OS anyway, and this is probably a case where the algorithm matters more than the hardware. I guess you might also do something recursively, which maybe is included in your view?

Quote:
[*]Second, there is an obvious race condition that can cause files not to be deleted. I.e., if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at the right point in time, it can be done in such a way that it won't be checksummed along with its older peers.
Wouldn't that kind of problem be there anyway? Someone could create or delete a file while find was looking over the tree, say? Just a thought; I don't know nearly enough about such things to be sure. Of course, adding a second step might make the problem worse, yet if the whole thing is faster than a lengthy md5sum on the tree ... but, yes, if such problems exist then it's caveat emptor for the script user.

Quote:
For Vermdens purposes, I reckon such concerns are likely of esoteric value only: but the file size driven skip list idea is a great idea.
Thanks again, I'm glad if it seems like a good idea.
Old 27th April 2010
TerryP

Quote:
Originally Posted by IdOp View Post
I'm not too familiar with that aspect of ksh, nor am I awake enough to absorb all of this comment at the moment; also I didn't think much about how to do it. That said, one vague thought was to put the output of the size step into a file, suitably formatted for easy use by the second md5 step. The file is probably cached by the OS anyway, and this is probably a case where the algorithm is more important than the hardware. I guess you might also do something recursively, which maybe is included in your view?
Generally I skip using bash/ksh features like that when possible, because needing them is usually a warning sign that shell script isn't ideal. But in ksh and bash it's not hard. I can't remember what NetBSD's /bin/sh is, but OpenBSD at least provides a nice Korn shell.

There are several ways of implementing the algorithm, but associative-array style data structures that can map things like sizes to filenames are how most people would likely first engage the problem (e.g. awk/perl thinking). One could actually get away with a simple list, and that can be easily accomplished in portable sh (if you actually know what you're doing), but less naturally than most scripters tend to be accustomed to reading.
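A toy example of that awk thinking, using made-up "size name" records; awk's associative arrays provide the size-to-names map even under a plain Bourne sh:

```shell
# Map each size to the names sharing it, then print only the sizes
# held by more than one file (the checksum candidates).
DUPS=$( printf '%s\n' '5 a.txt' '5 b.txt' '9 c.txt' \
  | awk '{ names[$1] = names[$1] $2 " " }
         END { for (s in names)
                 if (split(names[s], parts, " ") > 1)
                   print "size " s ":", names[s] }' )
echo "${DUPS}"
```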

Using an external file could solve it, but unless the data set is large enough to consume several megs of precious server memory, it's probably not worth the extra effort to process it that way (nor the appropriate increase in security consciousness from having to use temp files). One upside, though, is that logging actions becomes easier that way. Even if the memory used without resorting to temp files were a real issue, it would probably be better to tune it in other ways (e.g. moving from sh to C, or from Apache to Nginx if it's run on a webserver).


Quote:
Originally Posted by IdOp View Post
Wouldn't that kind of problem by there anyway? Someone could create or delete a file while find was looking over the tree, say? Just a thought, I don't know nearly enough about such things to be sure. Of course, adding a second step might make the problem worse, yet if the whole thing is faster than a lengthy md5sum on the tree ... but, yes, if such problems exist then it's caveat emptor to the script user.

Yes, there's no complete way around it: most operations can't be guaranteed to be atomic. At best, you can only minimise the probability that external users/daemons step on your toes. If the directory being cleaned isn't, for example, a cache of files downloaded by a web spider, then it isn't too big a problem. If it were such a cache, it might be considered a feature rather than a bug.

The race issue is more something whose intellectual implications are fun to calculate than a serious impact on the expected problem domain. I'm also paranoid
Old 27th April 2010
Carpetsmoker

... Or you can just use Python/Ruby/Perl/etc ...
Makes things a whole lot easier, IMHO. Basic shell scripting is simply too limited for any serious programming, and simple tasks are often easier/faster.

If you compare vermaden's script with mine, for example, you'll see that mine is actually fewer lines while doing the same thing, and it is more readable, more portable, and easier to modify should you want to.

... Just my 2c ...

And thank you for the tip on os.path, Terry. I actually knew about those functions, and back when I first started with Python I always used them, but I found that in most cases they're not really needed, so I stopped using them and sort of forgot about them ...
__________________
UNIX was not designed to stop you from doing stupid things, because that would also stop you from doing clever things.
Reply With Quote
Old 27th April 2010
TerryP's Avatar
TerryP TerryP is offline
Arp Constable
 
Join Date: May 2008
Location: USofA
Posts: 1,547
Thanked 112 Times in 104 Posts
Default

I once had the same problem with File::Spec.
__________________
My Journal

Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest ``foo'' someone someday shall type ``supercalifragilisticexpialidocious''.
Reply With Quote
Old 27th April 2010
IdOp's Avatar
IdOp IdOp is offline
Too dumb for a smartphone
 
Join Date: May 2008
Location: twisting on the daemon's fork(2)
Posts: 563
Thanked 14 Times in 13 Posts
Default

TerryP, thanks for your added interesting comments. I think we're on the same page about the kind of structures that could be used to implement this.

Quote:
Originally Posted by TerryP
Can't remember what NetBSDs /bin/sh is ...
It's a Bourne shell type sh, while a ksh is also part of the base.

Quote:
I'm also paranoid
Then you may be a survivor.
Reply With Quote
Old 27th April 2010
vermaden's Avatar
vermaden vermaden is offline
Administrator
 
Join Date: Apr 2008
Location: pl_PL.lodz
Posts: 1,052
Thanked 118 Times in 93 Posts
Default

Quote:
Originally Posted by IdOp View Post
Perhaps I didn't explain it well. The problem I (think I) see is if a filename contains an md5sum -- note: the file's name, not the content of the file! So for example, if you had a file called XYZ_4a26b9aa1ba28b018d5b427a16c0e1f8_.html, where the name of that file contains the md5sum 4a26b9aa1ba28b018d5b427a16c0e1f8 for another file, then the grep would pick it up by mistake; unlikely, perhaps, but not what you want. Maybe you could put a ^ on the grep pattern, or use awk?
Now I understand; yes, ^ should be there. todo++
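A quick illustration of the false match IdOp describes (the file names and checksum list are hypothetical; the list uses one "CHECKSUM PATH" line per file, as the script's grep would see it):

```shell
#!/bin/sh
# Hypothetical checksum list: the second PATH happens to contain
# the first file's md5 checksum in its NAME.
LIST=$( mktemp "${TMPDIR:-/tmp}/list.XXXXXX" ) || exit 1
cat > "${LIST}" << 'EOF'
4a26b9aa1ba28b018d5b427a16c0e1f8 /tmp/other.html
d41d8cd98f00b204e9800998ecf8427e /tmp/XYZ_4a26b9aa1ba28b018d5b427a16c0e1f8_.html
EOF

SUM=4a26b9aa1ba28b018d5b427a16c0e1f8
grep -c  "${SUM}" "${LIST}"   # 2 -- the file NAME matches too
grep -c "^${SUM}" "${LIST}"   # 1 -- anchored: only the checksum column
rm -f "${LIST}"
```

Anchoring with ^ limits the match to the checksum column, so a checksum embedded in a file name can no longer cause a spurious duplicate.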

Quote:
Originally Posted by IdOp View Post
Oh, one other thought came to mind. For your temporary file, it might be a good idea to use mktemp(1).
Thanks, maybe I will also add that 'feature'.
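For reference, a minimal sketch of what the mktemp(1) approach could look like (the template name is just an example; the explicit `${TMPDIR:-/tmp}/name.XXXXXX` template works with both the BSD and GNU mktemp):

```shell
#!/bin/sh
# Create a unique, unpredictable temp file instead of a fixed /tmp
# name, and remove it when the script exits, even on interrupt.
TMP=$( mktemp "${TMPDIR:-/tmp}/duplicated_files.XXXXXX" ) || exit 1
trap 'rm -f "${TMP}"' EXIT INT TERM

# ... write the checksum data into "${TMP}" instead of a fixed path ...
echo "checksums go here" > "${TMP}"
```

Besides avoiding collisions between concurrent runs, this closes the classic symlink attack on predictable /tmp names.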

Quote:
Originally Posted by TerryP View Post
When the data set is as large as what Vermaden has hinted at, the speed-up would definitely be worthwhile. Since it's a reasonable postulate that two files will have differing checksums if they have different file sizes, the potential speed-up is tremendous over a large data set. Any file without another file of equal size could be ruled out as a duplicate; then every file that shares its size with other files can be checksummed for the final decision ((sum=sum)=duplicate).
Maybe I will use the idea, since it will really speed things up.
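A rough sketch of that two-pass idea. It uses wc(1) for the sizes so it runs unchanged on BSD and Linux; stat(1) would be faster but its syntax differs (`-f '%z'` on FreeBSD vs `-c '%s'` with GNU coreutils), and `md5sum` here stands in for FreeBSD's `md5 -r`:

```shell
#!/bin/sh
# Pass 1: record "SIZE PATH" for every file under ${DIR}.
# Pass 2: checksum only files whose size occurs more than once;
#         files with a unique size can never have a duplicate.
DIR="${1:-.}"
SIZES=$( mktemp "${TMPDIR:-/tmp}/sizes.XXXXXX" ) || exit 1
trap 'rm -f "${SIZES}"' EXIT

find "${DIR}" -type f | while read -r FILE
do
  printf '%s %s\n' "$( wc -c < "${FILE}" | tr -d ' \t' )" "${FILE}"
done | sort -n > "${SIZES}"

awk '{ print $1 }' "${SIZES}" | uniq -d | while read -r SIZE
do
  grep "^${SIZE} " "${SIZES}" | while read -r SIZE FILE
  do
    md5sum "${FILE}"    # only same-size candidates ever get hashed
  done
done
```

On a tree where most files have unique sizes, almost nothing gets read in full, which is where the tremendous speed-up comes from.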

Quote:
Second is an obvious race condition that can cause files not to be deleted, i.e. if all files of size X have been enqueued for checksumming, and something/someone creates another file of size X at the right point in time, it can be done in such a way that it won't be checksummed along with its older peers.
It's for an offline backup, so it's static; no changes happen here. But that is only for my case. In a production environment we can temporarily make the directory tree being checked READ-ONLY.


Quote:
Originally Posted by IdOp View Post
heheh, thanks, glad you enjoyed it TerryP.

Quote:
Originally Posted by Carpetsmoker View Post
... Or you can just use Python/Ruby/Perl/etc ...
Makes things a whole lot easier IMHO. Basic shell scripting is simply too limited for any serious programming, and simple tasks are often easier/faster.
I need to learn Python some day, but currently the Oracle 11g database is on my todo list (since I participate in the Workshop I and II trainings through my job).

Quote:
Originally Posted by Carpetsmoker View Post
If you compare Vermaden's script with mine for example you'll see that mine is actually fewer lines while it does the same, is more readable, more portable, and easier to modify should you want to.
... but mine provides 3 methods to compare files; if you strip comparing by name and size and remove the comments from both, mine will end up even a little smaller/shorter.

I also do not intend to write scripts that will work on any OS other than UNIX, but this one would also work on Windows; I would only have to install Cygwin (since it does not matter to me whether it's python.exe or cygwin.exe that I must install anyway), but that is probably a matter of preference.
__________________
religions, worst damnation of mankind
"If 386BSD had been available when I started on Linux, Linux would probably never had happened." Linus Torvalds

Linux is not UNIX! Face it! It is not an insult. It is fact: GNU is a recursive acronym for “GNU's Not UNIX”.
vermaden's: links resources deviantart spreadbsd
Reply With Quote