26th April 2010
vermaden

Quote:
Originally Posted by IdOp View Post
hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.

Although I don't understand every last obscure corner of it, I do have some comments/questions.
Thanks, comments are welcome of course.

Quote:
Originally Posted by IdOp View Post
They all really concern __md5().

Code:
| while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
      do
        [ ${COUNT} -eq 1 ] && continue
You are right, I forgot that read is able to read into multiple variables.
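Indeed, read splits the line on whitespace and the last variable takes the whole remainder of the line, so with two-field "count sum" lines this is safe. A quick check (just an illustration, not from the script):

Code:
# first word goes to COUNT, the rest of the line to SUM
echo "  2  d41d8cd98f00b204e9800998ecf8427e" | while read COUNT SUM
  do
    echo "COUNT=${COUNT} SUM=${SUM}"
  done
# prints: COUNT=2 SUM=d41d8cd98f00b204e9800998ecf8427e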

Quote:
Originally Posted by IdOp View Post
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename contains the md5sum of another file. (Perhaps not likely to happen, but you never know what people will do.) It would seem safer to check that the first field is right.
Mhmm, I do not quite get it: what exactly is the problem if a filename contains an MD5 sum?
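If the concern is that grep ${SUM} would also match a checksum that happens to appear inside a filename (and not only in the first, checksum field), then comparing just the first field would avoid it. Something like this, assuming ${DATA} holds "sum path" lines:

Code:
# match the checksum field only, not anywhere in the line
echo "${DATA}" | awk -v sum="${SUM}" '$1 == sum' >> ${DUPLICATES_FILE}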

Quote:
Originally Posted by IdOp View Post
The final loop of __md5() again starts with
Code:
echo "${DATA}" \
Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid the need to re-filter out all the non-dup's which you've already done? Of course, the loop content must be adapted for this.
Mhmm, good tip, I must look at it. I generally try to use things that are already in memory or in variables instead of reading a file from disk again, but in this case it can indeed be better. I will take a deeper look at it.
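Something like this then, I suppose, assuming ${DUPLICATES_FILE} holds the "sum path" lines gathered earlier:

Code:
# read the already-filtered file instead of re-filtering ${DATA};
# the redirection also avoids the subshell that a pipe would create
# in most sh implementations
while read SUM FILE
  do
    echo "duplicate (${SUM}): ${FILE}"
  done < ${DUPLICATES_FILE}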

Quote:
Originally Posted by IdOp View Post
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here.
Yes, habits ... but at least it does not do any harm.
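So it becomes just:

Code:
rm -f ${DUPLICATES_FILE}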

Quote:
Originally Posted by IdOp View Post
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified.)
After messing with backups of my gf I mostly found a lot of files that had different names but the same content: often pictures, movies, mp3 files, PDFs, other documents, almost everything. IMHO a separate -s/-S option is enough for checking by size, and the MD5 sum does its job here, but it is always nice to know how others see it, thanks.
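For reference, if I ever try it, I suppose the idea would look roughly like this (just a sketch: ${DIR} and ${SIZES_FILE} are placeholders here, and stat -f / md5 -r are the FreeBSD flavours):

Code:
# 1) list 'size path' for every file, sorted by size
find ${DIR} -type f -exec stat -f '%z %N' {} + | sort -n > ${SIZES_FILE}

# 2) hash only the files whose size occurs more than once,
#    since a file with a unique size cannot have a duplicate
awk 'NR == FNR { count[$1]++; next }
     count[$1] > 1 { sub(/^[0-9]+ /, ""); print }' ${SIZES_FILE} ${SIZES_FILE} \
  | while read FILE
    do
      md5 -r "${FILE}"
    done | sort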

Quote:
Originally Posted by IdOp View Post
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
Thanks mate.