hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.
Although I don't understand every last obscure corner of it, I do have some comments/questions. They all really concern
__md5().
Code:
| while read LINE
  do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && continue
    SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
  do
    [ ${COUNT} -eq 1 ] && continue
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename happens to contain the md5sum of another file. (Perhaps not likely, but you never know what people will do.) It would seem safer to check that the first field matches the sum exactly, rather than grepping the whole line.
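To illustrate with made-up data (the real DATA holds "md5sum filename" lines produced by the script; the sums and names below are just for the demo):

```shell
#!/bin/sh
# Sample data: the second filename happens to contain the first sum.
DATA='d41d8cd9 ./empty.txt
1a2b3c4d ./notes-d41d8cd9.txt'
SUM='d41d8cd9'

# grep matches BOTH lines, because the sum appears in the filename:
echo "${DATA}" | grep ${SUM}

# Matching only the first field avoids the false positive:
echo "${DATA}" | awk -v sum="${SUM}" '$1 == sum'
```

The awk version prints only the `./empty.txt` line, so a sum-lookalike in a filename can never sneak into DUPLICATES_FILE.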
The final loop of __md5() again starts with the same pipeline as above. Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid re-filtering out all the non-duplicates, which you've already done once. Of course, the loop body would need to be adapted for this.
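Something along these lines — a sketch only, since I'm guessing that each DUPLICATES_FILE line looks like "md5sum filename" and that the per-file handling would carry over from your current loop:

```shell
#!/bin/sh
# Sketch: drive the final loop from the already-collected duplicates
# instead of re-deriving them from DATA.  Assumes each line of
# DUPLICATES_FILE is "<md5sum> <filename>".
DUPLICATES_FILE="${1:-/tmp/duplicates}"

sort "${DUPLICATES_FILE}" \
  | while read -r SUM FILE
    do
      # ... existing per-duplicate handling goes here ...
      echo "${SUM} ${FILE}"
    done
```

Sorting groups lines with the same sum together, so consecutive lines sharing a SUM are one duplicate set.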
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here, since it's a plain file rather than a directory.
Finally, concerning speed: it seems you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after a size check? In other words, use file size as a quick hash to find potential duplicates (all duplicate files must have the same size), and then run md5 on only those. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified as duplicates.)
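Roughly like this — a sketch under my own naming, not taken from your script:

```shell
#!/bin/sh
# Sketch of the size-first idea.  All duplicates must share a size,
# so checksum only files whose size occurs more than once.
# Limitation: assumes filenames contain no embedded newlines.
TREE="${1:-.}"

# "<size> <filename>" for every file in the tree
SIZES=$( find "${TREE}" -type f -exec wc -c {} \; )

# sizes that occur for two or more files
DUP_SIZES=$( echo "${SIZES}" | awk '{print $1}' | sort -n | uniq -d )

# checksum only the potential duplicates
echo "${SIZES}" | while read -r SIZE FILE
do
    echo "${DUP_SIZES}" | grep -qxF "${SIZE}" || continue
    md5sum "${FILE}"     # 'md5 -r' prints the same layout on FreeBSD
done
```

On a tree where most files have unique sizes, this skips the md5 pass for nearly everything; only the size-collision candidates get hashed.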
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.