hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.
Although I don't understand every last obscure corner of it, I do have some comments/questions. They all really concern
__md5().
Code:
| while read LINE
  do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && continue
    SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
  do
    [ ${COUNT} -eq 1 ] && continue
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename happens to contain the md5sum of another file. (Perhaps not likely, but you never know what people will do.) It would seem safer to check that the first field matches the sum exactly, rather than grepping the whole line.
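To illustrate with made-up data (the real DATA holds "md5sum filename" lines produced by the script; the sums and names below are just for the demo):

```shell
#!/bin/sh
# Sample data: the second filename happens to contain the first sum.
DATA='d41d8cd9 ./empty.txt
1a2b3c4d ./notes-d41d8cd9.txt'
SUM='d41d8cd9'

# grep matches BOTH lines, because the sum appears in the filename:
echo "${DATA}" | grep ${SUM}

# Matching only the first field avoids the false positive:
echo "${DATA}" | awk -v sum="${SUM}" '$1 == sum'
```

The awk version prints only the `./empty.txt` line, so a sum-lookalike in a filename can never sneak into DUPLICATES_FILE.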
The final loop of __md5() again starts with the same pipeline as above. Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid re-filtering out all the non-duplicates, which you've already done once. Of course, the loop body would need to be adapted for this.
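Something along these lines — a sketch only, since I'm guessing that each DUPLICATES_FILE line looks like "md5sum filename" and that the per-file handling would carry over from your current loop:

```shell
#!/bin/sh
# Sketch: drive the final loop from the already-collected duplicates
# instead of re-deriving them from DATA.  Assumes each line of
# DUPLICATES_FILE is "<md5sum> <filename>".
DUPLICATES_FILE="${1:-/tmp/duplicates}"

sort "${DUPLICATES_FILE}" \
  | while read -r SUM FILE
    do
      # ... existing per-duplicate handling goes here ...
      echo "${SUM} ${FILE}"
    done
```

Sorting groups lines with the same sum together, so consecutive lines sharing a SUM are one duplicate set.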
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here, since it's a plain file rather than a directory.
Finally, concerning speed: it seems you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after a size check? In other words, use file size as a quick hash to find potential duplicates (all duplicate files must have the same size), and then run md5 on only those. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified as duplicates.)
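Roughly like this — a sketch under my own naming, not taken from your script:

```shell
#!/bin/sh
# Sketch of the size-first idea.  All duplicates must share a size,
# so checksum only files whose size occurs more than once.
# Limitation: assumes filenames contain no embedded newlines.
TREE="${1:-.}"

# "<size> <filename>" for every file in the tree
SIZES=$( find "${TREE}" -type f -exec wc -c {} \; )

# sizes that occur for two or more files
DUP_SIZES=$( echo "${SIZES}" | awk '{print $1}' | sort -n | uniq -d )

# checksum only the potential duplicates
echo "${SIZES}" | while read -r SIZE FILE
do
    echo "${DUP_SIZES}" | grep -qxF "${SIZE}" || continue
    md5sum "${FILE}"     # 'md5 -r' prints the same layout on FreeBSD
done
```

On a tree where most files have unique sizes, this skips the md5 pass for nearly everything; only the size-collision candidates get hashed.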
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.