Quote:
Originally Posted by IdOp
hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.
Although I don't understand every last obscure corner of it, I do have some comments/questions.
Thanks, comments are welcome of course
Quote:
Originally Posted by IdOp
They all really concern __md5().
Code:
while read LINE
do
    COUNT=$( echo ${LINE} | awk '{print $1}' )
    [ ${COUNT} -eq 1 ] && continue
    SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
while read COUNT SUM
do
    [ ${COUNT} -eq 1 ] && continue
You are right, I forgot that read is able to read into multiple variables.
Quote:
Originally Posted by IdOp
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename contains the md5sum of another file. (Perhaps not likely to happen, but you never know what people will do.) It would seem safer to check that the first field is right.
Mhmm, I do not get it; what is the problem if a filename contains an MD5 sum? What problems would that cause?
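Just to illustrate the collision IdOp might be describing (a sketch with made-up checksums and filenames; DATA is assumed to hold md5 output in "sum filename" form):

```shell
#!/bin/sh
# Hypothetical DATA: one filename happens to contain another file's
# checksum, so an unanchored grep matches both lines.
DATA='d41d8cd98f00b204e9800998ecf8427e  empty.txt
0123456789abcdef0123456789abcdef  notes-d41d8cd98f00b204e9800998ecf8427e.txt'
SUM='d41d8cd98f00b204e9800998ecf8427e'

echo 'Unanchored grep (matches two lines, one is a false positive):'
echo "${DATA}" | grep ${SUM}

echo 'Anchored to the first field (matches only the real one):'
echo "${DATA}" | grep "^${SUM} "
```

Anchoring the pattern to the start of the line restricts the match to the checksum field, which is what "check that the first field is right" would amount to here.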
Quote:
Originally Posted by IdOp
The final loop of __md5() again starts with the same while read construction as above.
Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid re-filtering out all the non-dups, which you have already done once. Of course, the loop body must be adapted for this.
Mhmm, good tip, I must look at it. I generally try to use things that are already in memory or in variables instead of reading a file from disk again, but in this case your way may be better; I will take a deeper look at that.
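A minimal sketch of what that could look like, assuming DUPLICATES_FILE already contains only the duplicate "sum filename" lines (made-up sums and paths):

```shell
#!/bin/sh
# Hypothetical duplicates file: the earlier passes already removed
# unique checksums, so this loop needs no re-filtering at all.
DUPLICATES_FILE=$(mktemp)
cat > "${DUPLICATES_FILE}" << 'EOF'
aaa111  /tmp/a/file1
aaa111  /tmp/b/file1-copy
EOF

# Loop directly over the pre-filtered file instead of re-scanning DATA.
while read SUM FILE
do
    echo "duplicate: ${FILE} (${SUM})"
done < "${DUPLICATES_FILE}"
```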
Quote:
Originally Posted by IdOp
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here.
Yes, habits ... but at least it does not do any harm.
Quote:
Originally Posted by IdOp
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitessimally to reliability since files with the same md5sum but different sizes would not be identified. )
After messing with backups (mostly my gf's) I found a lot of files that had different names but the same content: often pictures, movies, mp3 files, PDFs and other documents, almost everything. IMHO a separate -s/-S option is enough for checking by size, and the MD5 sum does its job, but it is always nice to know how others see that, thanks.
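For reference, IdOp's size-first idea might be sketched like this (hypothetical file layout; only files whose size occurs more than once would get the md5 pass):

```shell
#!/bin/sh
# Hypothetical tree: two identical files and one unique file.
DIR=$(mktemp -d)
echo "same content" > "${DIR}/a"
echo "same content" > "${DIR}/b"
echo "unique"       > "${DIR}/c"

SIZES=$(mktemp)
# "wc -c" prints "size filename" for each regular file.
find "${DIR}" -type f -exec wc -c {} \; > "${SIZES}"

# Only files whose size occurs more than once can be duplicates,
# so only those would need the (much slower) md5 pass.
while read SIZE FILE
do
    COUNT=$( awk -v s="${SIZE}" '$1 == s' "${SIZES}" | wc -l )
    if [ "${COUNT}" -gt 1 ]
    then
        echo "md5 candidate: ${FILE}"
    fi
done < "${SIZES}"
```

This only narrows the candidate set; the md5 step still decides which same-size files are actually identical.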
Quote:
Originally Posted by IdOp
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
Thanks mate.