26th April 2010
vermaden

Quote:
Originally Posted by IdOp View Post
hi vermaden, thanks for posting your script. I finally had a chance to take a quick look at it and learned some things from it.

Although I don't understand every last obscure corner of it, I do have some comments/questions.
Thanks, comments are welcome of course.

Quote:
Originally Posted by IdOp View Post
They all really concern __md5().

Code:
| while read LINE
      do
        COUNT=$( echo ${LINE} | awk '{print $1}' )
        [ ${COUNT} -eq 1 ] && continue
        SUM=$( echo ${LINE} | awk '{print $2}' )
The above, which occurs twice, could be more simply replaced with
Code:
| while read COUNT SUM
      do
        [ ${COUNT} -eq 1 ] && continue
You are right, I forgot that read is able to read into multiple variables.
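Indeed, read splits the line on whitespace and the last variable takes the whole remainder of the line, so with two-field "count sum" lines this is safe. A quick check (just an illustration, not from the script):

Code:
# first word goes to COUNT, the rest of the line to SUM
echo "  2  d41d8cd98f00b204e9800998ecf8427e" | while read COUNT SUM
  do
    echo "COUNT=${COUNT} SUM=${SUM}"
  done
# prints: COUNT=2 SUM=d41d8cd98f00b204e9800998ecf8427e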

Quote:
Originally Posted by IdOp View Post
Next,
Code:
echo "${DATA}" | grep ${SUM} >> ${DUPLICATES_FILE}
This could have problems if a filename contains the md5sum of another file. (Perhaps not likely to happen, but you never know what people will do.) It would seem safer to check that the first field is right.
Mhmm, I do not quite get it: what exactly is the problem if a filename contains an MD5 sum?
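If the concern is that grep ${SUM} would also match a checksum that happens to appear inside a filename (and not only in the first, checksum field), then comparing just the first field would avoid it. Something like this, assuming ${DATA} holds "sum path" lines:

Code:
# match the checksum field only, not anywhere in the line
echo "${DATA}" | awk -v sum="${SUM}" '$1 == sum' >> ${DUPLICATES_FILE}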

Quote:
Originally Posted by IdOp View Post
The final loop of __md5() again starts with
Code:
echo "${DATA}" \
Wouldn't it be more efficient to base this loop on something like
Code:
cat ${DUPLICATES_FILE}
since this would avoid the need to re-filter out all the non-dup's which you've already done? Of course, the loop content must be adapted for this.
Mhmm, good tip, I must look at it. I generally try to use things that are already in memory or in variables instead of reading a file from disk again, but in this case it can indeed be better. I will take a deeper look at it.
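Something like this then, I suppose, assuming ${DUPLICATES_FILE} holds the "sum path" lines gathered earlier:

Code:
# read the already-filtered file instead of re-filtering ${DATA};
# the redirection also avoids the subshell that a pipe would create
# in most sh implementations
while read SUM FILE
  do
    echo "duplicate (${SUM}): ${FILE}"
  done < ${DUPLICATES_FILE}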

Quote:
Originally Posted by IdOp View Post
Next,
Code:
rm -rf ${DUPLICATES_FILE}
You could omit the "r" here.
Yes, habits ... but at least it does not do any harm.
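So it becomes just:

Code:
rm -f ${DUPLICATES_FILE}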

Quote:
Originally Posted by IdOp View Post
Finally, concerning speed: It seems like you're computing the md5sum of every file in the tree. Maybe you want to do that for theoretical reasons, but have you considered doing the md5 check as an add-on after the size method? In other words, use size as a quick hash to get potential duplicates (all dup files must have the same size) and then go at only those with md5. This might speed things up a lot. (It might also add infinitesimally to reliability, since files with the same md5sum but different sizes would not be identified.)
After messing with backups of my gf I mostly found a lot of files that had different names but the same content: often pictures, movies, mp3 files, PDFs, other documents, almost everything. IMHO a separate -s/-S option is enough for checking by size, and the MD5 sum does its job here, but it is always nice to know how others see it, thanks.
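For reference, if I ever try it, I suppose the idea would look roughly like this (just a sketch: ${DIR} and ${SIZES_FILE} are placeholders here, and stat -f / md5 -r are the FreeBSD flavours):

Code:
# 1) list 'size path' for every file, sorted by size
find ${DIR} -type f -exec stat -f '%z %N' {} + | sort -n > ${SIZES_FILE}

# 2) hash only the files whose size occurs more than once,
#    since a file with a unique size cannot have a duplicate
awk 'NR == FNR { count[$1]++; next }
     count[$1] > 1 { sub(/^[0-9]+ /, ""); print }' ${SIZES_FILE} ${SIZES_FILE} \
  | while read FILE
    do
      md5 -r "${FILE}"
    done | sort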

Quote:
Originally Posted by IdOp View Post
Anyway, maybe one of those comments is useful. Thanks again, and good luck with it.
Thanks mate.