Linux Users' Group of Davis
Page last updated:
2002 Aug 07 13:26

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

RE: [vox-tech] shell script challenge

> I'm using cygwin, and my boss asked me to remove all duplicate files
> from the server.
> The server is on the x: drive of the Windows machine, which means that
> cygwin saw it as /cygwin/x.
> I forget exactly what command I ran to get checksums.txt,
> but it is in the format
> <checksum> *x:<filename>
> The challenge is to find the duplicate checksums and print the file names
> for those checksums.  This is tricky because the directories contain spaces,
> which gawk, sed, etc. see as field separators.  Even if I change the IFS to * and
> then use gawk to print "*x:<fname> <checksum>", sort wouldn't know how
> to deal with it, which would make uniq useless (I think).  If I do it the
> other way, "<checksum> *x:<filename>", sort will work fine but uniq will fail
> because the filename is there.  If I exclude the filename with gawk
> '{ print $1 }', then sort and uniq will work fine but I won't have a
> filename.  So all the combinations I can think of fail.  Does anyone know
> how I can find only the duplicate checksums and the associated file names?
> **I realize that with a l[...] but the problem is that there are 4,575 duplicate
> checksums using:
> cat checksums.txt | awk '{ print $1 }' | sort | uniq -d | wc -l
> and 46,340 files on the server, which seems like it would take an awful
> long time.  Any suggestions?

This may not be exactly what you're looking for since it requires Perl, but
this script should do the job. 

WARNING: I haven't tested this script. perl -c claims that it's
syntactically correct, but if this script wipes your filesystem and drinks
all your beer, I'm not responsible. 

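# Feed it the checksum file on stdin.
while (<>)
{
    # Assumed pattern, reconstructed from the "<checksum> *x:<filename>" format.
    next unless /^(\S+) +\*x:(.*)$/;
    $cksum = $1;
    $fname = $2;
    if ($seen{$cksum})
    {
	print "duplicate: $fname\n";
    }
    else
    {
	$seen{$cksum} = 1;
    }
}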
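
If Perl isn't an option, roughly the same thing can probably be done in a single awk pass. The sketch below is untested against the real checksums.txt and assumes every line looks like "<checksum> *x:<filename>" with no spaces inside the checksum itself. The function name list_dupes is made up for illustration. Unlike a script that only reports the second and later files for a checksum, this one prints every file in each duplicate group, so you can decide which copy to keep.

```shell
#!/bin/sh
# list_dupes (hypothetical name): read "<checksum> *x:<filename>" lines on
# stdin and print each checksum that occurs more than once, followed by
# every file name that carries it.
list_dupes() {
    awk '
    {
        # Find the first " *"; everything after it is the file name,
        # embedded spaces included, so field splitting never touches it.
        pos = index($0, " *")
        if (pos == 0) next            # skip lines not in the assumed format
        sum  = $1
        name = substr($0, pos + 2)
        count[sum]++
        names[sum] = names[sum] "\n  " name
    }
    END {
        # Group order is unspecified; names within a group keep input order.
        for (sum in count)
            if (count[sum] > 1)
                print sum names[sum]
    }'
}
```

Run it as "list_dupes < checksums.txt". Since it reads the file once and keeps only one string per checksum in memory, 46,340 lines should not be a problem.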