l i n u x - u s e r s - g r o u p - o f - d a v i s
Next Meeting:
July 7: Social gathering
Next Installfest:
Latest News:
Jun. 14: June LUGOD meeting cancelled
Page last updated:
2005 Jul 08 08:49

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Report this post as spam:

(Enter your email address)
Re: [vox-tech] Matching Contents of Lists
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [vox-tech] Matching Contents of Lists

Lango, Trevor M. wrote:
> I have two lists, not necessarily of the same length.  List #1 has two
> columns.  List #2 has one column.  I would like to do the following:
> Scan list #1 line by line.  If a match for column #1 in list #1 is found
> in list #2, extract the matching lines and put them in a new list (#3).
> Otherwise, leave the contents of lists #1 and #2 as they are.
> If I expected the contents of the first column of each list to match
> exactly (character for character) - this would be a simple task with C++
> or the like.  However, the contents will not necessarily be perfectly
> identical.  I do believe they are nearly identical enough though to use
> pattern matching via Perl or the like.  Personally this is difficult for
> me (as a Perl noob), I know how to scan through a file for a
> pre-determined pattern - I don't understand how to scan through a file
> for a pattern that is essentially given by a line in another file...?  I
> have not found anything in my reading of Perl documentation that
> explains how to read a file and use its contents as an argument for the
> pattern to search for in another file (suggestions on excellent Perl doc
> sources appreciated also!).
> This is what the contents of the lists may look like:
> TALL0047A
> TAL0047A
> TAL047A
> TAL47A
> TA0047A
> TA047A
> TA47A
> T0047A
> T047A
> T47A
> T0047
> T047
> T47
> Examples of matching:
> TALL0047A    TALL047A    match
> TALL0047A    TAL0047A	    not a match
> TALL0047A    TAL0470A	    not a match
> The contents will always be one to four alpha characters followed by one
> to four numeric characters possibly followed by one or two alpha
> characters.
> A match would be defined as the following criteria being met:
> - The last one to four digits being identical (excluding leading zeroes)
> - The first one to four letters being identical
> It is absolutely imperative that any algorithm used does not produce
> false positives - if a line is extracted as a match - it must without a
> doubt actually be a match.  It is not so critical if a possible match is
> passed up.  The lists will contain thousands or tens of thousands of
> entries - just looking for a clever way to automate as much of the
> process as possible.  I expect to have to check a portion of the lists
> by hand - I would simply appreciate reducing the number of lines that
> have to be checked manually.  Perl seems ideal - I'm just not savvy
> enough with it (yet!) to make it work.  I know there are some Perl gurus
> lurking about in LUGOD so if any of you have a spare moment to lend this
> some thought - thanks!
> Thank you in advance for any suggestions!
> - Trevor

Sounds like a good job for a SQL database. this algorithm is in a
language resembling Oracle PL/SQL.

create table one (col1 varchar(10));
create table two (col1 varchar(10), col2 varchar(100));

create or replace function matches(one varchar, other varchar) returns
	if regexp_replace(one,'^([A-Z])*0*([1-9][0-9]*)','\1') !=
           regexp_replace(other,'^([A-Z])*0*([1-9][0-9]*)','\1') then
            return false;
	if regexp_replace(one,'^([A-Z])*0*([1-9][0-9]*)','\2') !=
           regexp_replace(other,'^([A-Z])*0*([1-9][0-9]*)','\2') then
            return false;
	return true;

select two.col2 from one,two where matches(one.col1,two.col1);

I usually have a GPG digital signature included as an attachment.
See http://www.gnupg.org/ for info about these digital signatures.

Attachment: signature.asc
Description: OpenPGP digital signature

vox-tech mailing list

LUGOD Group on LinkedIn
Sign up for LUGOD event announcements
Your email address:
LUGOD Group on Facebook
'Like' LUGOD on Facebook:

Hosting provided by:
Sunset Systems
Sunset Systems offers preconfigured Linux systems, remote system administration and custom software development.

LUGOD: Linux Users' Group of Davis
PO Box 2082, Davis, CA 95617
Contact Us

LUGOD is a 501(c)7 non-profit organization
based in Davis, California
and serving the Sacramento area.
"Linux" is a trademark of Linus Torvalds.

Sponsored in part by:
Sunset Systems
Who graciously hosts our website & mailing lists!