LUGOD: Linux Users' Group of Davis
Page last updated:
2005 Jan 13 12:46

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] [help@google.com: Re: [#19464334] Searching for dotfiles]

On Thu, Jan 13, 2005 at 03:13:23PM -0500, Peter Jay Salzman wrote:
> Anyone who searches for ".vimrc" means ".vimrc".  In this case, the dot is a
> literal, so ".forward" is as distinct from "forward" as "cat" is distinct
> from "dog".

These (Unix dotfiles) are a good special case that Google should consider
handling.  General punctuation would be hell, though.

Right now they probably only index ".*forward.*" as "forward", and stick
the page in that word's bucket.  This means they can't search for ".forward"
or "forward?" distinctly from "forward", "forward." or "forward:".

Not quickly, at least.  See my other post about the possibility of using
their own Google Cache feature for searching full text, once the general
term has been found on a set of pages...
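
As a rough sketch of why those variants all collide: assume the indexer
lowercases text and treats every run of non-alphanumeric characters as a
separator.  That rule is only a guess at the behavior described above, not
Google's actual tokenizer, and 'normalize' is a made-up helper name.

```shell
# Hypothetical normalizer: lowercase the input, then turn every run of
# non-alphanumeric characters into a newline, so punctuation never survives.
normalize() {
  tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sed '/^$/d'
}

# Each punctuated variant is reduced to the same index key.
for term in '.forward' 'forward?' 'forward.' 'forward:' 'forward'; do
  printf '%-10s -> %s\n' "$term" "$(printf '%s\n' "$term" | normalize)"
done
```

Every line of output ends in plain "forward", which is why the index alone
cannot tell the variants apart.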

I guess you could think of it like this:

  "I want to search for '.forward' on my computer"

  $ find / -type f -exec grep -l "\.forward" {} \;


That'd be slow.  But if we had a list of common terms that are contained
in pages (indexed on a regular basis, but NOT right when you go to search),
it'd be a lot faster.  This is what most search engines do.

A kind of backwards way of doing it could be like this:

  $ find / -type f -exec grep -l "forward" {} \; > files-with-forward.txt
  $ find / -type f -exec grep -l "backward" {} \; > files-with-backward.txt
  $ find / -type f -exec grep -l "upsidedown" {} \; > files-with-upsidedown.txt

(Really, what a search engine does is just "find every file", and then,
rather than 'grep' for a set of known words, it looks at "what words are in
this file/page?" and keeps an index of those words, adding a reference to the
page under each one.  If it finds a new word, it just creates a new 'bucket'
to store page references in...)
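
Here is a minimal sketch of that "one bucket per word" idea, using one small
file per word as the bucket.  The function name 'build_index', the directory
layout and the two sample files are all made up for illustration, and the
word-splitting rule (lowercase, drop all punctuation) is an assumption.

```shell
# Build a tiny inverted index: one 'bucket' file per word, each listing the
# files that contain that word.  All paths here are throwaway examples.
build_index() {
  docs=$1    # directory of files to index
  index=$2   # directory that will hold one bucket file per word
  mkdir -p "$index"
  find "$docs" -type f | while IFS= read -r file; do
    # Split the file into unique lowercase words, dropping all punctuation.
    tr '[:upper:]' '[:lower:]' < "$file" | tr -cs '[:alnum:]' '\n' |
      sed '/^$/d' | LC_ALL=C sort -u |
      while IFS= read -r word; do
        # Appending creates the bucket the first time a word is seen.
        printf '%s\n' "$file" >> "$index/$word"
      done
  done
}

# Demo on two made-up files.
work=$(mktemp -d)
mkdir "$work/docs"
printf 'Put your address in ~/.forward to redirect mail.\n' > "$work/docs/a.txt"
printf 'Press the forward button. Then go backward.\n' > "$work/docs/b.txt"
build_index "$work/docs" "$work/index"
ls "$work/index"             # one bucket per distinct word
cat "$work/index/forward"    # both files land in the same bucket
```

Note that the dot in "~/.forward" is gone by the time the word reaches its
bucket, so both sample files end up listed under "forward".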


So okay, now that we have an index of files containing particular words,
we can search for them.  Instead of doing:

  $ find / -type f -exec grep -l "forward" {} \;

we can just do:

  $ cat files-with-forward.txt


MUCH speedier!



So my proposal, albeit also a slow one (on the Internet scale, at least), is
this.  Say we want to find all files with the term ".forward" in them.
First, we take the term and massage it into something we keep track of.
(In this case, stripping the punctuation, a bit like stemming, to get just
the word "forward".)

  $ cat files-with-forward.txt

But, like with Google, that gives us EVERYTHING, regardless of punctuation.

So we just 'search our Google cache', like so:

  $ grep -l "\.forward" `cat files-with-forward.txt`
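
The two steps can be rolled into one small function.  This is only a sketch
under some assumptions: 'exact_search' is a made-up name, the index is a
directory holding one bucket file per normalized word, and the bucket in the
demo is written by hand.  It reads the candidate list line by line, so paths
containing spaces survive, and uses 'grep -F' so the dot is taken literally.

```shell
# Two-stage search: normalize the query to find its bucket, then grep only
# the candidate files for the exact, punctuated string.
exact_search() {
  term=$1    # literal string wanted, e.g. '.forward'
  index=$2   # directory of bucket files, one per normalized word
  # Stage 1: strip punctuation; use the first remaining word as the bucket key.
  key=$(printf '%s\n' "$term" | tr '[:upper:]' '[:lower:]' |
        tr -cs '[:alnum:]' '\n' | sed '/^$/d' | head -n 1)
  # No usable key or no such bucket means no candidates at all.
  [ -n "$key" ] && [ -f "$index/$key" ] || return 0
  # Stage 2: fixed-string grep over just the candidates, one path per line.
  while IFS= read -r file; do
    # grep exits non-zero on "no match"; that is not an error here.
    grep -lF -- "$term" "$file" || true
  done < "$index/$key"
}

# Demo with two made-up files and a hand-built bucket.
work=$(mktemp -d)
mkdir "$work/docs" "$work/index"
printf 'Put your address in ~/.forward to redirect mail.\n' > "$work/docs/a.txt"
printf 'Press the forward button.\n' > "$work/docs/b.txt"
printf '%s\n' "$work/docs/a.txt" "$work/docs/b.txt" > "$work/index/forward"
exact_search '.forward' "$work/index"   # lists only a.txt
exact_search 'forward'  "$work/index"   # lists both files
```

The expensive grep only ever touches files already in the "forward" bucket,
which is the whole point of doing the cheap index lookup first.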



*WHEW!*  Make sense?  I hope I didn't make any glaring mistakes. ;)


<snip>
> Would that REALLY cause their database to melt down in panic?

It would if suddenly every variation of a 'word' became its own searchable
thing.  In my above example, we'd go from one 'bucket' labelled
"pages with the word 'forward' in them", to one for every variation...

A bucket for ".forward", a bucket for "forward.", a bucket for
"forward," a bucket for "forward;", a bucket for "forward?", a bucket for
"forward!", ... and so on. :^)  (Oh hey, maybe we want to search for
"forward...", too... distinct from "forward." ;^) )

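To put a rough number on that, here is a small sketch that counts how many
'forward' buckets one sentence produces under two splitting rules.  The
sample sentence is made up, and both rules are simplifications of whatever a
real indexer does.

```shell
# Count index buckets under two tokenizing rules on the same made-up text.
sample='Forward it. Use .forward, not forward; forward? forward! forward...'

# Rule A: punctuation is stripped, so every variant shares one bucket.
stripped=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]' |
           tr -cs '[:alnum:]' '\n' | sed '/^$/d' |
           grep -x 'forward' | LC_ALL=C sort -u | grep -c .)

# Rule B: split on spaces only, so punctuation stays attached to the word.
kept=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]' |
       tr -s ' ' '\n' | sed '/^$/d' |
       grep -F 'forward' | LC_ALL=C sort -u | grep -c .)

echo "buckets for 'forward' with punctuation stripped: $stripped"
echo "buckets for 'forward' with punctuation kept:     $kept"
```

One short sentence already turns one bucket into six, and that is for a
single word.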

But, again, I AM arguing that they SHOULD take into account dotfile naming
conventions.  (At LEAST in their  http://www.google.com/linux  sub-site!)


-bill!
bill@newbreedsoftware.com          April shower bring Kompressor power!
http://newbreedsoftware.com/
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech


