SLUG Mailing List Archives
Re: [SLUG] search engine for company network (OT)
- To: Sebastian Spiess <sebastian.spiess@xxxxxxxxx>
- Subject: Re: [SLUG] search engine for company network (OT)
- From: Gerard Blacklock <gblacklo@xxxxxxxxxxxxxx>
- Date: Wed, 14 May 2008 23:46:35 +1000
- Cc: slug@xxxxxxxxxxx
- User-agent: Thunderbird 220.127.116.11 (Windows/20080421)
Sebastian Spiess wrote:
I know this is not a 100% linux related question but it's open source
On our company network we have a daily growing number of documents in
lots of folders and stuff. Most of it is organised in project folders
and has reoccurring folder structures and file names.
We are working hard on giving it more and clearer structure but
sometimes it is still hard to find some files.
I want to suggest installing a search engine which will index our
existing files so that employees can search quickly through projects
I've heard of the various desktop search engines like beagle, tracker
and google desktop but are there open source engines which can be run
on a server so that many can connect to it and search?
Sadly we are relying on MS Office (2001), AutoCAD (R16 to 2008) and
other proprietary software in our daily work so those kinds of files
would need to be indexed.
Does anyone have an idea, something I could investigate further?
I had a brief look through the replies so far and no one has mentioned IBM
OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/index.php),
probably because it's not open source :) However, it is free and looks to
remain free for some time. I think it might be right up your alley!
I was in the same position as you some time ago: a small company with lots
and lots of docs, ranging from PDFs in a technical library to CAD files
of differing formats to Word and OpenOffice documents, pictures etc. in
project folders. We have a fairly stringent file system management plan
in place, but when you're not quite sure what you're looking for, a decent
indexed search goes a long, long way, especially when looking for that
darn part number buried deep in a CAD drawing which you did 5 years ago
:). Note that it picks up .dwg files (even non-AutoCAD ones :) ) and a
whole range of other file types. It has a limit of 200,000 files and a
maximum of 5 collections, but this should cover most small businesses.
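If you want to check whether your share fits under that 200,000-file
limit before installing anything, a rough count by extension is enough.
Below is a minimal Python sketch, assuming the share is mounted as a
local path. The function names and the way the limit is checked are my
own illustration, not anything that ships with OmniFind:

```python
import os
import sys
from collections import Counter

# Limit of the free edition as quoted above; adjust if yours differs.
FILE_LIMIT = 200_000


def count_files_by_extension(root):
    """Walk root and return a Counter of lowercased extension -> file count."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


def fits_under_limit(counts, limit=FILE_LIMIT):
    """Return True if the total number of files does not exceed limit."""
    return sum(counts.values()) <= limit


if __name__ == "__main__":
    # Hypothetical usage: pass the mounted share as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    counts = count_files_by_extension(root)
    for ext, n in counts.most_common(10):
        print(f"{ext:10} {n}")
    total = sum(counts.values())
    print(f"total: {total} (limit {FILE_LIMIT}, fits: {fits_under_limit(counts)})")
```

The per-extension breakdown also shows how much of the share is .dwg,
.doc and so on, which is handy when comparing what each engine can
actually index.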
Some of the others I tried/looked at:
*Regain* - http://regain.sourceforge.net/ - actually the best after IBM
*Terrier* - http://ir.dcs.gla.ac.uk/terrier/
*Egothor* - http://www.egothor.org/
*Lucene* - http://lucene.apache.org/java/docs/index.html - IBM
OmniFind uses this too!
Below is a quick note from my work diary when I was researching and
"This was by far the easiest, most advanced (in terms of
development) and provided the best results of all the software
that I tested and looked at. The only issue when installing was a
missing Java RHEL compatibility package; once this was yummed on my test
server the install went very smoothly.
The software has a web interface for configuring and searching and runs
on a port of its own using a bundled Java servlet server, Jetty, I think.
The download package includes its own Java runtime environment, which
alleviates the pain of trying to get the right version or, for that
matter, a working version of Java.
The crawling process is pretty resource hungry but seems very quick for
what it is doing. The results are even more surprising: lots of results
and fairly relevant ones at that. Out of all the software that I tried,
this picked up the most file types and searched the most files. Sometimes
the crawler does not index every file, but that is something I am working
on. I currently have it indexing over 200,000 files and it only results
in an index size of 4-5 GB, and that's without caching the files...
The catch with this software? Well, it is not entirely open source: it
uses the Lucene package but also incorporates some fairly heavy stuff
from Yahoo and IBM. They have also stated that they do not plan to make
this particular version a paid one. They have an enterprise version for
more than 500,000 files. They seem to be trying to get a foot in the
searching world by providing a free version to entice people/companies in.
Probably not such a bad idea, g$$gle really needs some proper competition"
Feel free to contact me if you think I can help out - I would be happy
to try! I can even send you a de-sensitized screenie of a typical search
on our server!
"In God we trust, all others bring data"
-- Framed plaque from the '60s, hanging in the Mission Evaluation Room at Johnson Space Center, downstairs from Mission Control.