SLUG Mailing List Archives

Re: [SLUG] search engine for company network (OT)


Sebastian Spiess wrote:
hi all,

I know this is not a 100% linux related question but it's open source baby :-)

On our company network we have a daily growing number of documents in lots of folders and stuff. Most of it is organised in project folders and has reoccurring folder structures and file names.

We are working hard on giving it more and clearer structure but sometimes it is still hard to find some files.

I want to suggest installing a search engine which will index our existing files so that employees can crawl quickly through a project's history.

I've heard of the various desktop search engines like Beagle, Tracker and Google Desktop, but are there open source engines which can be run on a server so that many people can connect to it and search?

Sadly we are relying on MS Office (2001), AutoCAD (R16 to 2008) and other proprietary software in our daily work, so those kinds of files would need to be indexed.


Does anyone have an idea, something I could investigate further? A software name?


cheers, seb

Hi Sebastian,

I had a brief look through the replies so far and no one has mentioned IBM OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/index.php), probably because it's not open source :). However, it is free and looks set to remain free for some time. I think it might be right up your alley!

I was in the same position as you some time ago: a small company with lots and lots of docs, ranging from PDFs in a technical library to CAD files of differing formats to Word and OpenOffice documents, pictures etc. in project folders. We have a fairly stringent file system management plan in place, but when you're not quite sure what you're looking for, a decent indexed search goes a long, long way, especially when looking for that darn part number buried deep in a CAD drawing which you did 5 years ago :). Note that it picks up .dwg files (even non-AutoCAD ones :) ) and a whole range of other file types. It has a limit of 200 000 files and a maximum of 5 collections, but this should cover most small businesses.

Some of the others I tried/looked at:
*Regain* - http://regain.sourceforge.net/ - actually the best after IBM OmniFind!
*Terrier* - http://ir.dcs.gla.ac.uk/terrier/
*Egothor* - http://www.egothor.org/
*Lucene* - http://lucene.apache.org/java/docs/index.html - IBM OmniFind uses this as well!

Below is a quick note from my work diary, written when I was researching and trying out solutions:

"This was by far the easiest and most advanced (in terms of development), and it provided the best results of all the software that I tested and looked at. The only issue when installing was a missing Java RHEL compatibility package; once this was yummed on my test server the install went very smoothly.

The software has a web interface for configuring and searching, served on its own port by its own bundled Java servlet server (Jetty, I think). The download package includes its own Java runtime environment, which alleviates the pain of trying to get the right version or, for that matter, a working version of Java.

The crawling process is pretty resource hungry but seems very quick for what it is doing. The results are even more surprising: lots of results, and fairly relevant ones at that. Out of all the software that I tried, this picked up the most file types and searched the most files. Sometimes the crawler does not index every file, but that is something I am working on. I currently have it indexing over 200,000 files and it only results in an index size of 4-5 GB, and that's without caching the files….

The catch with this software? Well, it is not entirely open source: it uses the Lucene package but also incorporates some fairly heavy stuff from Yahoo and IBM. They have also stated that they do not plan to make this particular version a paid one. They have an enterprise version for more than 500,000 files. They seem to be trying to get a foot in the searching world by providing a free version to entice people/companies in. Probably not such a bad idea, g$$gle really needs some proper competition."
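
To give a feel for what "crawl, then search an index" means in practice, here is a rough Python sketch of the general idea. It is only an illustration and not how OmniFind or Lucene work internally. The function names (tokenize, build_index, search) and the list of plain-text extensions are made up for this example. Real engines add format-specific parsers for things like .dwg and Office files, plus relevance ranking; this sketch indexes file names for everything and file contents only for plain-text files.

```python
import os
import re
from collections import defaultdict

# Hypothetical helper names; this is a toy inverted index, not a real engine.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def build_index(root, text_extensions=(".txt", ".csv")):
    """Walk `root` and map each token to the set of file paths containing it."""
    index = defaultdict(set)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            # Always index the file name, so binary files can be found by name.
            for token in tokenize(name):
                index[token].add(path)
            # Assumption: only these extensions are safe to read as plain text.
            if name.lower().endswith(text_extensions):
                try:
                    with open(path, encoding="utf-8", errors="ignore") as handle:
                        for token in tokenize(handle.read()):
                            index[token].add(path)
                except OSError:
                    continue  # unreadable file: skip it and carry on crawling
    return index


def search(index, query):
    """Return the paths that contain every token in the query (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Usage would be something like index = build_index("/path/to/projects") followed by search(index, "bracket 4471"), where the path and the query are placeholders. The expensive crawl happens once, and each search afterwards is just a few set lookups, which is why an indexed search feels so much faster than grepping through the share.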


Feel free to contact me if you think I can help out - I would be happy to try! I can even send you a de-sensitized screenie of a typical search on our server!

--
Best Regards,

Gerard

"In God we trust, all others bring data"
-- Framed plaque from the '60s, hanging in the Mission Evaluation Room at Johnson Space Center, downstairs from Mission Control.