The Lemur Toolkit is a free, open-source toolkit designed to facilitate research in language modeling and information retrieval. It includes technologies for ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification.
Key features of the Lemur Toolkit:
· Sophisticated structured query languages (using InQuery and Indri)
· Support for XML and structured document retrieval
· Used commonly with a wide range of research test collections (e.g., TREC CDs 1-5, wt10g, RCV1, gov, gov2)
· Index your web pages with an "out-of-the-box" site search capability
· Interactive interfaces for Windows, Linux, and Web
· Distributed information retrieval and document clustering applications
· Cross-platform, fast and modular code written in C++
· C++, Java and C# APIs
· Free and open-source software
· In use for over 6 years by a large and growing user community
Indexing:
· Multiple indexing methods for small, medium and large-scale (terabyte) collections
· Built-in support for English, Chinese and Arabic text
· Porter and Krovetz word stemming
· Incremental indexing
· Out-of-the-box indexing support for TREC Text, TREC Web, plain text, HTML, XML, PDF, MBox, Microsoft Word, and Microsoft PowerPoint
· Indexes inline and offset text annotations (e.g., part-of-speech and named entities)
· Indexes document attributes
Retrieval:
· Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
· Relevance- and pseudo-relevance feedback
· Wildcard term expansion (using Indri)
· Passage and XML element retrieval
· Cross-lingual retrieval
· Smoothing via Dirichlet priors and Markov chains
· Supports arbitrary document priors (e.g., PageRank, URL depth)
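
To make the "smoothing via Dirichlet priors" item concrete, here is a minimal sketch of query-likelihood scoring with Dirichlet-prior smoothing. It is a standalone illustration, not the toolkit's own code or API: the function name `dirichlet_score`, the toy documents, and the value `mu=2500.0` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, collection_counts,
                    collection_length, mu=2500.0):
    """Log query likelihood of a document under Dirichlet-prior smoothing.

    Hypothetical helper for illustration; mu is an assumed smoothing value.
    """
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_collection = collection_counts.get(term, 0) / collection_length
        if p_collection == 0.0:
            continue  # term never seen in the collection: skipped in this sketch
        # Document count is blended with mu pseudo-counts from the collection.
        p_smoothed = (doc_counts[term] + mu * p_collection) / (doc_length + mu)
        score += math.log(p_smoothed)
    return score


# Toy collection (invented data) to show the ranking effect.
docs = {
    "d1": "lemur toolkit supports language modeling".split(),
    "d2": "indri supports structured query languages".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_length = sum(collection_counts.values())

query = ["language", "modeling"]
ranked = sorted(
    docs,
    key=lambda d: dirichlet_score(query, docs[d], collection_counts,
                                  collection_length),
    reverse=True,
)
```

With this toy data `ranked` places `d1` first, since it contains both query terms. A larger `mu` pulls every document's estimate closer to the collection model, which matters most for short documents.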
What's New in This Release:
· Version: 4.12
· BUG# 3014524 -- Update Google parser for query log toolbar server.
· BUG# 3014521 -- Query log toolbar server can now be run with an optional hostname parameter, which will be used instead of localhost if specified.
· BUG# 3013328 -- Fix crash on large queries in the CGI.
· BUG# 3013325 -- Fix CGI snippets.
· BUG# 3013315 -- Fix crash in CGI when fewer than 50 documents are returned.
· BUG# 3013313 -- Fix CGI to get document text when using multiple Indri indexes.
· BUG# 3004284 -- Fix memory leaks in QueryEnvironment::expressionCount and QueryEnvironment::expressionList.
· BUG# 3000138 -- Fix snippet generation with queries that use the #max operator.
· BUG# 2989973 -- Prevent SIGPIPE being raised in IndriDaemon.
· BUG# 2985880 -- Prevent field-restricted queries when using the non-LM baseline retrieval.
· BUG# 2982858 -- Modify the query parser to transform hyphenated terms into #1 expressions. This is closest to the result of splitting tokens on hyphens...
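
The hyphenated-term change in BUG# 2982858 can be pictured with a short sketch. This is not the toolkit's query parser: `rewrite_hyphenated` is a hypothetical helper, and the regular expression is an assumption about what counts as a hyphenated term. It only shows the idea of turning a hyphenated token into an Indri `#1(...)` ordered-window expression, which requires the parts to appear adjacent and in order.

```python
import re

# Assumed definition of a hyphenated term: word characters joined by hyphens.
_HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")


def rewrite_hyphenated(query: str) -> str:
    """Rewrite each hyphenated term as a #1 expression (illustrative only)."""
    def to_window(match: re.Match) -> str:
        parts = match.group(0).split("-")
        return "#1(" + " ".join(parts) + ")"

    return _HYPHENATED.sub(to_window, query)


example = rewrite_hyphenated("cross-language retrieval")
```

Here `example` is `#1(cross language) retrieval`; terms without hyphens pass through unchanged.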