MG4J (Managing Gigabytes for Java) is a collaborative effort aimed at providing a free Java implementation of inverted-index compression techniques; as a by-product, it offers several general-purpose optimised classes, including fast & compact mutable strings, bit-level I/O, fast unsynchronised buffered streams and (possibly signed) minimal perfect hashing.

Generating full-text inverted indices for very large sets of documents (say, hundreds of millions) and accessing them efficiently is a nontrivial task. MG4J tries to make the techniques described in the book Managing Gigabytes, by Ian Witten, Alistair Moffat and Timothy Bell, accessible in a clean, object-oriented environment, without requiring you to deal with bit-level operations.

MG4J provides layered access to index construction and access. At the highest level, you can {@linkplain it.unimi.dsi.mg4j.tool build an index using the command-line tools}, open it using an {@link it.unimi.dsi.mg4j.index.Index}, and then interrogate it using our {@link it.unimi.dsi.mg4j.query.parser.QueryParser}, which will turn a query into a {@link it.unimi.dsi.mg4j.search.DocumentIterator}. Or you can {@linkplain it.unimi.dsi.mg4j.index.Index#getReader() get} from an {@link it.unimi.dsi.mg4j.index.Index} an {@link it.unimi.dsi.mg4j.index.IndexReader}, from which, given a term, you can obtain a {@link it.unimi.dsi.mg4j.index.IndexIterator} returning all documents containing the term (and the positions of the term in the document, if the index is full text).
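To make the lower-level access pattern concrete, here is a minimal, self-contained sketch of a toy in-memory inverted index. It does not use the MG4J API: the class ToyInvertedIndex, its Posting type and the methods add() and documents() are hypothetical names chosen for illustration only, and nothing is compressed. It merely shows the idea of asking for a term and iterating over the documents that contain it, together with the positions of the term in each document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration only: an uncompressed, in-memory inverted index (not MG4J code). */
public class ToyInvertedIndex {
    /** One posting: a document number and the positions of the term in that document. */
    static final class Posting {
        final int document;
        final List<Integer> positions = new ArrayList<>();
        Posting(int document) { this.document = document; }
    }

    /** Maps each term to its postings, in increasing document order. */
    private final Map<String, List<Posting>> postings = new TreeMap<>();
    private int nextDocument = 0;

    /** Indexes a whitespace-separated document and returns the number assigned to it. */
    int add(String text) {
        final int doc = nextDocument++;
        final String[] words = text.trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            List<Posting> list = postings.computeIfAbsent(words[pos], k -> new ArrayList<>());
            // Start a new posting the first time the term appears in this document.
            if (list.isEmpty() || list.get(list.size() - 1).document != doc) list.add(new Posting(doc));
            list.get(list.size() - 1).positions.add(pos);
        }
        return doc;
    }

    /** Given a term, returns an iterator over the documents containing it (empty if none). */
    Iterator<Posting> documents(String term) {
        return postings.getOrDefault(term, Collections.emptyList()).iterator();
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("managing gigabytes for java");
        index.add("java bit streams and java strings");
        for (Iterator<Posting> it = index.documents("java"); it.hasNext();) {
            Posting p = it.next();
            System.out.println(p.document + " " + p.positions);
        }
        // prints "0 [3]" and then "1 [0, 4]"
    }
}
```

A real index stores the same information (document pointers and positions) in compressed form on disk, which is exactly the part the library is meant to hide.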

MG4J is distributed under the GNU Lesser General Public License.

History and Motivation

MG4J is a spin-off of the Ubi Project: after the development of a distributed, fault-tolerant crawler, a set of tools to index the results of a crawl was clearly a necessity. Since all techniques implemented are standard, distributing the resulting software seemed a good idea.

Writing, in Java, code that (essentially) has to roll bits over and over may seem a Bad Thing™. However, one should take into consideration the following points:

Conventions

None of the classes is synchronised. If multiple threads access an instance of one of these classes concurrently, and at least one of the threads modifies it, it must be synchronised externally. Iterators will behave unpredictably in the presence of concurrent modifications.
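As an illustration of what external synchronisation means in practice, the following sketch guards a shared, unsynchronised object with a single lock owned by the caller. The class UnsynchronisedBuffer is a hypothetical stand-in, not an MG4J class, and fillConcurrently() is a helper invented for this example; the only point is that every access, reads included, goes through the same lock.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of external synchronisation around a class that does no locking of its own. */
public class ExternalSyncSketch {
    /** Hypothetical stand-in for an unsynchronised class: no internal locking at all. */
    static final class UnsynchronisedBuffer {
        private final List<Integer> values = new ArrayList<>();
        void add(int v) { values.add(v); }
        int size() { return values.size(); }
    }

    /** Starts several writer threads that share one buffer and one external lock. */
    static int fillConcurrently(int threads, int perThread) {
        final UnsynchronisedBuffer shared = new UnsynchronisedBuffer();
        final Object lock = new Object();
        final Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Every modification happens while holding the caller's lock.
                    synchronized (lock) { shared.add(i); }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Reads are guarded by the same lock as writes.
        synchronized (lock) { return shared.size(); }
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(4, 10_000)); // prints 40000
    }
}
```

Without the synchronized blocks, the concurrent calls to add() could lose updates or corrupt the underlying list, which is the unpredictable behaviour the convention warns about.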

Package Dependencies

MG4J uses three packages providing high-performance containers and algorithms, that is, the COLT distribution, Jal and fastutil. Moreover, all tools require the Java port of GNU getopt, and compiling MG4J requires javacc. Most utility and I/O classes, however, are completely self-contained.