Table of Contents
- Original software
-
- CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
- Classias: a collection of machine-learning algorithms for classification
- Marginal Containers Covering Relevant Items (MACCORI)
- libLBFGS
- SimString
- Static Double Array Trie (DASTrie)
- C++ implementation of Constant Database (CDB++)
- Constant Quark Database (CQDB)
- Polaris
- Redistribution
- Self-explainable programs
This section collects the software packages developed by me.
CRFSuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data. The first priority of this software is to train and use CRF models as fast as possible even at the expense of its memory space and code generality. CRFsuite runs 5.4 - 61.8 times faster than C++ implementations for training. CRFsuite supports parameter estimation with L1 regularization (Laplacian prior) using Orthant-Wise Limited-memory Quasi-Newton (OW-LQN) method and L2 regularization (Gaussian prior) using Limited-memory BFGS (L-BFGS) method.
CRFsuite: a fast implementation of Conditional Random Fields (CRFs) Web site
Classias is a collection of machine-learning algorithms for classification. Currently, this software supports the following formalizations: L1/L2-regularized logistic regression (aka. Maximum Entropy); L1/L2-regularized L1-loss linear-kernel Support Vector Machine (SVM); and Averaged perceptron. It implements several algorithms for training classifiers: Averaged perceptron, L-BFGS, OWL-QN, Pegasos, Truncated Gradient.
Marginal Containers Covering Relevant Items (MACCORI) is a program that finds a subset of given N containers that maximizes the utility (score) of the subset subject to a given constraint. Each container consists of a set of M items and their weights. A total score of a subset is defined to prefer relevant and non-redundant items, i.e., a resultant subset is expected to cover a wide-range of items with high weights. This software was developed for the task of sentence extraction for multi-document summarization.
Marginal Containers Covering Relevant Items (MACCORI) Web site
libLBFGS is a C port of the implementation of Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method written by Jorge Nocedal in FORTRAN. Unlike C codes generated automatically by f2c (Fortran 77 into C converter), this port includes changes based on my interpretations, improvements, optimizations, and clean-ups so that the ported code would be well-suited for a C code. It facilitates some optimizations and enhancements such as callback interface, thread safety, and SSE/SSE2 optimization.
SimString is an implementation of a simple and efficient algorithm for approximate string matching. Approximate string matching is the operation to retrieve strings in a string collection (database) whose similarity with a query string is no smaller than a threshold. Finding not only identical but also similar strings, SimString facilitates various applications including spelling correction, fuzzy search, approximate dictionary matching, duplicate record detection, database merging.
Trie is a data structure of ordered tree that implements an associative array. Looking up a record key (usually a string) is very efficient, which takes O(1) with respect to the number of stored records n. Trie is also known for efficient prefix matching, where the retrieved key strings are the prefixes of a given query string. Static Double Array Trie (DASTrie) is a C++ template library of static double-array trie. For the simplicity and efficiency, DASTrie focuses on building a static double array from a list of records sorted by dictionary order of keys. One can implement associative arrays (e.g., std::map) with arbitrary value types and/or sets (e.g., std::set), only by including a header file. DASTrie implements double arrays whose each element is 4 or 5 bytes long, whereas most implementations consume 8 bytes for an double-array element.
C++ implementation of Constant Database (CDB++) is a light-weight library for static hash database. By including a single header file (cdbpp.h), one can build compact hash database and search entries very quickly. However, CDB++ does not support dynamic update and deletion of elements from an existing database. CDB++ is suitable for implementing a database in which fast look-ups of keys and their values are essential while a database update rarely occurs.
Constant Quark Database (CQDB) is a database library specialized for serialization and retrieval of static associations between strings and integer identifiers. The database stores two-way associations between strings and integer identifiers; one can retrieve integer identifiers from strings and/or strings from integer identifiers. The performance of the database system is quite high: read/write speed is 10 to 100 times faster than existing database libraries; and the database size is about half of the one generated by existing database libraries. However, CQDB does not support assigning a unique integer identifier for a given string, modify associations, nor check collisions in strings and identifiers. CQDB is suitable for implementing dictionaries in which fast look-ups of strings and identifiers are essential while a dictionary update rarely occurs.
Polaris is an implementation of data-mining framework based on KeyGraph and the process of Chance Discovery proposed by Prof. Yukio Ohsawa. Polaris is a data-mining framework designed for the Double-helical Process of Chance Discovery and KeyGraph proposed by Prof. Yukio Ohsawa. It visulizes an input data as a graph by analyzing coccurrence relations in the data and extracting items that connects two or more frequent clusters (items).
This section redistributes some useful software.
eblook is a CUI frontend for computerized dictionaries developed by Keisuke Nishida, Kazuhiko, Satomi, and Takashi Nemoto. The routine of dictionary access is based on EB library. This software is useful to access some language resources such as Nihongo GoiTaikei (Japanese Lexicon CD-ROM). eblook was released under GNU General Public License (GPL).
This section collects small programs that are simple enough to omit documentations. The source codes shall tell everything about their usage. All programs in this section are released under zlib license.
Class optparse implements a parser for GNU-style command-line arguments. Inherit this class to define your own option variables and to implement an option handler with macros, BEGIN_OPTION_MAP, ON_OPTION(_WITH_ARG), and END_OPTION_MAP. Refer to the sample program attached at the bottom of the source code.
Template function strsplit implements a splitter for string values deliminated by a input_separator that can be escaped by a quotation character. This function can extract comma-separated values (CSV) or tab-separated values (TSV) from a line represented by a C++ std::string class. The splitted values will be stored into a object that is compatible to container adaptors such as std::vector and std::list. Refer to the sample program attached at the bottom of the source code.
Quark allocates an unique ID for each string and holds associations so that we can obtain the string value associated with an ID and/or the ID value associated with a string value. Since string matching is slower than integer comparison, it's a common technique for speed/memory optimization that an application converts all string values into integer identifiers, does some process with the integer values, and then restores the original string values.