SimString - A fast and simple algorithm for approximate string matching/retrieval

Introduction

SimString is a simple library for fast approximate string retrieval. Approximate string retrieval finds strings in a database whose similarity with a query string is no smaller than a threshold. Finding not only identical but similar strings, approximate string retrieval has various applications including spelling correction, flexible dictionary matching, duplicate detection, and record linkage.

SimString supports cosine, Jaccard, dice, and overlap coefficients as similarity measures. SimString uses letter n-grams as features for computing string similarity.

SimString has the following features:

Fast algorithm for approximate string retrieval. For example, SimString can find strings in Google Web1T unigrams (13,588,391 strings) that have cosine similarity ≧0.7 in 1.10 [ms] per query (on Intel Xeon 5140 2.33 GHz CPU).
100% exact retrieval. Although some algorithms allow misses (false positives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
Unicode (wchar_t) support. For languages using multi-byte characters, developers can use Unicode characters (wchar_t) instead of single-byte characters (char) as a character representation.
Implementation in C++ header files. Developers can add the funtionality of approximate string retrieval into C++ programs just by including a header file.
Python and Ruby bindings via SWIG. Developers can easily perform approximate string retrieval in scripting languages.

Download

The current release is SimString version 1.0.

Source code

SimString is distributed under the modified BSD license.

Please use the following BibTex entry when you cite SimString in your papers.

@InProceedings{Okazaki:Coling2010,
  author    = {Okazaki, Naoaki  and  Tsujii, Jun'ichi},
  title     = {Simple and Efficient Algorithm for Approximate Dictionary Matching},
  booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
  month     = {August},
  year      = {2010},
  address   = {Beijing, China},
  pages     = {851--859},
  url       = {http://www.aclweb.org/anthology/C10-1096}
}

Change log

SimString 1.0 (2010-03-07)

Initial release.

How to build

Building simstring utility

To build the simstring utility, please follow the general procedure, configure & make. The source code of the simstring utility is located at frontend/main.cpp. Running "make" builds a binary frontend/simstring.

$ ./configure
$ make
# To install the SimString header files
$ make install

Using the SimString library from C++ programs

Add the include directory in the distribution into the INCLUDE path when compiling your C++ program. Please refer to the sample program in C++, the source code of the simstring utility (frontend/main.cpp), and SimString C++ API Documentation. The API specification is so simple that a developer can use it just by looking at the sample program.

Building a Python/Ruby binding of SimString

Please refer to the sample program of SimString module and SimString SWIG module documentation. The API is so simple that a developer can use it just by looking at the sample program. Currently, build instructions for Python and Ruby modules are available, but it should be easy to build modules for other languages via SWIG. (It would be very helpful if you could submit a sample program in other scripting languages; I'm not so familar with scripting languages other than Python.)

Building a Python module

Build simstring.py and _simstring.so, and install them.

$ ./configure
$ cd swig/python
$ ./prepare.sh
$ python setup.py build_ext
$ python setup.py install

Adding "--inplace" option to the command-line argument for build_ext builds simstring.py and _simstring.so in the current directory. If these files are placed on the directory included in the module path of Python (e.g., the current directory where a Python process is created), one can try the SimString module without installing it.

Building a Ruby module

Build and install simstring.so.

$ ./configure
$ cd swig/ruby
$ ./prepare.sh
$ ruby extconf.rb
$ make
$ make install

Running "make" builds simstring.so in the current directory. If the file is placed in the directory in the module path of Ruby (e.g., the current directory where a Ruby process is created), one can try the SimString module without installing it.

Usage examples

Building a SimString database (web1tuni/web1tuni.db) from the Google Web 1T corpus (web1tuni.txt). The simstring utility reads strings from STDIN (one string per line). Type "simstring -h" for the command-line usage.

$ simstring -b -d web1tuni/web1tuni.db < web1tuni.txt
SimString 1.0 Copyright (c) 2009,2010 Naoaki Okazaki

Constructing the database
Database name: web1tuni/web1tuni.db
N-gram length: 3
Begin/end marks: false
Char type: c (1)
Number of strings: 10000
Number of strings: 20000
Number of strings: 30000
...
Number of strings: 13570000
Number of strings: 13580000
Number of strings: 13588391

Flushing the database

Total number of strings: 13588391
Seconds required: 221.76

Finding strings in the database (web1tuni/web1tuni.db) whose cosine similarity is no smaller than 0.9 with a query. The utility reads query strings from STDIN, and prints retrieved strings to STDOUT. In this example, 12 strings, 1 string, and 5 strings are retrieved instantly for queries, "approximate", "string", and "retrieval", respectively.

$ simstring -d web1tuni/web1tuni.db -t 0.9 -s cosine
approximate
        approximat
        pproximate
        approximate
        approximated
        approximatel
        approximates
        approximatey
        anapproximate
        approximatelt
        approximately
        reapproximate
        toapproximate
12 strings retrieved (0 sec)
string
        string
1 strings retrieved (0 sec)
retrieval
        etrieval
        retrieva
        retrieval
        retrieval.
        retrievals
5 strings retrieved (0 sec)

Building a SimString database using Unicode characters. Add "-u" option to the simstring utility. In this example, a SimString database web1tja/unigrams.db is built from the Japanese N-gram data (web1tja/strings.txt).

$ simstring -bu -d web1tja/unigrams.db < web1tja/strings.txt
SimString 1.0 Copyright (c) 2009,2010 Naoaki Okazaki

Constructing the database
Database name: web1tja/unigrams.db
N-gram length: 3
Begin/end marks: false
Char type: w (4)
Number of strings: 10000
Number of strings: 20000
Number of strings: 30000
...
Number of strings: 2560000
Number of strings: 2565424

Flushing the database

Total number of strings: 2565424
Seconds required: 32.45

An example for retrieving strings in Unicode.

$ simstring -u -d web1tja/unigrams.db -t 0.7 -s cosine
スパゲッティー
        スパゲッティ
        スパゲッテー
        スパゲティー
        スパッティー
        スパゲッティー
        スパゲッティーニ
        スパゲッティー・
        スパッゲッティー
        スパゲッティーニ・
        ・・スパゲッティー
        スープスパゲッティー
        スパゲッティーモンスター
12 strings retrieved (0 sec)
フェットチーネ
        フェットチー
        フェトチーネ
        フットチーネ
        フェットチーネ
        フェットゥチーネ
        フェットチーネ・
        フェット・チーネ
        フェットゥッチーネ
8 strings retrieved (0 sec)

An example using SimString in the interactive mode of Python.

$ python
Python 2.5.1 (r251:54863, Oct 15 2007, 23:38:19)
[GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simstring
>>> db = simstring.reader('web1tuni/web1tuni.db')
>>> db.measure = simstring.cosine
>>> db.threshold = 0.9
>>> db.retrieve('approximate')
('approximat', 'pproximate', 'approximate', 'approximated', 'approximatel', 'appr
oximates', 'approximatey', 'anapproximate', 'approximatelt', 'approximately', 're
approximate', 'toapproximate')
>>> db = simstring.reader('web1tja/unigrams.db')
>>> db.measure = simstring.cosine
>>> db.threshold = 0.7
>>> print(' '.join(db.retrieve('スパゲッティー')))
スパゲッティ スパゲッテー スパゲティー スパッティー スパゲッティー スパゲッティー
ニ スパゲッティー・ スパッゲッティー スパゲッティーニ・ ・・スパゲッティー スープ
スパゲッティー スパゲッティーモンスター
>>>