SimString is a simple library for fast approximate string retrieval. Approximate string retrieval finds strings in a database whose similarity with a query string is no smaller than a threshold. Finding not only identical but similar strings, approximate string retrieval has various applications including spelling correction, flexible dictionary matching, duplicate detection, and record linkage.
SimString supports cosine, Jaccard, dice, and overlap coefficients as similarity measures. SimString uses letter n-grams as features for computing string similarity.
SimString has the following features:
- Fast algorithm for approximate string retrieval. For example, SimString can find strings in Google Web1T unigrams (13,588,391 strings) that have cosine similarity ≧0.7 in 1.10 [ms] per query (on Intel Xeon 5140 2.33 GHz CPU).
- 100% exact retrieval. Although some algorithms allow misses (false positives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode (wchar_t) support. For languages using multi-byte characters, developers can use Unicode characters (wchar_t) instead of single-byte characters (char) as a character representation.
- Implementation in C++ header files. Developers can add the funtionality of approximate string retrieval into C++ programs just by including a header file.
- Python and Ruby bindings via SWIG. Developers can easily perform approximate string retrieval in scripting languages.
The current release is SimString version 1.0.
SimString is distributed under the modified BSD license.
Please use the following BibTex entry when you cite SimString in your papers.
@InProceedings{Okazaki:Coling2010, author = {Okazaki, Naoaki and Tsujii, Jun'ichi}, title = {Simple and Efficient Algorithm for Approximate Dictionary Matching}, booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)}, month = {August}, year = {2010}, address = {Beijing, China}, pages = {851--859}, url = {http://www.aclweb.org/anthology/C10-1096} }
To build the simstring utility, please follow the general procedure, configure & make.
The source code of the simstring utility is located at frontend/main.cpp
.
Running "make" builds a binary frontend/simstring
.
$ ./configure $ make # To install the SimString header files $ make install
Add the include
directory in the distribution into the INCLUDE path when compiling your C++ program.
Please refer to the sample program in C++, the source code of the simstring utility (frontend/main.cpp
), and SimString C++ API Documentation.
The API specification is so simple that a developer can use it just by looking at the sample program.
Please refer to the sample program of SimString module and SimString SWIG module documentation. The API is so simple that a developer can use it just by looking at the sample program. Currently, build instructions for Python and Ruby modules are available, but it should be easy to build modules for other languages via SWIG. (It would be very helpful if you could submit a sample program in other scripting languages; I'm not so familar with scripting languages other than Python.)
Build simstring.py
and _simstring.so
, and install them.
$ ./configure $ cd swig/python $ ./prepare.sh $ python setup.py build_ext $ python setup.py install
Adding "--inplace" option to the command-line argument for build_ext builds simstring.py
and _simstring.so
in the current directory.
If these files are placed on the directory included in the module path of Python (e.g., the current directory where a Python process is created), one can try the SimString module without installing it.
Build and install simstring.so
.
$ ./configure $ cd swig/ruby $ ./prepare.sh $ ruby extconf.rb $ make $ make install
Running "make" builds simstring.so
in the current directory.
If the file is placed in the directory in the module path of Ruby (e.g., the current directory where a Ruby process is created), one can try the SimString module without installing it.
Building a SimString database (web1tuni/web1tuni.db
) from the Google Web 1T corpus (web1tuni.txt
).
The simstring utility reads strings from STDIN (one string per line).
Type "simstring -h" for the command-line usage.
$ simstring -b -d web1tuni/web1tuni.db < web1tuni.txt SimString 1.0 Copyright (c) 2009,2010 Naoaki Okazaki Constructing the database Database name: web1tuni/web1tuni.db N-gram length: 3 Begin/end marks: false Char type: c (1) Number of strings: 10000 Number of strings: 20000 Number of strings: 30000 ... Number of strings: 13570000 Number of strings: 13580000 Number of strings: 13588391 Flushing the database Total number of strings: 13588391 Seconds required: 221.76
Finding strings in the database (web1tuni/web1tuni.db
) whose cosine similarity is no smaller than 0.9 with a query.
The utility reads query strings from STDIN, and prints retrieved strings to STDOUT.
In this example, 12 strings, 1 string, and 5 strings are retrieved instantly for queries, "approximate", "string", and "retrieval", respectively.
$ simstring -d web1tuni/web1tuni.db -t 0.9 -s cosine approximate approximat pproximate approximate approximated approximatel approximates approximatey anapproximate approximatelt approximately reapproximate toapproximate 12 strings retrieved (0 sec) string string 1 strings retrieved (0 sec) retrieval etrieval retrieva retrieval retrieval. retrievals 5 strings retrieved (0 sec)
Building a SimString database using Unicode characters.
Add "-u" option to the simstring utility.
In this example, a SimString database web1tja/unigrams.db
is built from the Japanese N-gram data (web1tja/strings.txt
).
$ simstring -bu -d web1tja/unigrams.db < web1tja/strings.txt SimString 1.0 Copyright (c) 2009,2010 Naoaki Okazaki Constructing the database Database name: web1tja/unigrams.db N-gram length: 3 Begin/end marks: false Char type: w (4) Number of strings: 10000 Number of strings: 20000 Number of strings: 30000 ... Number of strings: 2560000 Number of strings: 2565424 Flushing the database Total number of strings: 2565424 Seconds required: 32.45
An example for retrieving strings in Unicode.
$ simstring -u -d web1tja/unigrams.db -t 0.7 -s cosine スパゲッティー スパゲッティ スパゲッテー スパゲティー スパッティー スパゲッティー スパゲッティーニ スパゲッティー・ スパッゲッティー スパゲッティーニ・ ・・スパゲッティー スープスパゲッティー スパゲッティーモンスター 12 strings retrieved (0 sec) フェットチーネ フェットチー フェトチーネ フットチーネ フェットチーネ フェットゥチーネ フェットチーネ・ フェット・チーネ フェットゥッチーネ 8 strings retrieved (0 sec)
An example using SimString in the interactive mode of Python.
$ python Python 2.5.1 (r251:54863, Oct 15 2007, 23:38:19) [GCC 3.4.6 20060404 (Red Hat 3.4.6-8)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import simstring >>> db = simstring.reader('web1tuni/web1tuni.db') >>> db.measure = simstring.cosine >>> db.threshold = 0.9 >>> db.retrieve('approximate') ('approximat', 'pproximate', 'approximate', 'approximated', 'approximatel', 'appr oximates', 'approximatey', 'anapproximate', 'approximatelt', 'approximately', 're approximate', 'toapproximate') >>> db = simstring.reader('web1tja/unigrams.db') >>> db.measure = simstring.cosine >>> db.threshold = 0.7 >>> print(' '.join(db.retrieve('スパゲッティー'))) スパゲッティ スパゲッテー スパゲティー スパッティー スパゲッティー スパゲッティー ニ スパゲッティー・ スパッゲッティー スパゲッティーニ・ ・・スパゲッティー スープ スパゲッティー スパゲッティーモンスター >>>