SimString SWIG interface

Introduction

This document describes a SWIG interface that bridges SimString with various programming languages, including Python and Ruby. Although the current SimString distribution provides SWIG wrappers for Python and Ruby, it should be straightforward to build libraries for other languages.

The SimString module provides two simple classes, writer and reader. With the writer class, one creates a SimString database using the constructor writer::writer and calls the member function writer::insert to insert a string into the database. With the reader class, one opens an existing SimString database with the constructor reader::reader, specifies a similarity measure and a threshold through the two attributes reader::measure and reader::threshold, and calls the member function reader::retrieve to perform approximate string matching.

The SimString module always uses 8-bit null-terminated byte strings in the writer::insert and reader::retrieve functions. The byte strings may be in any encoding, except that they must be UTF-8 for a database in Unicode mode.
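
To make the distinction between characters and 8-bit byte strings concrete, here is a minimal Python 3 sketch. It does not call SimString at all; the helper ngrams is hypothetical and only illustrates the assumption that a Unicode-mode database compares strings character by character, while a byte-oriented database sees the raw UTF-8 bytes.

```python
# A Python 3 str holds Unicode characters; encode() yields the 8-bit form.
text = 'スパゲティ'
data = text.encode('utf-8')

print(len(text), len(data))    # 5 characters, 15 bytes


def ngrams(seq, n=3):
    """Return the distinct length-n slices of a str or bytes object.

    Hypothetical helper for illustration only; it is not part of SimString.
    """
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


# The same text yields different features depending on the unit used.
char_grams = ngrams(text)      # slices of 3 characters
byte_grams = ngrams(data)      # slices of 3 bytes
print(len(char_grams), len(byte_grams))

# Decoding the bytes as UTF-8 restores the original text.
assert data.decode('utf-8') == text
```

Because each of these characters occupies three bytes in UTF-8, a byte trigram covers only a single character, which is one reason a consistent encoding matters when strings are written and queried.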

Documentation

Language-specific Notes

Ruby

Sample Programs

Python

A basic sample.

#!/usr/bin/env python

import simstring

# Create a SimString database with two person names.
db = simstring.writer('sample.db')
db.insert('Barack Hussein Obama II')
db.insert('James Gordon Brown')
db.close()


# Open the database for reading.
db = simstring.reader('sample.db')

# Use cosine similarity and threshold 0.6.
db.measure = simstring.cosine
db.threshold = 0.6
print(db.retrieve('Barack Obama'))      # OK.
print(db.retrieve('Gordon Brown'))      # OK.
print(db.retrieve('Obama'))             # Too dissimilar!

# Use overlap coefficient and threshold 1.0.
db.measure = simstring.overlap
db.threshold = 1.
print(db.retrieve('Obama'))             # OK.
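
The comments in the sample above ("OK" and "Too dissimilar!") can be checked by hand. The following is a minimal sketch in plain Python 3 that does not use SimString. It assumes strings are compared as sets of character trigrams without begin/end marks, and it uses the textbook definitions of the cosine and overlap coefficients; the helper names trigrams, cosine, and overlap are hypothetical, and the library's actual feature extraction may differ in details.

```python
import math


def trigrams(text, n=3):
    """Return the set of character n-grams of text (no begin/end marks)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def cosine(x, y):
    """Cosine coefficient: |X & Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))


def overlap(x, y):
    """Overlap coefficient: |X & Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))


# Features of the database entry (strings shorter than n are not handled).
entry = trigrams('Barack Hussein Obama II')

for query in ('Barack Obama', 'Obama'):
    q = trigrams(query)
    print(query, round(cosine(q, entry), 3), round(overlap(q, entry), 3))
# Barack Obama 0.621 0.9
# Obama 0.378 1.0
```

Under these assumptions, 'Obama' shares all three of its trigrams with the entry, so the overlap coefficient reaches 1.0, but the entry has 21 trigrams, which pulls the cosine coefficient below the 0.6 threshold.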

A Unicode sample.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

"""
A Unicode sample.

We assume that the source code is written in UTF-8 encoding (see the
encoding declaration in line 2). We can use 8-bit strings as they are
with SimString.
"""

import simstring

# Open a SimString database for writing with Unicode mode.
db = simstring.writer('sample_unicode.db', 3, False, True)

# Write a string, and close the database.
db.insert('スパゲティ')
db.close()


# Open the SimString database for reading.
db = simstring.reader('sample_unicode.db')

# Set a similarity measure and threshold.
db.measure = simstring.cosine
db.threshold = 0.6

# Use an 8-bit string encoded in UTF-8.
print(' '.join(db.retrieve('スパゲティー')))

# Convert a Unicode object into a UTF-8 query string.
print(' '.join(db.retrieve(u'スパゲティー'.encode('utf-8'))))

Ruby

A basic sample.

#!/usr/bin/env ruby

require 'simstring'

# Create a SimString database with two person names.
db = Simstring::Writer.new('sample.db')
db.insert('Barack Hussein Obama II')
db.insert('James Gordon Brown')
db.close()


# Open the database for reading.
db = Simstring::Reader.new('sample.db')

# Use cosine similarity and threshold 0.6.
db.measure = Simstring::Cosine
db.threshold = 0.6
p(db.retrieve('Barack Obama'))      # OK.
p(db.retrieve('Gordon Brown'))      # OK.
p(db.retrieve('Obama'))             # Too dissimilar!

# Use overlap coefficient and threshold 1.0.
db.measure = Simstring::Overlap
db.threshold = 1
p(db.retrieve('Obama'))             # OK.

A Unicode sample.

#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

require 'simstring'

# Open a SimString database for writing with Unicode mode.
db = Simstring::Writer.new('sample_unicode.db', 3, false, true)

# Write a string, and close the database.
db.insert('スパゲティ')
db.close()


# Open the database for reading.
db = Simstring::Reader.new('sample_unicode.db')

# Set a similarity measure and threshold.
db.measure = Simstring::Cosine
db.threshold = 0.6

# Use an 8-bit string in UTF-8 encoding.
p(db.retrieve('スパゲティー'))

Copyright (c) 2002-2010 by Naoaki Okazaki
Sun Mar 7 18:18:45 2010