The SimString module provides two simple classes, writer and reader. With the writer class, one creates a SimString database using the constructor writer::writer and calls the member function writer::insert to insert a string into the database. With the reader class, one opens an existing SimString database with the constructor reader::reader, specifies a similarity measure and a threshold through the two attributes reader::measure and reader::threshold, and calls the member function reader::retrieve to perform approximate string matching.
The SimString module always uses 8-bit null-terminated byte streams in the writer::insert and reader::retrieve functions. The encoding of the byte streams can be arbitrary, but it must be UTF-8 for a database in Unicode mode.
#!/usr/bin/env python

import simstring

# Create a SimString database with two person names.
db = simstring.writer('sample.db')
db.insert('Barack Hussein Obama II')
db.insert('James Gordon Brown')
db.close()

# Open the database for reading.
db = simstring.reader('sample.db')

# Use cosine similarity and threshold 0.6.
db.measure = simstring.cosine
db.threshold = 0.6
print(db.retrieve('Barack Obama'))      # OK.
print(db.retrieve('Gordon Brown'))      # OK.
print(db.retrieve('Obama'))             # Too dissimilar!

# Use overlap coefficient and threshold 1.0.
db.measure = simstring.overlap
db.threshold = 1.
print(db.retrieve('Obama'))             # OK.
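To see why the query 'Obama' fails under cosine similarity at threshold 0.6 but succeeds under the overlap coefficient at threshold 1.0, the two measures can be computed by hand. The following is a minimal sketch in plain Python 3 that does not call SimString at all. It assumes features are plain sets of character trigrams without begin/end marks, which is a simplification and may differ from SimString's actual feature extraction; the helper names trigrams, cosine, and overlap are hypothetical and exist only for this illustration.

```python
import math


def trigrams(text, n=3):
    """Return the set of character n-grams of text (no begin/end marks)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def cosine(x, y):
    """Set-based cosine similarity: |X & Y| / sqrt(|X| * |Y|)."""
    X, Y = trigrams(x), trigrams(y)
    return len(X & Y) / math.sqrt(len(X) * len(Y))


def overlap(x, y):
    """Overlap coefficient: |X & Y| / min(|X|, |Y|)."""
    X, Y = trigrams(x), trigrams(y)
    return len(X & Y) / min(len(X), len(Y))


entry = 'Barack Hussein Obama II'
for query in ('Barack Obama', 'Obama'):
    # Print the query with both similarity values against the stored entry.
    print(query, round(cosine(query, entry), 3), round(overlap(query, entry), 3))
```

Under these assumptions, 'Barack Obama' scores about 0.62 on cosine similarity against the stored name, just above the 0.6 threshold, while 'Obama' scores about 0.38 and is rejected. The overlap coefficient for 'Obama' is exactly 1.0, because all three of its trigrams occur in the stored name, so it passes the threshold of 1.0.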
A Unicode sample.
#!/usr/bin/env python
# -*- coding:utf-8 -*-

"""
A Unicode sample.

We assume that the source code is written in UTF-8 encoding (see the encoding
declaration in line 2). We can use 8-bit strings as they are with SimString.
"""

import simstring

# Open a SimString database for writing with Unicode mode.
db = simstring.writer('sample_unicode.db', 3, False, True)

# Write a string, and close the database.
db.insert('スパゲティ')
db.close()

# Open the SimString database for reading.
db = simstring.reader('sample_unicode.db')

# Set a similarity measure and threshold.
db.measure = simstring.cosine
db.threshold = 0.6

# Use an 8-bit string encoded in UTF-8.
print ' '.join(db.retrieve('スパゲティー'))

# Convert a Unicode object into an UTF-8 query string.
print ' '.join(db.retrieve(u'スパゲティー'.encode('utf-8')))
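The sample above passes UTF-8 byte strings to a database opened in Unicode mode. The sketch below, written in plain Python 3 and independent of SimString, illustrates why the mode could matter for non-ASCII text: trigrams taken over bytes and trigrams taken over characters are very different feature sets. It assumes that Unicode mode builds n-grams over characters rather than bytes, which the text above does not spell out; the helper ngrams is hypothetical.

```python
# -*- coding: utf-8 -*-


def ngrams(seq, n=3):
    """Return the list of n-grams (slices of length n) of a str or bytes value."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


text = 'スパゲティ'           # 5 characters
data = text.encode('utf-8')   # each katakana character takes 3 bytes in UTF-8

char_grams = ngrams(text)     # trigrams over characters
byte_grams = ngrams(data)     # trigrams over raw bytes

print(len(text), len(char_grams))
print(len(data), len(byte_grams))
```

Most byte-level trigrams here straddle character boundaries and do not correspond to any readable unit, whereas each character-level trigram covers three whole katakana characters.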
#!/usr/bin/env ruby

require 'simstring'

# Create a SimString database with two person names.
db = Simstring::Writer.new('sample.db')
db.insert('Barack Hussein Obama II')
db.insert('James Gordon Brown')
db.close()

# Open the database for reading.
db = Simstring::Reader.new('sample.db')

# Use cosine similarity and threshold 0.6.
db.measure = Simstring::Cosine
db.threshold = 0.6
p(db.retrieve('Barack Obama'))      # OK.
p(db.retrieve('Gordon Brown'))      # OK.
p(db.retrieve('Obama'))             # Too dissimilar!

# Use overlap coefficient and threshold 1.0.
db.measure = Simstring::Overlap
db.threshold = 1
p(db.retrieve('Obama'))             # OK.
A Unicode sample.
#!/usr/bin/env ruby -Ku

require 'simstring'

# Open a SimString database for writing with Unicode mode.
db = Simstring::Writer.new('sample_unicode.db', 3, false, true)

# Write a string, and close the database.
db.insert('スパゲティ')
db.close()

# Open the database for reading.
db = Simstring::Reader.new('sample_unicode.db')

# Set a similarity measure and threshold.
db.measure = Simstring::Cosine
db.threshold = 0.6

# Use an 8-bit string in UTF-8 encoding.
p(db.retrieve('スパゲティー'))