CRFsuite: A fast implementation of Conditional Random Fields (CRFs)

Introduction

The CRFsuite C++/SWIG API provides a high-level, easy-to-use library module for a number of programming languages. It is a wrapper around the CRFsuite C API.

API documentation

C++ API

The C++ library is implemented in two header files, crfsuite_api.hpp and crfsuite.hpp. The C++ API can be used by including only crfsuite.hpp. The C++ library depends on the CRFsuite C library, so the C header file (crfsuite.h) and the libcrfsuite library are also required.

SWIG API

The SWIG API is identical to the C++ API. Currently, the CRFsuite distribution includes a Python module for CRFsuite. Please read the README in the swig/python directory for instructions on building the Python module.

Sample code

This code demonstrates how to use the crfsuite.Trainer object. The script reads training data from STDIN, trains a model using the 'l2sgd' algorithm, and stores the model in a file (the first argument of the command line).

#!/usr/bin/env python

import crfsuite
import sys

# Inherit crfsuite.Trainer to implement message() function, which receives
# progress messages from a training process.
class Trainer(crfsuite.Trainer):
    def message(self, s):
        # Simply output the progress messages to STDOUT.
        sys.stdout.write(s)

def instances(fi):
    xseq = crfsuite.ItemSequence()
    yseq = crfsuite.StringList()
    
    for line in fi:
        line = line.strip('\n')
        if not line:
            # An empty line marks the end of a sequence.
            yield xseq, tuple(yseq)
            xseq = crfsuite.ItemSequence()
            yseq = crfsuite.StringList()
            continue

        # Split the line with TAB characters.
        fields = line.split('\t')
        
        # Append attributes to the item.
        item = crfsuite.Item()
        for field in fields[1:]:
            p = field.rfind(':')
            if p == -1:
                # Unweighted (weight=1) attribute.
                item.append(crfsuite.Attribute(field))
            else:
                # Weighted attribute
                item.append(crfsuite.Attribute(field[:p], float(field[p+1:])))
        
        # Append the item to the item sequence.
        xseq.append(item)
        # Append the label to the label sequence.
        yseq.append(fields[0])

if __name__ == '__main__':
    # This demonstrates how to obtain the version string of CRFsuite.
    print(crfsuite.version())

    # Create a Trainer object.
    trainer = Trainer()
    
    # Read training instances from STDIN, and set them to trainer.
    for xseq, yseq in instances(sys.stdin):
        trainer.append(xseq, yseq, 0)

    # Use L2-regularized SGD and 1st-order dyad features.
    trainer.select('l2sgd', 'crf1d')
    
    # This demonstrates how to list parameters and obtain their values.
    for name in trainer.params():
        print('%s %s %s' % (name, trainer.get(name), trainer.help(name)))
    
    # Set the coefficient for L2 regularization to 0.1
    trainer.set('c2', '0.1')
    
    # Start training; the training process will invoke trainer.message()
    # to report the progress.
    trainer.train(sys.argv[1], -1)
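
The input format that instances() expects can be read off the parsing code above: one item per line, with the label in the first TAB-separated field and the attributes in the remaining fields, each optionally followed by ':' and a weight; an empty line ends a sequence. The following is a minimal sketch in plain Python (it does not import crfsuite) that parses the same format into lists of tuples, which may be handy for checking training data before handing it to the trainer. The function name parse_sequences and the attribute names in the sample are hypothetical and chosen only for illustration.

```python
def parse_sequences(lines):
    """Yield (items, labels) pairs from lines in the TAB-separated format.

    Hypothetical helper for illustration: it mirrors the parsing logic of
    instances() but builds plain Python lists instead of crfsuite objects.
    """
    items, labels = [], []
    for line in lines:
        line = line.strip('\n')
        if not line:
            # An empty line marks the end of a sequence.
            yield items, labels
            items, labels = [], []
            continue

        fields = line.split('\t')
        # The first field is the label; the rest are attributes.
        labels.append(fields[0])
        item = []
        for field in fields[1:]:
            p = field.rfind(':')
            if p == -1:
                # Unweighted attribute: the weight defaults to 1.
                item.append((field, 1.0))
            else:
                # Weighted attribute: text after the last ':' is the weight.
                item.append((field[:p], float(field[p+1:])))
        items.append(item)


# Sample data with made-up attribute names; the trailing empty line
# terminates the sequence.
sample = [
    "B-NP\tw=the\tpos=DT\n",
    "I-NP\tw=cat\tcap:0.5\n",
    "\n",
]
parsed = list(parse_sequences(sample))
```

As in the script above, a sequence is emitted only when an empty line follows it, so input that ends without a trailing empty line loses its last sequence.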

This code demonstrates how to use the crfsuite.Tagger object. The script loads a model from a file (the first argument of the command line), reads data from STDIN, and predicts a label sequence for each input sequence.

#!/usr/bin/env python

import crfsuite
import sys

def instances(fi):
    xseq = crfsuite.ItemSequence()
    
    for line in fi:
        line = line.strip('\n')
        if not line:
            # An empty line marks the end of a sequence.
            yield xseq
            xseq = crfsuite.ItemSequence()
            continue

        # Split the line with TAB characters.
        fields = line.split('\t')
        item = crfsuite.Item()
        for field in fields[1:]:
            p = field.rfind(':')
            if p == -1:
                # Unweighted (weight=1) attribute.
                item.append(crfsuite.Attribute(field))
            else:
                # Weighted attribute
                item.append(crfsuite.Attribute(field[:p], float(field[p+1:])))

        # Append the item to the item sequence.
        xseq.append(item)

if __name__ == '__main__':
    fi = sys.stdin
    fo = sys.stdout

    # Create a tagger object.
    tagger = crfsuite.Tagger()
    
    # Load the model to the tagger.
    tagger.open(sys.argv[1])

    for xseq in instances(fi):
        # Tag the sequence.
        tagger.set(xseq)
        # Obtain the label sequence predicted by the tagger.
        yseq = tagger.viterbi()
        # Output the probability of the predicted label sequence.
        print(tagger.probability(yseq))
        for t, y in enumerate(yseq):
            # Output the predicted labels with their marginal probabilities.
            print('%s:%f' % (y, tagger.marginal(y, t)))
        print('')
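
The tagging script above writes, for each sequence, one line with the probability of the predicted label sequence, then one 'label:marginal' line per item, then an empty line. As a rough sketch of how that output could be read back by a downstream script, the plain-Python helper below parses it into tuples. The name parse_tagger_output is hypothetical, and the numbers in the sample are invented for illustration, not real tagger output.

```python
def parse_tagger_output(lines):
    """Yield (sequence_probability, [(label, marginal), ...]) per sequence.

    Hypothetical helper for illustration; it assumes the output layout
    produced by the tagging script above.
    """
    prob, labels = None, []
    for line in lines:
        line = line.strip('\n')
        if not line:
            # An empty line closes the current sequence.
            if prob is not None:
                yield prob, labels
            prob, labels = None, []
        elif prob is None:
            # The first line of a block is the sequence probability.
            prob = float(line)
        else:
            # Remaining lines are 'label:marginal'; split at the last ':'
            # so that labels containing ':' are kept intact.
            p = line.rfind(':')
            labels.append((line[:p], float(line[p+1:])))


# Invented sample output for a two-item sequence.
sample_output = [
    "0.42\n",
    "B-NP:0.900000\n",
    "I-NP:0.800000\n",
    "\n",
]
results = list(parse_tagger_output(sample_output))
```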