Text chunking divides a text into syntactically correlated parts of words. For example, the sentence “He reckons the current account deficit will narrow to only # 1.8 billion in September.” can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
In this example, NP stands for a noun phrase, VP for a verb phrase, and PP for a prepositional phrase. This task is formalized as a sequential labeling task in which each token in a text is assigned a label. To represent a chunk (a span of tokens) with labels, we often use the IOB2 notation. In IOB2, an NP chunk is represented by a label marking the beginning of the chunk (B-NP) followed by labels marking the tokens inside it (I-NP). Tokens that do not belong to any chunk receive the label O.
B-NP He
B-VP reckons
B-NP the
I-NP current
I-NP account
I-NP deficit
B-VP will
I-VP narrow
B-PP to
B-NP only
I-NP #
I-NP 1.8
I-NP billion
B-PP in
B-NP September
O .
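The relation between IOB2 labels and chunks can be made concrete with a short Python sketch that recovers the chunk spans from a label sequence. The function name iob2_to_chunks is hypothetical, and treating a stray I- label (one that does not continue a chunk of the same type) as the start of a new chunk is an assumption of this sketch, not something the notation prescribes.

```python
def iob2_to_chunks(labels):
    """Convert IOB2 labels into (type, start, end) spans; end is exclusive."""
    chunks = []
    start, ctype = None, None
    for i, label in enumerate(labels):
        # An I- label continues the open chunk only when the types match.
        if label.startswith('I-') and ctype == label[2:]:
            continue
        # Any other label closes the chunk that is currently open.
        if ctype is not None:
            chunks.append((ctype, start, i))
            start, ctype = None, None
        # B- opens a new chunk; a stray I- is treated like B- (assumption).
        if label.startswith('B-') or label.startswith('I-'):
            start, ctype = i, label[2:]
    # Close a chunk that runs to the end of the sequence.
    if ctype is not None:
        chunks.append((ctype, start, len(labels)))
    return chunks


tokens = ('He reckons the current account deficit will narrow to '
          'only # 1.8 billion in September .').split()
labels = ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP', 'I-VP',
          'B-PP', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'O']

chunks = iob2_to_chunks(labels)
for ctype, start, end in chunks:
    print('[%s %s]' % (ctype, ' '.join(tokens[start:end])))
```

Run on the example sentence, the sketch prints the same eight bracketed chunks as the segmentation shown earlier; the final period carries the O label and belongs to no chunk.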
The goal of this tutorial is to build a model that predicts chunk labels for a given sentence (sequence of tokens) by using CRFsuite.
This tutorial uses the training and testing data distributed by the CoNLL 2000 shared task.
Necessary scripts for this tutorial are included under example directory in the CRFsuite distribution.
First, change the current directory to the example directory and download the training and testing data from the CoNLL 2000 website:
$ cd example
$ wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz
$ wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz
$ less train.txt.gz
... (snip) ...
London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP
. . O

At IN B-PP
Tokyo NNP B-NP
, , O
the DT B-NP
Nikkei NNP I-NP
index NN I-NP
of IN B-PP
225 CD B-NP
selected VBN I-NP
issues NNS I-NP
was VBD B-VP
up IN B-ADVP
112.16 CD B-NP
points NNS I-NP
to TO B-PP
35486.38 CD B-NP
. . O

... (snip) ...
The data consists of a set of sentences (sequences) each of which contains a series of words (e.g., 'London', 'shares'), part-of-speech tags (e.g., 'JJ', 'NNS'), and chunk labels (e.g., 'B-NP', 'I-NP') separated by space characters. In this tutorial, we would like to construct a CRF model that assigns a sequence of chunk labels, given a sequence of words and part-of-speech codes. Please refer to CoNLL 2000 shared task website for more information about the data set.
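To make the data layout concrete, here is a minimal Python sketch that reads such lines into sentences. The function name read_sentences and the field names 'w', 'pos', and 'y' are choices made for this illustration, and the sketch assumes that an empty line ends each sentence.

```python
def read_sentences(lines, fields=('w', 'pos', 'y'), sep=' '):
    """Yield sentences as lists of {field: value} dictionaries."""
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            # An empty line ends the current sentence (assumption).
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split(sep)
        if len(values) != len(fields):
            raise ValueError('expected %d fields: %r' % (len(fields), line))
        sentence.append(dict(zip(fields, values)))
    # Emit the last sentence when the input lacks a trailing empty line.
    if sentence:
        yield sentence


sample = """London JJ B-NP
shares NNS I-NP
closed VBD B-VP

At IN B-PP
Tokyo NNP B-NP
"""
sentences = list(read_sentences(sample.splitlines()))
print(len(sentences), sentences[0][0])
```

For the real files, the lines could come from gzip.open('train.txt.gz', 'rt') in Python 3 instead of the inline sample.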
The next step is to preprocess the training and testing data to extract attributes that express the characteristics of words (items) in the data. CRFsuite internally generates features from attributes in a data set. In general, this is the most important process for machine-learning approaches because feature design greatly affects the labeling accuracy. In this tutorial, we extract 19 kinds of attributes from a word at position t (an offset from the beginning of a sequence):
- w[t-2], w[t-1], w[t], w[t+1], w[t+2],
- w[t-1]|w[t], w[t]|w[t+1],
- pos[t-2], pos[t-1], pos[t], pos[t+1], pos[t+2],
- pos[t-2]|pos[t-1], pos[t-1]|pos[t], pos[t]|pos[t+1], pos[t+1]|pos[t+2],
- pos[t-2]|pos[t-1]|pos[t], pos[t-1]|pos[t]|pos[t+1], pos[t]|pos[t+1]|pos[t+2]
In this list, w[t] and pos[t] represent the word and part-of-speech, respectively, at position t in a sequence.
These attributes express the characteristics of the word at position t by using information from surrounding words, e.g., w[t-1] and pos[t+1].
For example, the token 'the' in the following example,
        He PRP B-NP
        reckons VBZ B-VP
  t --> the DT B-NP
        current JJ I-NP
        account NN I-NP
obtains these attributes (position t is omitted for simplicity),
- w[-2]=He, w[-1]=reckons, w[0]=the, w[1]=current, w[2]=account
- w[-1]|w[0]=reckons|the, w[0]|w[1]=the|current
- pos[-2]=PRP, pos[-1]=VBZ, pos[0]=DT, pos[1]=JJ, pos[2]=NN
- pos[-2]|pos[-1]=PRP|VBZ, pos[-1]|pos[0]=VBZ|DT, pos[0]|pos[1]=DT|JJ, pos[1]|pos[2]=JJ|NN
- pos[-2]|pos[-1]|pos[0]=PRP|VBZ|DT, pos[-1]|pos[0]|pos[1]=VBZ|DT|JJ, pos[0]|pos[1]|pos[2]=DT|JJ|NN
In this example, the attribute "w[0]=the" represents the event where the current token is "the", and the attribute "pos[0]|pos[1]|pos[2]=DT|JJ|NN" represents the event where the parts-of-speech of the current word, the next word, and the word two positions ahead are DT, JJ, and NN, respectively. CRFsuite will learn associations between these attributes (e.g., "pos[0]|pos[1]|pos[2]=DT|JJ|NN") and labels (e.g., "B-NP") to predict a label sequence for a given text. Please note that an attribute need not follow the convention "name=value", e.g., "w[0]=the". CRFsuite accepts any string as an attribute name as long as the string does not contain a colon character (which is used to separate an attribute name from its weight). The convention "name=value" is merely a convenience for interpreting attribute names.
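The 19 kinds of attributes can be generated with a few lines of Python. The following is only a sketch: the names attribute, attributes_at, and WINDOWS are hypothetical, and the sketch assumes that an attribute is simply dropped when its window reaches beyond either end of the sequence.

```python
def attribute(seq, t, field, offsets):
    """Build one attribute such as 'pos[0]|pos[1]=DT|JJ', or None."""
    names, values = [], []
    for o in offsets:
        p = t + o
        if p < 0 or p >= len(seq):
            return None  # the window runs past the sequence boundary
        names.append('%s[%d]' % (field, o))
        values.append(seq[p][field])
    return '%s=%s' % ('|'.join(names), '|'.join(values))


# Offsets of the 7 word windows and 12 part-of-speech windows listed above.
WINDOWS = {
    'w': [(-2,), (-1,), (0,), (1,), (2,), (-1, 0), (0, 1)],
    'pos': [(-2,), (-1,), (0,), (1,), (2,),
            (-2, -1), (-1, 0), (0, 1), (1, 2),
            (-2, -1, 0), (-1, 0, 1), (0, 1, 2)],
}


def attributes_at(seq, t):
    """Return all attributes that are defined for position t."""
    attrs = []
    for field in ('w', 'pos'):
        for offsets in WINDOWS[field]:
            a = attribute(seq, t, field, offsets)
            if a is not None:
                attrs.append(a)
    return attrs


seq = [{'w': w, 'pos': p} for w, p in [
    ('He', 'PRP'), ('reckons', 'VBZ'), ('the', 'DT'),
    ('current', 'JJ'), ('account', 'NN')]]
attrs = attributes_at(seq, 2)
print(len(attrs))
print(attrs[5], attrs[-1])
```

For the token 'the' (t = 2), every window fits inside the five-token sequence, so all 19 attributes are produced; for the first token, only the windows that do not look to the left survive.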
CRFsuite requires a data set in which an item line begins with its label, followed by its attributes separated by TAB ('\t') characters (see documentation for more information).
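As a rough illustration of this layout, the following Python sketch serializes one item into a line. The helper names escape, format_item, and format_sequence are hypothetical, replacing a colon with the placeholder '__COLON__' is an assumption made here to keep attribute names free of colons, and ending each sequence with an empty line is likewise assumed.

```python
def escape(name):
    """Keep colons out of attribute names (placeholder is an assumption)."""
    return name.replace(':', '__COLON__')


def format_item(label, attrs):
    """Return one item line: the label, then TAB-separated attributes."""
    fields = [label]
    for a in attrs:
        if isinstance(a, tuple):
            # A (name, weight) pair is written as name:weight.
            name, weight = a
            fields.append('%s:%f' % (escape(name), weight))
        else:
            fields.append(escape(a))
    return '\t'.join(fields)


def format_sequence(labels, attr_lists):
    """Format a whole sequence, terminated by an empty line (assumption)."""
    lines = [format_item(y, a) for y, a in zip(labels, attr_lists)]
    return '\n'.join(lines) + '\n\n'


line = format_item('B-NP', ['w[0]=He', 'pos[0]=PRP', '__BOS__'])
print(line)
```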
It is not difficult to implement the conversion from the training/testing data to CRFsuite data.
As an implementation of the conversion, the CRFsuite distribution includes a Python script chunking.py that generates attributes from the CoNLL 2000 data.
The procedure below converts train.txt.gz and test.txt.gz into train.crfsuite.txt and test.crfsuite.txt that are compatible with the CRFsuite data format.
$ zcat train.txt.gz | ./chunking.py > train.crfsuite.txt
$ zcat test.txt.gz | ./chunking.py > test.crfsuite.txt
$ less train.crfsuite.txt
... (snip) ...
B-NP    w[0]=He w[1]=reckons    w[2]=the        w[0]|w[1]=He|reckons    pos[0]=P
RP      pos[1]=VBZ      pos[2]=DT       pos[0]|pos[1]=PRP|VBZ   pos[1]|pos[2]=VB
Z|DT    pos[0]|pos[1]|pos[2]=PRP|VBZ|DT __BOS__
B-VP    w[-1]=He        w[0]=reckons    w[1]=the        w[2]=current    w[-1]|w[
0]=He|reckons   w[0]|w[1]=reckons|the   pos[-1]=PRP     pos[0]=VBZ      pos[1]=D
T       pos[2]=JJ       pos[-1]|pos[0]=PRP|VBZ  pos[0]|pos[1]=VBZ|DT    pos[1]|p
os[2]=DT|JJ     pos[-1]|pos[0]|pos[1]=PRP|VBZ|DT        pos[0]|pos[1]|pos[2]=VBZ
|DT|JJ
B-NP    w[-2]=He        w[-1]=reckons   w[0]=the        w[1]=current    w[2]=acc
ount    w[-1]|w[0]=reckons|the  w[0]|w[1]=the|current   pos[-2]=PRP     pos[-1]=
VBZ     pos[0]=DT       pos[1]=JJ       pos[2]=NN       pos[-2]|pos[-1]=PRP|VBZ 
pos[-1]|pos[0]=VBZ|DT   pos[0]|pos[1]=DT|JJ     pos[1]|pos[2]=JJ|NN     pos[-2]|
pos[-1]|pos[0]=PRP|VBZ|DT       pos[-1]|pos[0]|pos[1]=VBZ|DT|JJ pos[0]|pos[1]|po
s[2]=DT|JJ|NN
I-NP    w[-2]=reckons   w[-1]=the       w[0]=current    w[1]=account    w[2]=def
icit    w[-1]|w[0]=the|current  w[0]|w[1]=current|account       pos[-2]=VBZ
     pos[-1]=DT      pos[0]=JJ       pos[1]=NN       pos[2]=NN       pos[-2]|pos
[-1]=VBZ|DT  pos[-1]|pos[0]=DT|JJ    pos[0]|pos[1]=JJ|NN     pos[1]|pos[2]=NN|NN
     pos[-2]|pos[-1]|pos[0]=VBZ|DT|JJ        pos[-1]|pos[0]|pos[1]=DT|JJ|NN  pos
[0]|pos[1]|pos[2]=JJ|NN|NN
I-NP    w[-2]=the       w[-1]=current   w[0]=account    w[1]=deficit    w[2]=wil
l       w[-1]|w[0]=current|account      w[0]|w[1]=account|deficit       pos[-2]=
DT      pos[-1]=JJ      pos[0]=NN       pos[1]=NN       pos[2]=MD       pos[-2]|
pos[-1]=DT|JJ   pos[-1]|pos[0]=JJ|NN    pos[0]|pos[1]=NN|NN     pos[1]|pos[2]=NN
|MD     pos[-2]|pos[-1]|pos[0]=DT|JJ|NN pos[-1]|pos[0]|pos[1]=JJ|NN|NN  pos[0]|p
os[1]|pos[2]=NN|NN|MD
... (snip) ...
          
Now we are ready to use CRFsuite for training.
Simply type the following command to train a CRF model from train.crfsuite.txt.
CRFsuite will read the training data, generate necessary state (attribute-label) and transition (label bigram) features based on the data, maximize the log-likelihood of the conditional probability distribution, and store the model into CoNLL2000.model.
$ crfsuite learn -m CoNLL2000.model train.crfsuite.txt
CRFSuite 0.12 Copyright (c) 2007-2011 Naoaki Okazaki
Start time of the training: 2011-06-25T14:52:13Z
Reading the data set(s)
[1] train.crfsuite.txt
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 8937
Seconds required: 5.890
Statistics the data set(s)
Number of data sets (groups): 1
Number of instances: 8936
Number of items: 211727
Number of attributes: 335674
Number of labels: 22
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 452755
Seconds required: 2.140
L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20
***** Iteration #1 *****
Log-likelihood: -275528.648286
Feature norm: 5.000000
Error norm: 44363.015822
Active features: 452755
Line search trials: 2
Line search step: 0.000050
Seconds required for this iteration: 4.860
***** Iteration #2 *****
Log-likelihood: -164450.778877
Feature norm: 9.067189
Error norm: 26619.939310
Active features: 452755
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.630
... (snip) ...
***** Iteration #165 *****
Log-likelihood: -13139.375165
Feature norm: 81.074163
Error norm: 2.638386
Active features: 452755
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.660
L-BFGS terminated with the stopping criteria
Total seconds required for training: 293.610
Storing the model
Number of active features: 452755 (452755)
Number of active attributes: 335674 (335674)
Number of active labels: 22 (22)
Writing labels
Writing attributes
Writing feature references for transitions
Writing feature references for attributes
Seconds required: 0.730
End time of the training: 2011-06-25T14:57:15Z
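To follow how the log-likelihood improves over the iterations, the training log can be post-processed with a small script. The sketch below is not part of CRFsuite; the function name log_likelihoods is hypothetical, and it assumes the log has been saved with one entry per line, as the trainer prints it.

```python
import re

ITER_RE = re.compile(r'^\*{5} Iteration #(\d+) \*{5}$')
LL_RE = re.compile(r'^Log-likelihood: (-?\d+\.\d+)$')


def log_likelihoods(lines):
    """Return [(iteration, log_likelihood), ...] from a training log."""
    result, current = [], None
    for line in lines:
        line = line.strip()
        m = ITER_RE.match(line)
        if m:
            # Remember which iteration the following lines belong to.
            current = int(m.group(1))
            continue
        m = LL_RE.match(line)
        if m and current is not None:
            result.append((current, float(m.group(1))))
            current = None
    return result


log = """***** Iteration #1 *****
Log-likelihood: -275528.648286
Feature norm: 5.000000
***** Iteration #2 *****
Log-likelihood: -164450.778877
Feature norm: 9.067189
""".splitlines()

history = log_likelihoods(log)
for iteration, ll in history:
    print(iteration, ll)
```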
You can also train a CRF model while watching its performance (accuracy, precision, recall, F1 score) evaluated on the test data.
It must be exciting to see your model improve as the training process advances!
The following command line performs a holdout evaluation on data set #2 (test.crfsuite.txt) by using the -e2 option.
$ crfsuite learn -e2 train.crfsuite.txt test.crfsuite.txt
CRFSuite 0.12  Copyright (c) 2007-2011 Naoaki Okazaki
Start time of the training: 2011-06-25T16:07:40Z
Reading the data set(s)
[1] train.crfsuite.txt
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 8937
Seconds required: 5.870
[2] test.crfsuite.txt
0....1....2....3....4....5....6....7....8....9....10
Number of instances: 2013
Seconds required: 1.370
Statistics the data set(s)
Number of data sets (groups): 2
Number of instances: 10948
Number of items: 259104
Number of attributes: 387579
Number of labels: 23
Holdout group: 2
Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 452755
Seconds required: 2.150
L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20
***** Iteration #1 *****
Log-likelihood: -279935.059188
Feature norm: 5.000000
Error norm: 45136.783202
Active features: 452755
Line search trials: 2
Line search step: 0.000050
Seconds required for this iteration: 5.230
Performance by label (#match, #model, #ref) (precision, recall, F1):
    B-NP: (8312, 10265, 12422) (0.8097, 0.6691, 0.7328)
    B-PP: (3986, 6030, 4811) (0.6610, 0.8285, 0.7354)
    I-NP: (14116, 27744, 14376) (0.5088, 0.9819, 0.6703)
    B-VP: (0, 0, 4658) (0.0000, 0.0000, 0.0000)
    I-VP: (0, 0, 2646) (0.0000, 0.0000, 0.0000)
    B-SBAR: (0, 0, 535) (0.0000, 0.0000, 0.0000)
    O: (3298, 3338, 6180) (0.9880, 0.5337, 0.6930)
    B-ADJP: (0, 0, 438) (0.0000, 0.0000, 0.0000)
    B-ADVP: (0, 0, 866) (0.0000, 0.0000, 0.0000)
    I-ADVP: (0, 0, 89) (0.0000, 0.0000, 0.0000)
    I-ADJP: (0, 0, 167) (0.0000, 0.0000, 0.0000)
    I-SBAR: (0, 0, 4) (0.0000, 0.0000, 0.0000)
    I-PP: (0, 0, 48) (0.0000, 0.0000, 0.0000)
    B-PRT: (0, 0, 106) (0.0000, 0.0000, 0.0000)
    B-LST: (0, 0, 5) (0.0000, 0.0000, 0.0000)
    B-INTJ: (0, 0, 2) (0.0000, 0.0000, 0.0000)
    I-INTJ: (0, 0, 0) (******, ******, ******)
    B-CONJP: (0, 0, 9) (0.0000, 0.0000, 0.0000)
    I-CONJP: (0, 0, 13) (0.0000, 0.0000, 0.0000)
    I-PRT: (0, 0, 0) (******, ******, ******)
    B-UCP: (0, 0, 0) (******, ******, ******)
    I-UCP: (0, 0, 0) (******, ******, ******)
    I-LST: (0, 0, 2) (0.0000, 0.0000, 0.0000)
Macro-average precision, recall, F1: (0.129025, 0.131010, 0.123104)
Item accuracy: 29712 / 47377 (0.6271)
Instance accuracy: 23 / 2012 (0.0114)
... (snip) ...
***** Iteration #162 *****
Log-likelihood: -13143.933308
Feature norm: 81.103204
Error norm: 2.207139
Active features: 452755
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 1.980
Performance by label (#match, #model, #ref) (precision, recall, F1):
    B-NP: (12013, 12374, 12422) (0.9708, 0.9671, 0.9689)
    B-PP: (4713, 4878, 4811) (0.9662, 0.9796, 0.9729)
    I-NP: (13998, 14497, 14376) (0.9656, 0.9737, 0.9696)
    B-VP: (4470, 4668, 4658) (0.9576, 0.9596, 0.9586)
    I-VP: (2551, 2700, 2646) (0.9448, 0.9641, 0.9544)
    B-SBAR: (449, 499, 535) (0.8998, 0.8393, 0.8685)
    O: (5945, 6122, 6180) (0.9711, 0.9620, 0.9665)
    B-ADJP: (322, 403, 438) (0.7990, 0.7352, 0.7658)
    B-ADVP: (711, 836, 866) (0.8505, 0.8210, 0.8355)
    I-ADVP: (54, 82, 89) (0.6585, 0.6067, 0.6316)
    I-ADJP: (110, 137, 167) (0.8029, 0.6587, 0.7237)
    I-SBAR: (2, 15, 4) (0.1333, 0.5000, 0.2105)
    I-PP: (34, 42, 48) (0.8095, 0.7083, 0.7556)
    B-PRT: (80, 102, 106) (0.7843, 0.7547, 0.7692)
    B-LST: (0, 0, 5) (0.0000, 0.0000, 0.0000)
    B-INTJ: (1, 1, 2) (1.0000, 0.5000, 0.6667)
    I-INTJ: (0, 0, 0) (******, ******, ******)
    B-CONJP: (5, 8, 9) (0.6250, 0.5556, 0.5882)
    I-CONJP: (10, 13, 13) (0.7692, 0.7692, 0.7692)
    I-PRT: (0, 0, 0) (******, ******, ******)
    B-UCP: (0, 0, 0) (******, ******, ******)
    I-UCP: (0, 0, 0) (******, ******, ******)
    I-LST: (0, 0, 2) (0.0000, 0.0000, 0.0000)
Macro-average precision, recall, F1: (0.604705, 0.576296, 0.581536)
Item accuracy: 45468 / 47377 (0.9597)
Instance accuracy: 1176 / 2012 (0.5845)
L-BFGS terminated with the stopping criteria
Total seconds required for training: 339.800
End time of the training: 2011-06-25T16:13:29Z
This log message reports that the CRF model obtained from the training data achieved 95.97% item accuracy.
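The per-label figures in the log follow from the three counts printed for each label. As a minimal sketch, assuming the usual definitions (precision = #match / #model, recall = #match / #ref, F1 as their harmonic mean) and Python 3 division, the B-NP row and the item accuracy of the final iteration can be recomputed as follows. The function name prf is hypothetical, and returning 0.0 for a zero denominator is a simplification of this sketch.

```python
def prf(match, model, ref):
    """Return (precision, recall, F1) from the three per-label counts."""
    precision = match / model if model else 0.0
    recall = match / ref if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts for B-NP at iteration #162: (#match, #model, #ref).
p, r, f = prf(12013, 12374, 12422)
print('%.4f %.4f %.4f' % (p, r, f))

# Item accuracy: correctly labeled items over all items in the test data.
item_accuracy = 45468 / 47377
print('%.4f' % item_accuracy)
```

The rounded values agree with the B-NP row (0.9708, 0.9671, 0.9689) and with the reported item accuracy of 0.9597.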
You can apply the CRF model to the test data to tag it with chunk labels. Even though the test data distributed by the CoNLL 2000 shared task has chunk labels annotated (for evaluation purposes), CRFsuite ignores the existing labels and outputs the label sequences predicted by the model (one label per line; sequences delimited by empty lines).
$ cat test.crfsuite.txt
B-NP    w[0]=Rockwell   w[1]=International      w[2]=Corp.      w[0]|w[1]=Rockwe
ll|International        pos[0]=NNP      pos[1]=NNP      pos[2]=NNP      pos[0]|p
os[1]=NNP|NNP   pos[1]|pos[2]=NNP|NNP   pos[0]|pos[1]|pos[2]=NNP|NNP|NNP
        __BOS__
I-NP    w[-1]=Rockwell  w[0]=International      w[1]=Corp.      w[2]='s w[-1]|w[
0]=Rockwell|International       w[0]|w[1]=International|Corp.   pos[-1]=NNP
     pos[0]=NNP      pos[1]=NNP      pos[2]=POS      pos[-1]|pos[0]=NNP|NNP  pos
[0]|pos[1]=NNP|NNP   pos[1]|pos[2]=NNP|POS   pos[-1]|pos[0]|pos[1]=NNP|NNP|NNP  
     pos[0]|pos[1]|pos[2]=NNP|NNP|POS
... (snip) ...
$ crfsuite tag -m CoNLL2000.model test.crfsuite.txt
B-NP
I-NP
I-NP
B-NP
I-NP
I-NP
B-VP
B-NP
B-VP
B-NP
I-NP
I-NP
B-VP
B-NP
I-NP
B-PP
B-NP
I-NP
B-VP
I-VP
B-NP
I-NP
B-PP
B-NP
B-NP
I-NP
I-NP
O
... (snip) ...
            
CRFsuite can output both reference labels (in test.crfsuite.txt) and predicted labels separated by TAB characters.
In this example, the left label in each line is the reference label written in the input data (test.crfsuite.txt), and the right label is the predicted label.
This functionality may be useful for evaluating tagging results.
$ crfsuite tag -r -m CoNLL2000.model test.crfsuite.txt
B-NP	B-NP
I-NP	I-NP
I-NP	I-NP
B-NP	B-NP
I-NP	I-NP
I-NP	I-NP
B-VP	B-VP
B-NP	B-NP
B-VP	B-VP
B-NP	B-NP
I-NP	I-NP
I-NP	I-NP
B-VP	B-VP
B-NP	B-NP
I-NP	I-NP
B-PP	B-PP
B-NP	B-NP
I-NP	I-NP
B-VP	B-VP
I-VP	I-VP
B-NP	B-NP
I-NP	I-NP
B-PP	B-PP
B-NP	B-NP
B-NP	B-NP
I-NP	I-NP
I-NP	I-NP
O	O
... (snip) ...
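Output in this two-column form is easy to score with a script of your own. The following Python sketch computes item accuracy and instance accuracy; the function name accuracy is hypothetical, and the sketch assumes that each non-empty line holds a reference label and a predicted label separated by a TAB character and that empty lines separate sequences.

```python
def accuracy(lines):
    """Return (item_accuracy, instance_accuracy) for 'ref<TAB>pred' lines."""
    items = item_ok = instances = instance_ok = 0
    in_seq, seq_ok = False, True
    # The extra empty line closes a final sequence that lacks a delimiter.
    for line in list(lines) + ['']:
        line = line.strip()
        if not line:
            if in_seq:
                instances += 1
                instance_ok += int(seq_ok)
            in_seq, seq_ok = False, True
            continue
        ref, pred = line.split('\t')
        in_seq = True
        items += 1
        if ref == pred:
            item_ok += 1
        else:
            seq_ok = False  # one wrong label makes the whole instance wrong
    item_acc = item_ok / items if items else 0.0
    inst_acc = instance_ok / instances if instances else 0.0
    return item_acc, inst_acc


output = 'B-NP\tB-NP\nI-NP\tI-NP\n\nB-VP\tB-NP\nO\tO\n'
item_acc, inst_acc = accuracy(output.splitlines())
print(item_acc, inst_acc)
```

In this toy input, three of four items are correct and one of two sequences is labeled without error.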
CRFsuite can also evaluate the CRF model on labeled test data by using the "-qt" options.
$ crfsuite tag -qt -m CoNLL2000.model test.crfsuite.txt
Performance by label (#match, #model, #ref) (precision, recall, F1):
    B-NP: (12000, 12358, 12407) (0.9710, 0.9672, 0.9691)
    B-PP: (4707, 4872, 4805) (0.9661, 0.9796, 0.9728)
    I-NP: (13984, 14484, 14359) (0.9655, 0.9739, 0.9697)
    B-VP: (4466, 4662, 4653) (0.9580, 0.9598, 0.9589)
    I-VP: (2549, 2698, 2643) (0.9448, 0.9644, 0.9545)
    B-SBAR: (448, 498, 534) (0.8996, 0.8390, 0.8682)
    O: (5939, 6113, 6174) (0.9715, 0.9619, 0.9667)
    B-ADJP: (322, 403, 438) (0.7990, 0.7352, 0.7658)
    B-ADVP: (711, 835, 866) (0.8515, 0.8210, 0.8360)
    I-ADVP: (54, 82, 89) (0.6585, 0.6067, 0.6316)
    I-ADJP: (110, 137, 167) (0.8029, 0.6587, 0.7237)
    I-SBAR: (2, 15, 4) (0.1333, 0.5000, 0.2105)
    I-PP: (34, 42, 48) (0.8095, 0.7083, 0.7556)
    B-PRT: (80, 102, 106) (0.7843, 0.7547, 0.7692)
    B-LST: (0, 0, 4) (0.0000, 0.0000, 0.0000)
    B-INTJ: (1, 1, 2) (1.0000, 0.5000, 0.6667)
    I-INTJ: (0, 0, 0) (******, ******, ******)
    B-CONJP: (5, 7, 9) (0.7143, 0.5556, 0.6250)
    I-CONJP: (10, 12, 13) (0.8333, 0.7692, 0.8000)
    I-PRT: (0, 0, 0) (******, ******, ******)
    B-UCP: (0, 0, 0) (******, ******, ******)
    I-UCP: (0, 0, 0) (******, ******, ******)
Macro-average precision, recall, F1: (0.639239, 0.602512, 0.611086)
Item accuracy: 45422 / 47321 (0.9599)
Instance accuracy: 1176 / 2011 (0.5848)
Elapsed time: 0.940000 [sec] (2140.4 [instance/sec])
When we improve the accuracy of a CRF model by tweaking the feature set, it may be useful to see the feature weights assigned by the trainer. You cannot simply read the model file because CRFsuite stores models in a binary format for efficiency. Instead, use the dump command to print a model in plain text format.
$ crfsuite dump CoNLL2000.model
FILEHEADER = {
  magic: lCRF
  size: 28242501
  type: FOMC
  version: 100
  num_features: 0
  num_labels: 23
  num_attrs: 338547
  off_features: 0x30
  off_labels: 0x8B4EE4
  off_attrs: 0x8B5A0C
  off_labelrefs: 0x169C145
  off_attrrefs: 0x169C515
}
LABELS = {
      0: B-NP
      1: B-PP
      2: I-NP
      3: B-VP
      4: I-VP
      5: B-SBAR
      6: O
      7: B-ADJP
      8: B-ADVP
      9: I-ADVP
     10: I-ADJP
     11: I-SBAR
     12: I-PP
     13: B-PRT
     14: B-LST
     15: B-INTJ
     16: I-INTJ
     17: B-CONJP
     18: I-CONJP
     19: I-PRT
     20: B-UCP
     21: I-UCP
     22: I-LST
}
ATTRIBUTES = {
      0: U00=
      1: U01=
      2: U02=Confidence
      3: U03=in
      4: U04=the
      5: U05=/Confidence
      6: U06=Confidence/in
      7: U10=
... (snip) ...
}
TRANSITIONS = {
  (1) B-NP --> B-NP: 2.327985
  (1) B-NP --> B-PP: 4.391125
  (1) B-NP --> I-NP: 30.372649
  (1) B-NP --> B-VP: 7.725525
  (1) B-NP --> B-SBAR: 1.821388
  (1) B-NP --> O: 3.805715
  (1) B-NP --> B-ADJP: 4.801651
  (1) B-NP --> B-ADVP: 3.842473
... (snip) ...
}
TRANSITIONS_FROM_BOS = {
  (2) BOS --> B-NP: 17.875605
  (2) BOS --> B-PP: -0.318745
  (2) BOS --> I-NP: -4.387101
  (2) BOS --> B-VP: -0.383031
  (2) BOS --> I-VP: -1.163315
  (2) BOS --> B-SBAR: 1.368176
  (2) BOS --> O: 2.783132
... (snip) ...
}
TRANSITIONS_TO_EOS = {
  (3) B-NP --> EOS: 16.156051
  (3) B-PP --> EOS: -1.045312
  (3) I-NP --> EOS: -2.762051
  (3) B-VP --> EOS: -0.767247
  (3) I-VP --> EOS: -1.113502
  (3) B-SBAR --> EOS: -2.407145
  (3) O --> EOS: 4.131429
... (snip) ...
}
STATE_FEATURES = {
  (0) U00= --> B-NP: -2.622045
  (0) U00= --> B-PP: -1.562976
  (0) U00= --> I-NP: -2.555526
  (0) U00= --> B-VP: -1.329829
  (0) U00= --> I-VP: -1.152970
  (0) U00= --> B-SBAR: -2.590170
  (0) U00= --> O: -1.584688
  (0) U00= --> B-ADJP: -1.526879
... (snip) ...
}
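Because the dump is plain text, it can be inspected with a script, for instance to find the strongest transitions. The sketch below is an illustration only; the function name read_section is hypothetical, and the regular expression assumes that every entry has the shape "(n) source --> target: weight" seen in the listing above.

```python
import re

WEIGHT_RE = re.compile(r'^\(\d+\) (.+) --> (.+): (-?\d+\.\d+)$')


def read_section(lines, name):
    """Return [(source, target, weight), ...] for one section of the dump."""
    entries, inside = [], False
    for line in lines:
        line = line.strip()
        if line == name + ' = {':
            inside = True
        elif inside and line == '}':
            break
        elif inside:
            m = WEIGHT_RE.match(line)
            if m:  # lines such as '... (snip) ...' are skipped
                entries.append((m.group(1), m.group(2), float(m.group(3))))
    return entries


dump = """TRANSITIONS = {
  (1) B-NP --> B-NP: 2.327985
  (1) B-NP --> B-PP: 4.391125
  (1) B-NP --> I-NP: 30.372649
  (1) B-NP --> B-VP: 7.725525
}
TRANSITIONS_FROM_BOS = {
  (2) BOS --> B-NP: 17.875605
}
""".splitlines()

transitions = read_section(dump, 'TRANSITIONS')
top = max(transitions, key=lambda e: e[2])
print(top)
```

The same function reads the STATE_FEATURES section, where the source is an attribute name instead of a label. The input could be produced by redirecting the output of the dump command to a file.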
          
This tutorial used chunking.py bundled in the CRFsuite distribution to extract attributes from a data set.
In practice, one may need to implement an attribute extractor suitable for a target task.
One can write an attribute extractor in any manner in any programming language.
However, if your data has a fixed length of fields (like the CoNLL 2000 data set) and if you are familiar with Python, it may be a good idea to modify the code of chunking.py. This section explains the structure of the script chunking.py for those who may modify it.
Here is the implementation of chunking.py:
#!/usr/bin/env python
"""
An attribute extractor for chunking.
Copyright 2010,2011 Naoaki Okazaki.
"""
# Separator of field values.
separator = ' '
# Field names of the input data.
fields = 'w pos y'
# Attribute templates.
templates = (
    (('w', -2), ),
    (('w', -1), ),
    (('w',  0), ),
    (('w',  1), ),
    (('w',  2), ),
    (('w', -1), ('w',  0)),
    (('w',  0), ('w',  1)),
    (('pos', -2), ),
    (('pos', -1), ),
    (('pos',  0), ),
    (('pos',  1), ),
    (('pos',  2), ),
    (('pos', -2), ('pos', -1)),
    (('pos', -1), ('pos',  0)),
    (('pos',  0), ('pos',  1)),
    (('pos',  1), ('pos',  2)),
    (('pos', -2), ('pos', -1), ('pos',  0)),
    (('pos', -1), ('pos',  0), ('pos',  1)),
    (('pos',  0), ('pos',  1), ('pos',  2)),
    )
import crfutils
def feature_extractor(X):
    # Apply feature templates to obtain features (in fact, attributes)
    crfutils.apply_templates(X, templates)
    if X:
        # Append BOS and EOS features manually
        X[0]['F'].append('__BOS__')     # BOS feature
        X[-1]['F'].append('__EOS__')    # EOS feature
if __name__ == '__main__':
    crfutils.main(feature_extractor, fields=fields, sep=separator)
            
The implementation is very simple because the common functionality (attribute generation from templates, data I/O, etc.) is implemented in other modules, crfutils.py and template.py.
The script chunking.py defines three important variables:
- separator: the separator character(s) of the input data; this can be overridden by a command-line argument ("-s" option); this example assumes a space character as the separator of fields.
- fields: the field name(s) (ordered from left to right) of the input data, separated by a space character; this can be overridden by a command-line argument ("-f" option); this example assumes that each line of the input data consists of fields named "w", "pos", and "y".
- templates: attribute (feature) templates written as a Python tuple/list object.
It may be sufficient to modify these variables for your input data.
Each element of templates is a tuple/list of (name, offset) pairs, in which name is a field name and offset is an offset relative to the current position. For example, the tuple,
    (('w', -2), ),
            
extracts the value of the 'w' field two tokens before the current position. Note that the comma after ('w', -2) is necessary to define a tuple with one element ('w', -2). The tuple,
    (('w', -1), ('w',  0)),
defines an attribute that concatenates the value of the 'w' field at the previous token and the value of the 'w' field at the current position (i.e., the bigram starting at the previous token).
The function feature_extractor receives a sequence of items (X in this example) read from the input data, and generates necessary attributes.
The argument X presents a list of items; each item is represented by a mapping (dictionary) object from field names to their values.
In the CoNLL chunking task, X[0]['w'] is the word of the first item in the sequence, and X[-1]['pos'] is the part-of-speech tag of the last item in the sequence.
The mapping object of each item in X has a special key 'F' whose value (a list object) stores attributes generated for the item.
Each element of an attribute list must be either a string or a tuple of (name, value) (an attribute with a weight).
In the script chunking.py, feature templates are applied by using crfutils.apply_templates (which may fill attribute lists). The example also generates special attributes, "__BOS__" and "__EOS__" at the beginning and end of a sequence.
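The structure of X is perhaps easiest to see in a self-contained sketch that does not depend on crfutils. The extractor below appends a few hand-written attributes directly to each item's 'F' list; the attribute names 'cap' and 'len' are hypothetical and chosen only to show both a plain string attribute and a (name, value) tuple.

```python
def feature_extractor(X):
    """Append a few hand-written attributes to each item's 'F' list."""
    for item in X:
        w = item['w']
        # A plain string attribute: the word starts with a capital letter.
        if w[:1].isupper():
            item['F'].append('cap')
        # A (name, value) tuple: an attribute with an explicit weight.
        item['F'].append(('len', float(len(w))))
    if X:
        # Mark the beginning and the end of the sequence, as chunking.py does.
        X[0]['F'].append('__BOS__')
        X[-1]['F'].append('__EOS__')


# Each item maps field names to values; 'F' starts as an empty list.
X = [{'w': 'He', 'pos': 'PRP', 'F': []},
     {'w': 'reckons', 'pos': 'VBZ', 'F': []}]
feature_extractor(X)
for item in X:
    print(item['w'], item['F'])
```

In a real extractor, a function like this would be passed to crfutils.main in the same way as in chunking.py, usually after applying the templates.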
In addition to chunking.py, the example directory contains:
- pos.py for part-of-speech tagging. The structure of the code is similar to that of chunking.py; only the field descriptor and attribute templates are different.
- ner.py for named entity recognition. The script is more complicated because it extracts various characteristics (e.g., character shape, prefixes and suffixes of tokens) from the input data.
- template.py for using feature templates compatible with CRF++. Features conditioned on attributes and label bigrams are not supported.