Classias - Usage

How to install

For Windows environments, the easiest way is to use the binary distribution of Classias. This binary requires the Microsoft Visual C++ 2008 SP1 Redistributable Package to be installed on your computer. For Linux environments, build a binary from the source distribution. Refer to the section "How to compile" for more information about compilation.

Three task types and data formats

Classias supports three task types: binary classification, multi-class classification, and candidate classification.

Binary classification

Binary classification classifies a given instance into positive (+1) or negative (-1) based on the features (attributes) of the instance. An instance label must be "+1", "1", or "-1", optionally followed by a colon ":" and an instance weight. If the instance weight is omitted, Classias sets it to 1.0. Classias accepts a string as a feature name (identifier), optionally followed by a colon ":" and the feature value. If the feature value is omitted, Classias sets it to 1.0. This is an example of binary classification in the application of sentiment analysis (polarity identification).

+1 this:0.447 soup:0.447 has:0.447 substantial:0.447 broth:0.447
-1 this:0.408 soup:0.408 contains:0.408 strong:0.408 chemical:0.408 additive:0.408
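
Instance and feature weights are optional; for example, the following hypothetical line (not from any distribution data) gives the instance a weight of 2.0 and leaves every feature value at the default of 1.0:

+1:2.0 this soup has substantial broth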

This is the BNF notation of the data format for binary classification. The format is upward compatible with those of SVMlight and LIBSVM: whereas SVMlight and LIBSVM require integer feature identifiers, Classias also accepts strings as feature identifiers. The separator character between a label and features is a space (configurable), and the separator between a name and a value is a colon (configurable). For unlabeled data, leave the label field empty; it is not allowed to omit the label field entirely.

<line>       ::= <comment> | <instance>
<comment>    ::= '#' <character>+ '\n'
<instance>   ::= <label> (' ' <feature>)+ '\n'
<label>      ::= <class> | <class> ':' <weight>
<feature>    ::= <name> | <name> ':' <weight>
<class>      ::= "+1" | "1" | "-1"
<name>       ::= (<letter> | ^' ' | ^':')+
<weight>     ::= <numeric>

Multi-class classification

Multi-class classification classifies an instance into one label from a fixed set of labels based on the attributes of the instance. An instance label can be a string, optionally followed by a colon ":" and an instance weight. If the instance weight is omitted, Classias sets it to 1.0. Classias accepts a string as an attribute name (identifier), optionally followed by a colon ":" and the attribute value. If the attribute value is omitted, Classias sets it to 1.0. This is an example of multi-class classification in the application of query categorization. In this data set, the query "new york yankees" should be classified into the sport category, "new york statue of liberty" into the travel category, and "new york dow intel" into the finance category.

sport new york yankees
travel new york statue of liberty
finance new york dow intel
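
Weights are optional here as well; a hypothetical line with an instance weight of 2.0 and explicit attribute values could look like this:

sport:2.0 new:0.58 york:0.58 yankees:0.58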

This is the BNF notation of the data format for multi-class classification. The format is upward compatible with those of SVMlight and LIBSVM: whereas SVMlight and LIBSVM require integer attribute identifiers, Classias also accepts strings as attribute identifiers. The separator character between a label and features is a space (configurable), and the separator between a name and a value is a colon (configurable). For unlabeled data, leave the label field empty; it is not allowed to omit the label field entirely.

<line>       ::= <comment> | <instance>
<comment>    ::= '#' <character>+ '\n'
<instance>   ::= <label> (' ' <feature>)+ '\n'
<label>      ::= <name> | <name> ':' <weight>
<feature>    ::= <name> | <name> ':' <weight>
<name>       ::= (<letter> | ^' ' | ^':')+
<weight>     ::= <numeric>

Classias generates features automatically by combining instance labels and attributes. Therefore, we call the elements representing the characteristics of an instance attributes rather than features.
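
For example, given the query-categorization data above, the attribute "york" would be paired with labels to yield features such as the following (an illustrative sketch of the idea; with the dense feature set described in the training section below, every label-attribute pair is generated, while the sparse feature set keeps only the pairs observed in the training data):

(sport, york)   (travel, york)   (finance, york)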

Candidate classification

Candidate classification chooses the most suitable candidate from a given set of candidates for an instance. For example, we can formalize the lemmatization task as candidate classification, where a model chooses the most suitable lemma from a given set of four candidate lemmas, {"studie", "studi", "study", "stud"}. We first show an example of training data.

@boi
# studie
- s_ es_e ies_ie dies_die udies_udie tudies_tudie studies_studie
# studi
- es_ ies_i dies_di udies_udi tudies_tudi studies_studi
# study
+ ies_y dies_dy udies_udy tudies_tudy studies_study
# stud
- ies_ dies_d udies_ud tudies_tud studies_stud
@eoi

@boi
...
@eoi

An instance begins with a line "@boi" and terminates with a line "@eoi". An instance contains candidates in the lines between "@boi" and "@eoi", one candidate per line. Classias ignores lines starting with "#" as comment lines; thus, the instance in the above example contains four candidates. A candidate line starts with the character '+' if the candidate is true or '-' otherwise, followed by a list of features. The data set indicates that the third candidate (in the line just after the comment line "# study") should be chosen. It is possible to use strings as feature identifiers for each candidate. In this example, we design features that represent substitution rules from an input string into a lemma.

This is the BNF notation of the data format for candidate classification. A candidate label begins with either '+' (true candidate) or '-' (false candidate). It is possible to give an optional label name to each candidate by appending the name just after '+' or '-' (see the short example after the BNF). Label names are used only for computing precision, recall, and F1 scores. The separator character between a label and features is a space (configurable), and the separator between a name and a value is a colon (configurable). For unlabeled data, leave the label field empty; it is not allowed to omit the label field entirely.

<line>          ::= <comment> | <boi> | <eoi> | <unregularize> | <candidate> | '\n'
<comment>       ::= "#" <character>+ '\n'
<boi>           ::= "@boi" [ ':' <weight> ] '\n'
<eoi>           ::= "@eoi" '\n'
<unregularize>  ::= "@unregularize" (' ' <name>)+ '\n'
<candidate>     ::= <class> [<name>] (' ' <feature>)+ '\n'
<class>         ::= ('+' | '-')
<feature>       ::= <name> [ ':' <weight> ]
<name>          ::= (<letter> | ^' ' | ^':')+
<weight>        ::= <numeric>
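
As a hypothetical illustration of named candidate labels in the lemmatization example, each candidate could carry its lemma as a label name, which would then be used when computing precision, recall, and F1 scores:

@boi
+study ies_y dies_dy udies_udy tudies_tudy studies_study
-studi es_ ies_i dies_di udies_udi tudies_tudi studies_studi
@eoi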

Usage of Classias frontend utilities

The Classias distribution includes frontend utilities for training a model from labeled data (classias-train) and for predicting labels by using the trained model (classias-tag).

Training

Basic usage

The name of the utility is classias-train.

$ classias-train [OPTIONS] [DATA1] [DATA2] [...]

If no DATA is specified in the command-line arguments, the utility reads a training data set from STDIN. Classias supports text files compressed by gzip, bzip2, and xz. Classias uses the suffix of a file name to determine the compression format; it invokes gzip for file names ending with ".gz", bzip2 for ".bz2", and xz for ".xz". Classias assigns group numbers to the given data files, starting from 1. These group numbers are used to specify the data set for holdout evaluation.

Specify a task type with the -t option. A task type is binary classification (-tb), candidate classification (-tc), or multi-class classification (-tm or -tn). We can choose either a sparse feature set (-tn) or a dense feature set (-tm) for multi-class classification. A dense feature set is generated from all possible combinations of attributes and labels in a training set. This broadens the feature space (and the memory requirement), but the training speed may be faster owing to the fast conversion between the attribute-label space and the feature space. A sparse feature set is generated from only those combinations of attributes and labels that exist in the training data. This makes the feature space compact, but the training speed may be slower. It is recommended to use a sparse feature set (-tn) for large-scale problems.

Specify a training algorithm with the -a option. Classias 1.0 provides the training algorithms listed below (currently, SVM is available for binary classification only).

  • L1-regularized logistic regression (maximum entropy): "lbfgs.logistic", "truncated_gradient.logistic"
  • L2-regularized logistic regression (maximum entropy): "lbfgs.logistic", "pegasos.logistic"
  • L1-regularized L1-loss SVM: "truncated_gradient.hinge"
  • L2-regularized L1-loss SVM: "pegasos.hinge"
  • averaged perceptron: "averaged_perceptron"

In general, the performance of each algorithm depends on the training set. Nonetheless, when choosing a training algorithm, consider the following tendencies, which are roughly true in general.

  • L1 regularization can train a model with feature selection.
  • L2 regularization tends to yield better accuracy than L1 regularization.
  • SVM may yield slightly better accuracy than logistic regression.
  • Logistic regression can compute the conditional probability of labels.
  • Averaged perceptron is quite fast for training, but yields lower accuracy than the other algorithms.
  • Online training algorithms (pegasos, truncated_gradient, averaged_perceptron) approach the solution more quickly than batch training algorithms (lbfgs), but there is no big difference in the final classification accuracy.

This is an example of applying Pegasos SVM to the binary classification data set rcv1_train.binary.bz2.

$ classias-train -tb -a pegasos.hinge rcv1_train.binary.bz2
Classias 1.0 trainer Copyright (c) 2008,2009 Naoaki Okazaki

Task type: binary
Training algorithm: pegasos.hinge
Instance shuffle: false
Bias feature value: 1
Model file:
Instance splitting: 0
Holdout group: -1
Cross validation: false
Attribute filter:
Start time: 2009-09-21T15:13:51Z

Reading the data set from 1 files
- 1 (bzip2): rcv1_train.binary.bz2
Number of instances: 20242
Number of groups: 1
Number of attributes: 44504
Number of labels: 2
Number of features: 44504
Seconds required: 1.77

Pegasos for linear classifier (binary) with hinge loss
c: 1
n: 20242
eta: 0.1
sample: shuffle
max_iterations: 1000
period: 20
epsilon: 0.0001

***** Iteration #1 *****
Loss: 5929.56
Feature L2-norm: 26.3076
Learning rate (eta): 0.083334
Total number of feature updates: 20242
Seconds required for this iteration: 0.04

***** Iteration #2 *****
Loss: 3448.83
Feature L2-norm: 29.3977
Learning rate (eta): 0.0714291
Total number of feature updates: 40484
Seconds required for this iteration: 0.04

 (...snip...)

***** Iteration #124 *****
Loss: 2628.7
Feature L2-norm: 36.5945
Learning rate (eta): 0.00387597
Total number of feature updates: 2510008
Loss variance: 9.5071e-05
Seconds required for this iteration: 0.03

Terminated with the stopping criterion

Seconds required: 4.32

Finish time: 2009-09-21T15:13:57Z

The next command line applies logistic regression using L-BFGS to the data set rcv1_train.binary.bz2, and stores the trained model in the file rcv1.binary.model.

$ classias-train -tb -a lbfgs.logistic -m rcv1.binary.model rcv1_train.binary.bz2

Evaluation during training

Classias supports holdout evaluation and cross validation for evaluating a model during training. A holdout evaluation computes the performance of the model on a test data set. An N-fold cross validation splits the training set into N groups and repeats the following process N times: train a model using (N-1) groups, and evaluate the model on the remaining group. In order to use holdout evaluation, give the data sets for training and testing in separate files, and specify the group number of the test set. This is an example where Classias reads news20.bz2 and news20.t.bz2 for multi-class classification, and evaluates the model on the data set with group number 2 (news20.t.bz2).

$ classias-train -tn -e2 -a lbfgs.logistic news20.bz2 news20.t.bz2
Classias 1.0 trainer Copyright (c) 2008,2009 Naoaki Okazaki

Task type: multi-sparse
Training algorithm: lbfgs.logistic
Instance shuffle: false
Bias feature value: 1
Model file:
Instance splitting: 0
Holdout group: 2
Cross validation: false
Attribute filter:
Start time: 2009-09-21T16:09:06Z

Reading the data set from 2 files
- 1 (bzip2): news20.bz2
- 2 (bzip2): news20.t.bz2
Number of instances: 19928
Number of groups: 2
Number of attributes: 62061
Number of labels: 20
Number of features: 252009
Seconds required: 1.32

Multi-class logistic regression using L-BFGS
c1: 0
c2: 1
lbfgs.num_memories: 6
lbfgs.epsilon: 1e-05
lbfgs.stop: 10
lbfgs.delta: 1e-05
lbfgs.max_iterations: 2147483647
lbfgs.linesearch: MoreThuente
lbfgs.max_linesearch: 20
lbfgs.regularization_start: 0

***** Iteration #1 *****
Loss: 41104.3
Feature L2-norm: 1
Error norm: 6065.1
Active features: 250043 / 252009
Line search trials: 1
Line search step: 0.000104118
Seconds required for this iteration: 9.78
Accuracy: 0.5352 (2137/3993)
Micro P, R, F1: 0.5352 (2137/3993), 0.5352 (2137/3993), 0.5352
Macro P, R, F1: 0.6927, 0.5350, 0.5273

 (...snip...)

***** Iteration #198 *****
Loss: 4003.62
Feature L2-norm: 44.7856
Error norm: 2.27719
Active features: 250043 / 252009
Line search trials: 1
Line search step: 1
Seconds required for this iteration: 5.63
Accuracy: 0.8232 (3287/3993)
Micro P, R, F1: 0.8232 (3287/3993), 0.8232 (3287/3993), 0.8232
Macro P, R, F1: 0.8244, 0.8231, 0.8234

L-BFGS terminated with the stopping criteria
Seconds required: 1185.1

Finish time: 2009-09-21T16:28:52Z

In order to use N-fold cross validation, specify the number of data splits (N) with the -g option, and activate cross validation with the -x option. Classias splits the data set automatically for N-fold cross validation.

$ classias-train -tb -g5 -x rcv1_train.binary.bz2
Classias 1.0 trainer Copyright (c) 2008,2009 Naoaki Okazaki

Task type: binary
Training algorithm: truncated_gradient.hinge
Instance shuffle: true
Bias feature value: 1
Model file:
Instance splitting: 5
Holdout group: -1
Cross validation: true
Attribute filter:
Start time: 2009-09-21T05:10:11Z

Reading the data set from 1 files
- 1 (bzip2): rcv1_train.binary.bz2
Number of instances: 20242
Number of groups: 5
Number of attributes: 44505
Number of labels: 2
Number of features: 44505
Seconds required: 1.86

===== Cross validation (1/5) =====
Truncated Gradient for linear classifier (binary) with hinge loss
c: 1
n: 20242
eta: 0.1
truncate_period: 1
sample: shuffle
max_iterations: 1000
period: 20
epsilon: 0.0001

***** Iteration #1 *****
Loss: 5998.9
Feature L1-norm: 1136.92
Feature L2-norm: 649.36
Learning rate (eta): 0.0925928
Active features: 14311 / 44505
Total number of feature updates: 16192
Seconds required for this iteration: 0.04
Accuracy: 0.9518 (3854/4049)
Micro P, R, F1: 0.9625 (2025/2104), 0.9458 (2025/2141), 0.9541

 (...snip...)

***** Iteration #307 *****
Loss: 2369.45
Feature L1-norm: 1363.55
Feature L2-norm: 2910.78
Learning rate (eta): 0.0039125
Active features: 8283 / 44505
Total number of feature updates: 4971250
Loss variance: 7.91011e-05
Seconds required for this iteration: 0.04
Accuracy: 0.9637 (3902/4049)
Micro P, R, F1: 0.9650 (2069/2144), 0.9664 (2069/2141), 0.9657

Terminated with the stopping criterion

Seconds required: 13.32

===== Cross validation (2/5) =====
Truncated Gradient for linear classifier (binary) with hinge loss
c: 1
n: 20242
eta: 0.1
truncate_period: 1
sample: shuffle
max_iterations: 1000
period: 20
epsilon: 0.0001

 (...snip...)

===== Cross validation (5/5) =====

 (...snip...)

***** Iteration #319 *****
Loss: 2417.71
Feature L1-norm: 1374.15
Feature L2-norm: 2919.17
Learning rate (eta): 0.00377065
Active features: 7753 / 44505
Total number of feature updates: 5165885
Loss variance: 9.42357e-05
Seconds required for this iteration: 0.04
Accuracy: 0.9708 (3930/4048)
Micro P, R, F1: 0.9725 (2019/2076), 0.9707 (2019/2080), 0.9716

Terminated with the stopping criterion

Seconds required: 13.82

Finish time: 2009-09-21T05:11:23Z

Options for accuracy and speed

In addition to the choice of training algorithm itself, the parameters of the training algorithm greatly affect the performance of training and classification. Parameters of training algorithms can be configured with the -p (--set) option.

The most important parameter for training algorithms is the regularization coefficient, which controls the penalty for overfitting. A smaller regularization coefficient favors models that fit the training data well, but may cause overfitting. A larger regularization coefficient favors simple models to avoid overfitting, but may cause underfitting. Unfortunately, there is no generic rule for determining the regularization coefficient because the optimal value depends on the nature of the data set. We usually tune this parameter by measuring the accuracy of the model on a development data set. The regularization coefficient is configurable as "c" ("c1" or "c2" for the lbfgs.logistic algorithm).

Some online training algorithms have a parameter "eta" for controlling the learning rate and a parameter "epsilon" for detecting convergence. These parameters also affect the accuracy and speed of training.
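
For example, the following command line (assuming the -p option takes NAME=VALUE pairs; the values here are purely illustrative) would set the regularization coefficient to 0.5 and the initial learning rate to 0.05 for Pegasos SVM:

$ classias-train -tb -a pegasos.hinge -p c=0.5 -p eta=0.05 rcv1_train.binary.bz2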

Classification

The name of the utility is classias-tag.

$ classias-tag [OPTIONS]

This utility reads a data set from STDIN, and writes the classification results into STDOUT.

Specify a model file for classification with the -m option. This is an example where Classias tags the instances in rcv1_test.binary.bz2 with the model rcv1.binary.model.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model
 (...snip...)
-1
-1
-1
+1
-1
+1
+1
+1
-1
-1
 (...snip...)

Add the -w option to see the scores of label predictions. Classias outputs each label prediction followed by a colon (':') and the score for the instance.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -w
 (...snip...)
-1:-1.677
-1:-1.03897
-1:-0.812968
+1:0.653437
-1:-0.591461
+1:0.936539
+1:0.98442
+1:0.444249
-1:-2.6872
-1:-1.17141
 (...snip...)

Add the -p option to see the probability estimates of label predictions. Classias outputs each label prediction followed by a colon (':') and the probability for the instance. The probability estimates are valid only for models trained by logistic regression.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -p
 (...snip...)
-1:0.157493
-1:0.261349
-1:0.307258
+1:0.657785
-1:0.3563
+1:0.7184
+1:0.727984
+1:0.609271
-1:0.0637331
-1:0.236601
 (...snip...)

The utility classias-tag treats the labels annotated on the instances in the input data as reference labels. Use the -r option to output the reference label and the predicted label side by side for each instance. This feature is useful for measuring the performance (e.g., accuracy) of a classifier.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -r
 (...snip...)
-1 -1
-1 -1
-1 -1
+1 +1
-1 +1
+1 +1
+1 +1
+1 +1
-1 -1
-1 -1
 (...snip...)
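
As a sketch of how this output might be consumed, the following pipeline computes the accuracy by counting lines whose reference and predicted labels agree (the -qt option shown later reports such metrics directly):

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -r | \
  awk 'NF == 2 { n++; if ($1 == $2) c++ } END { printf "Accuracy: %.4f (%d/%d)\n", c/n, c, n }'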

Use the -f option to view the classification results of instances whose predictions are incorrect.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -f
 (...snip...)
-1 +1
+1 -1
+1 -1
+1 -1
+1 -1
-1 +1
-1 +1
+1 -1
-1 +1
+1 -1
 (...snip...)

The utility classias-tag can measure accuracy, precision, recall, and F1 scores (the -t option). It may be convenient to combine it with the -q option, which suppresses label predictions in the output.

$ bzcat rcv1_test.binary.bz2 | classias-tag -m rcv1.binary.model -qt
Accuracy: 0.9632 (652464/677399)
Micro P, R, F1: 0.9648 (343048/355571), 0.9651 (343048/355460), 0.9649

Usage for the other task types (multi-class classification and candidate classification) is the same. It is unnecessary to specify a task type as a command-line argument because classias-tag detects the task type from the model file.

$ bzcat news20.t.bz2 | classias-tag -m news20.model
2
4
2
2
2
2
2
12
2
2
2
 (...snip...)

Use the -a option to output information for each candidate label of an instance. Combined with the -p option, this feature reports the probability estimates of candidate labels. The following example shows that the probability of the instance being classified as label "2" is 0.99977 (the highest of all candidates) and that the classifier predicts "2" as the label for the instance (marked with '+').

$ bzcat news20.t.bz2 | classias-tag -m news20.model -ap
@boi
-1:6.54488e-06
+2:0.99977
-3:2.80506e-13
-4:3.09902e-13
-5:0.000223139
-6:5.02903e-12
-7:5.59272e-14
-8:3.59528e-14
-9:9.96966e-13
-10:8.64676e-15
-11:1.26634e-13
-12:3.97081e-13
-13:4.16725e-14
-14:7.44803e-16
-15:2.57823e-15
-16:1.41428e-15
-17:1.07985e-15
-18:5.30595e-14
-19:2.34302e-13
-20:1.2039e-13
@eoi
 (...snip...)

In addition, it may be useful to combine the -a option with the -r and -f options. This is an example of error analysis where the classifier predicts "4" as the label for an instance although the reference label is "2".

$ bzcat news20.t.bz2 | classias-tag -m news20.model -fap
@boi
--1:0.0654903
+-2:0.167662
--3:0.109992
-+4:0.251968
--5:0.204841
--6:0.012788
--7:0.019802
--8:0.0267658
--9:0.00712349
--10:0.020578
--11:0.0186353
--12:0.0121116
--13:0.0120979
--14:0.0119843
--15:0.0108746
--16:0.00433983
--17:0.0117268
--18:0.00491985
--19:0.0068262
--20:0.0194728
@eoi
 (...snip...)

How to compile

Build with Visual Studio 2008 SP1

In order to compile Classias with Visual Studio 2008 SP1, open the solution file classias.sln. Classias requires the header and library files of libLBFGS. Create a directory win32/liblbfgs under the root directory of the distribution, and place lbfgs.h, lbfgs.lib (Release build), and lbfgs_debug.lib (Debug build) there.
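
Assuming the distribution unpacks into a directory named classias (the exact layout may differ by version), the result would look roughly like this:

classias/
  classias.sln
  win32/
    liblbfgs/
      lbfgs.h
      lbfgs.lib
      lbfgs_debug.lib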

Classias uses unordered_map and regex from Technical Report 1 (TR1), a set of extensions for the next generation of the C++ standard. For this reason, Classias requires Visual Studio 2008 Service Pack 1 or later.

Build with g++

This section shows how to compile Classias with g++. In order to build Classias, it is necessary to have libLBFGS installed on the computer. In addition, Classias tries to use unordered_map from Technical Report 1 (TR1) or the Boost library if available. It is possible to build Classias without this class, but it is recommended to have unordered_map because it greatly affects performance.

First, build and install libLBFGS. Since installation to the default location often requires root privileges, you may want to choose an install directory by passing the --prefix option to the configure script.

$ wget http://www.chokkan.org/software/dist/liblbfgs-1.8.tar.gz
$ tar xvzf liblbfgs-1.8.tar.gz
$ cd liblbfgs-1.8
$ ./configure [--prefix=/path/to/a/temporary/directory]
$ make
$ make install

Now we are ready to build Classias. Run the configure script of Classias. If libLBFGS is installed in a non-standard directory, specify the directory name with the --with-liblbfgs option of the configure script.

$ ./configure [--with-liblbfgs=/path/to/a/temporary/directory]
$ make
$ make install

If g++ and libstdc++ support unordered_map, the above procedure may be enough to build Classias. If the build environment (g++) does not support unordered_map, you may want to install the Boost library and specify the directory of the Boost include headers with --with-boost-include (e.g., ~/local/include/boost-1_39).

$ ./configure --with-boost-include=~/local/include/boost-1_39 \
              [--with-liblbfgs=/path/to/a/temporary/directory]
$ make
$ make install