DASTrie Tutorial

Preparation

Put the header file "dastrie.h" to a INCLUDE path, and include the file to a C++ source code. It's that simple.
#include <dastrie.h>

Customizing a builder type

First of all, we need to design the types of records (keys and values) for a trie, and derive a specialization of a builder. Key and record types can be specified by the first and second template arguments of dastrie::builder,
dastrie::builder<key_type, value_type, traits_type>

A key type can be either char* or std::string for your convenience. The string class std::string is usually more convenient than char*, but some may prefer char* for efficiency, e.g., allocating a single memory block that can store all of keys and reading keys from a file at a time.

If you would like dastrie::trie to behave like std::set, use dastrie::empty_type as a value type, which is a dummy value type that serializes nothing with a tail array. This is an example of a trie builder without values in which keys are represented by std::string,

typedef dastrie::builder<std::string, dastrie::empty_type> builder_type;

If you would like to use dastrie::trie in a similar manner as std::set, specify a value type to the second template argument. This is an example of a trie builder whose keys are char* and values are double,

typedef dastrie::builder<char*, double> builder_type;

DASTrie supports the following types as a value type : bool, short, unsigned short, int, unsigned int, long, unsigned long, float, double, long double, char*, std::string. In addition, it is possible to use a user-defined type as a value type for a trie. In order to do this, you must define operator<<() and operator>>() for serializing values with a tail array. This is an example of a value type (string_array) that holds an array (vector) of strings and reads/writes the number of strings and their actual strings from/to a tail array,

class string_array : public std::vector<std::string>
{
public:
    friend dastrie::itail& operator>>(dastrie::itail& is, string_array& obj)
    {
        obj.clear();

        uint32_t n;
        is >> n;
        for (uint32_t i = 0;i < n;++i) {
            std::string dst;
            is >> dst;
            obj.push_back(dst);
        }

        return is;
    }

    friend dastrie::otail& operator<<(dastrie::otail& os, const string_array& obj)
    {
        os << (uint32_t)obj.size();
        for (size_t i = 0;i < obj.size();++i) {
            os << obj[i];
        }
        return os;
    }
};

typedef dastrie::builder<std::string, string_array> builder_type;

The third template argument (traits_type) of a builder class is to customize the size of double-array elements. By default, DASTrie uses 5 bytes per double-array element, which can store 0x7FFFFFFF elements in a trie. This footprint is already smaller than other double-array implementations, you can choose 4 bytes per element and save 1 byte per element if your data is small enough to be stored with no longer than 0x007FFFFF elements (note that the number of elements is different from the number of records). Specify dastrie::doublearray4_traits at the third argument for implementing a double array with 4 bytes per element.

Building a trie

After designing a builder class, prepare records to be stored in a trie. You can use dastrie::builder::record_type to define a record type.
typedef builder_type::record_type record_type;

An instance of dastrie::builder::record_type consists of two public member variables, dastrie::builder::record_type::key and dastrie::builder::record_type::value. Records can be represented by an array of records, e.g.,

record_type records[10] = {
    {"eight", 8},   {"five", 5},    {"four", 4},    {"nine", 9},
    {"one", 1},     {"seven", 7},   {"six", 6},     {"ten", 10},
    {"three", 3},   {"two", 2},
};
or a vector of records, e.g.,
std::vector<record_type> records;

Make sure that records are sorted by dictionary order of keys. It may be a good idea to use STL's sort() function if you have unsorted records.

Now you are ready to build a trie. Instantiate the builder class,

builder_type builder;
If necessary, set a callback function by using dastrie::builder::set_callback() to receive progress reports in building a trie. Refer to build.cpp for an actual usage.

Call dastrie::builder::build to build a trie from records. The first argument is the iterator (pointer) addressing to the first record, and the second argument is the iterator (pointer) addressing one past the final record.

builder.build(records, records + 10);

You can store the newly-built trie to a file by using dastrie::builder::write. This method outputs the trie to a binary stream (std::ostream).

std::ofstream ofs("sample.db", std::ios::binary);
builder.write(ofs);

Accessing a trie

The class dastrie::trie provides the read access to a trie. The first template argument specifies the value type of a trie, and the second argument customizes the double array.
dastrie::trie<value_type, traits_type>
The value type and traits for dastrie::trie are required to be the same as those specified for dastrie::builder.

This example defines a trie class trie_type that can access records whose values are represented by int.

typedef dastrie::trie<int> trie_type;

The class dastrie::trie can read a trie structure from an input stream; the dastrie::trie::read() function prepares a trie from an input stream (std::istream). This example reads a trie from an input stream.

std::ifstream ifs("sample.db", std::ios::binary);
trie_type trie;
trie.read(ifs);

Alternatively, you can read a trie structure from a memory block by using the dastrie::trie::assign() function. This function may be useful when you would like to use mmap() API for reading a trie.

Now you are ready to access the trie. Please refer to the sample code for retrieving a record (dastrie::trie::get() and dastrie::trie::find()), checking the existence of a record (dastrie::trie::in()), and retrieving records that are prefixes of keys (dastrie::trie::prefix()).


Copyright (c) 2002-2008 by Naoaki Okazaki
Mon Nov 10 12:28:34 2008