***Bregman Ball Trees (bbtrees)***
Lawrence Cayton
work@lcayton.com
(C) Copyright 2008, Lawrence Cayton
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
with this program. If not, see .
-----------------------------------------------------------
This is a C implementation of the bregman ball tree data structure
described in
L. Cayton, Fast nearest neighbor retrieval for bregman divergences.
Twenty-Fifth International Conference on Machine Learning (ICML), 2008.
The code provides an implementation of the build, search, and
approximate search algorithms described in the paper. The software
currently supports the following divergences:
* l_2^2
* KL-divergence (relative entropy)
* The conjugate to the KL-divergence (exponential)
* Itakura-Saito divergence
It is simple to extend to other divergences.
The code is not seriously optimized; it can probably be sped up
with a bit of programming expertise.
-----------------------------------------------------------
FILES
* bbtree.{h,c} -- contains the build algorithm and the definition of
the bbtree data structure.
* bregdiv.{h,c} -- defines the bregdiv data structure and provides
implementations of the supported divergences. The bregdiv data
structure is simply a struct with function pointers to the appropriate
divergence functions.
* search.{h,c} -- contains the search algorithms.
* test.c -- code for experimenting with the bbtree data
structure.
* utils.{h,c} -- contains procedures for saving and loading bbtrees,
and a variety of other utilities used by the bbtree software. All
constants are defined here.
-----------------------------------------------------------
COMPILING
Type make in a shell. Compilation requires GCC.
-----------------------------------------------------------
USING
To deploy the bbtree data structure in an application, you will likely
need to integrate this code into yours. The test.c file provides an
example of how to use the software. To experiment with it, type
$ testBBT
at the prompt and a list of options will be displayed.
The default bucketsize is 50, which was used in the exact NN
experiments in the paper. A bucketsize of 10 was used for the
approximate NN experiments in the paper---this is slower and was
only used so that the tradeoff between execution time and solution
quality could be examined closely.
Approximate NN search can be accomplished in two ways. First, one can
pass epsilon > 0 to the search procedure. The returned point is
guaranteed to be a (1+epsilon)-approximate NN---i.e. a point x is
returned that satisfies
d(x,q) <= (1+epsilon)d(x_q,q),
where x_q is the true NN. This functionality was not explored in the
paper. The second method is by early stopping. To use this feature,
one must specify a bound on the number of leaves that will be
explored. The approximate NN plots in the paper were produced by
varying the leaf bound from 2^0 to 2^{log n} (i.e., full exploration)
in increments of powers of 2.
To add your own bregman divergence, follow the prescription in
bregdiv.{h,c}. You must create functions for computing
* the divergence between two vectors;
* the gradient;
* the gradient of the conjugate function (ie, the inverse of the
gradient);
* the matrix of divergences between two sets of vectors.
These functions are then encapsulated in a struct of function pointers.
-----------------------------------------------------------
DATA FORMAT FOR THE TEST PROGRAM
The data must be in plain text, one vector per line, and separated by
spaces. All points are read as doubles.
-----------------------------------------------------------
ISSUES
Unbounded interpoint distances can trip up the k-means implementation
used in the build procedure. The problem is that k-means is currently
initialized with a random point, rather than an average.
-----------------------------------------------------------
QUESTIONS, COMMENTS, ETC
Please contact work@lcayton.com with any bug reports, questions,
comments, suggestions, significant optimizations, etc. I'm also
interested to hear about any applications of the bbtree.