Dissimilarity Datasets for Pattern Recognition
This page gives information and links for a number of dissimilarity datasets collected by the Pattern Recognition Lab of Delft University of Technology. They are stored as Matlab mat-files in the dataset format of PRTools. Without PRTools, each file loads as a Matlab structure d with the data in d.data and a label index in d.nlab that points to the class names in d.lablist. A number of these datasets were collected with support of the FET programme within the EU FP7, under the SIMBAD project (contract 213250).
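The label lookup implied by this layout can be sketched as follows. This is a minimal illustration in Python rather than Matlab: the dictionary is a hypothetical stand-in for the structure d, its values are made up, and it assumes that d.nlab holds 1-based indices into d.lablist, as is usual in Matlab.

```python
# Hypothetical stand-in for the Matlab structure d described above.
# All values are invented for illustration; a real file holds an N x N matrix.
d = {
    "data": [
        [0.0, 1.2, 3.4],
        [1.2, 0.0, 2.5],
        [3.4, 2.5, 0.0],
    ],
    "nlab": [1, 2, 1],                # assumed 1-based indices into lablist
    "lablist": ["classA", "classB"],  # made-up class names
}


def class_names(d):
    """Resolve the class name of every object via nlab -> lablist."""
    return [d["lablist"][i - 1] for i in d["nlab"]]


print(class_names(d))
```

Row i of d["data"] then holds the dissimilarities of object i to all objects, and class_names(d)[i] gives its class.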
Name | # matrices | # classes | # objects |
Balls3D | 1 | 2 | 200 |
Balls50D | 1 | 4 | 2000 |
CatCortex | 1 | 4 | 65 |
Chickenpieces | 44 | 5 | 446 |
CoilDelftDiff | 1 | 4 | 288 |
CoilDelftSame | 1 | 4 | 288 |
CoilYork | 1 | 4 | 288 |
DelftGestures | 1 | 20 | 1500 |
FlowCyto | 4 | 3 | 612 |
GaussM1 | 1 | 2 | 2000 |
GaussM02 | 1 | 2 | 2000 |
NewsGroups | 1 | 4 | 600 |
PolyDisH57 | 1 | 2 | 3000 |
PolyDisM57 | 1 | 2 | 3000 |
ProDom | 1 | 4 | 2604 |
Protein | 1 | 4 | 213 |
WoodyPlants50 | 1 | 14 | 791 |
Zongker | 1 | 10 | 2000 |
This dataset has been generated by the DisTools command genballd([100 100], 3, [0.02 0.04]) which generates the given numbers of 3-D balls with sizes [0.02 0.04] in a 3-D hypercube. Balls do not overlap. Dissimilarities are computed as the shortest distance between two points on the surface of two balls. The intention is to study strong examples in which non-Euclidean dissimilarities are informative.
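The reason this measure is non-Euclidean can be seen from a small sketch. The code below is not the genballd implementation; it only assumes that each ball is given by a centre and a radius (the function name and the example values are made up) and shows that the surface-to-surface distance can violate the triangle inequality.

```python
import math


def ball_surface_distance(c1, r1, c2, r2):
    """Shortest distance between the surfaces of two non-overlapping balls."""
    return math.dist(c1, c2) - r1 - r2


# Three balls on a line; the middle one is large (illustrative values only).
a = ((0.0, 0.0, 0.0), 0.1)
b = ((0.5, 0.0, 0.0), 0.3)
c = ((1.0, 0.0, 0.0), 0.1)

d_ab = ball_surface_distance(*a, *b)
d_bc = ball_surface_distance(*b, *c)
d_ac = ball_surface_distance(*a, *c)

# Going via the large ball b is "shorter" than the direct route from a to c.
print(d_ab + d_bc < d_ac)
```

Because the radii are subtracted, a large ball acts as a shortcut between two small ones, so the measure is not a metric and cannot be embedded in a Euclidean space without error.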
Links: PRTools, DisTools toolbox, Balls3D dataset
This dataset has been generated by the DisTools command genballd([500 500 500 500], 50, [0.01 0.02 0.04 0.08]) which generates the given numbers of 50-D balls with sizes [0.01 0.02 0.04 0.08] in a 50-D hypercube. Balls do not overlap. Dissimilarities are computed as the shortest distance between two points on the surface of two balls. The intention is to study strong examples in which non-Euclidean dissimilarities are informative.
Links: PRTools, DisTools toolbox, Balls50D dataset
The cat-cortex data set is provided as a 65x65 dissimilarity matrix describing the connection strengths between 65 cortical areas of a cat from four regions (classes): auditory (A), frontolimbic (F), somatosensory (S) and visual (V). The data was collected by [Scannell] and used for classification [Graepel] and clustering [Denoeux and Masson]. The dissimilarity values are measured on an ordinal scale.
This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. Graphs are compared by the JoEig approach in an eigenspace whose dimensionality is determined by the smallest graph in every pairwise comparison, see [Lee and Duin].
Links: The PRTools version of the data
This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. Distances are obtained in a pairwise fashion in a 5D space of eigenvectors derived from the two graphs by the JoEig approach, see [Lee and Duin].
Link: The PRTools version of the data
This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. The distance matrix is constructed by graph matching, using the algorithm of [Gold and Rangarajan].
References
Link: The PRTools version of the data
This dataset consists of the dissimilarities computed from a set of gestures in a sign-language study. They are measured by two video cameras observing the positions of the two hands in 75 repetitions of creating 20 different signs. The dissimilarities result from a dynamic time warping procedure. The experiments were performed by Gineke ten Holt, Jeroen Arendsen, Robbert Eggermont and Jeroen Lichtenauer, who also prepared the dataset.
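For readers unfamiliar with dynamic time warping, a textbook version is sketched below. This is not the procedure used to build the dataset (the reference describes a statistical DTW variant on hand positions); it is a plain DTW on 1-D sequences with an absolute-difference local cost, and the function name is made up.

```python
def dtw_distance(s, t):
    """Textbook dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(s), len(t)
    # cost[i][j] = minimal accumulated cost of aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(s[i - 1] - t[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # s advances, t repeats
                cost[i][j - 1],      # t advances, s repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same gesture performed at a different speed aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the warping path may repeat samples, two executions of a sign that differ only in timing obtain a small dissimilarity.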
Reference
Jeroen Lichtenauer, Emile A. Hendriks, Marcel J. T. Reinders: Sign Language Recognition by Combining Statistical DTW and Independent Classification. IEEE Trans. Pattern Anal. Mach. Intell. vol. 30, 2040-2046, 2008.
Link: The PRTools version of the data
This dissimilarity dataset is based on 612 FL3-A DNA flowcytometer histograms from breast cancer tissues at a resolution of 256. The initial data were acquired by M. Nap and N. van Rodijnen of the Atrium Medical Center in Heerlen, The Netherlands, during 2000-2004, using tubes 3, 4, 5 and 6 of a DACO Galaxy flowcytometer. This yields 4 datasets, one per tube. Histograms are labeled in 3 classes: aneuploid (335 patients), diploid (131) and tetraploid (146). Dissimilarities between normalized histograms are computed using the L1 norm, correcting for possible different calibration factors.
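The core of this measure, the L1 distance between normalized histograms, can be sketched as below. The correction for calibration factors is not described on this page and is therefore left out; normalizing to unit sum is an assumption, and the function names are made up.

```python
def normalize(hist):
    """Scale a histogram so that its bins sum to 1 (assumed normalization)."""
    total = sum(hist)
    return [h / total for h in hist]


def l1_distance(h1, h2):
    """L1 distance between two histograms after normalization."""
    p, q = normalize(h1), normalize(h2)
    return sum(abs(a - b) for a, b in zip(p, q))


# Histograms that differ only in total count have distance 0;
# histograms with disjoint support have the maximal distance 2.
print(l1_distance([1, 1, 2], [2, 2, 4]))
print(l1_distance([1, 0], [0, 1]))
```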
Links
The dissimilarity dataset consists of the L1 distances between all points of two 20-dimensional Gaussian distributed sets of 1000 points each. Variances in all directions for both sets are 1. The means of the two sets are equal, except for the first dimension, where they are a distance of 1 apart. The 20-dimensional set of points is generated by the PRTools command gendats([1000 1000],20).
Links
The dissimilarity dataset consists of the Minkowski 0.2 distances between all points of two 20-dimensional Gaussian distributed sets of 1000 points each. Variances in all directions for both sets are 1. The means of the two sets are equal, except for the first dimension, where they are a distance of 1 apart. The 20-dimensional set of points is generated by the PRTools command gendats([1000,1000],20).
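The difference between the L1 and the Minkowski 0.2 measures can be illustrated with the general Minkowski-p formula. The sketch below is written in Python for illustration and is not the PRTools code; the function name and the example points are made up.

```python
def minkowski(x, y, p):
    """Minkowski-p dissimilarity: (sum_k |x_k - y_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

# p = 1 (L1) is a metric: the direct route is never longer than a detour.
print(minkowski(x, z, 1) <= minkowski(x, y, 1) + minkowski(y, z, 1))

# p = 0.2 is not: d(x, z) is far larger than d(x, y) + d(y, z).
print(minkowski(x, z, 0.2), minkowski(x, y, 0.2) + minkowski(y, z, 0.2))
```

For p < 1 the triangle inequality no longer holds, so the resulting dissimilarities are non-metric.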
Links
This is a small part of the so-called 20Newsgroups data, as considered by Roweis. A nonmetric correlation measure for messages from four classes of newsgroups, 'comp.*', 'rec.*', 'sci.*' and 'talk.*', is computed on the occurrence of 100 words across 16242 postings.
Reference
E. Pekalska and R.P.W. Duin, The Dissimilarity Representation for Pattern Recognition, Foundations and Applications, World Scientific, Singapore, 2005.
Links
These are the Hausdorff distances between the polygons of two randomly generated sets, pentagons and heptagons, both possibly non-convex. Means are made equal and scales are normalized before the distances are computed, but the polygons are not rotated.
References
Links
These are the modified Hausdorff distances between the polygons of two randomly generated sets, pentagons and heptagons, both possibly non-convex. Means are made equal and scales are normalized before the distances are computed, but the polygons are not rotated.
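The two polygon measures can be compared in a short sketch. It assumes that each polygon is represented by its list of vertices (the page does not state whether vertices or full edges were used) and uses the common definition of the modified Hausdorff distance, in which the outer maximum of the directed distance is replaced by a mean. Centring and scaling are omitted, and all function names are made up.

```python
import math


def directed(a, b, reduce):
    """Reduce, over the points of a, the distance to the nearest point of b."""
    return reduce([min(math.dist(p, q) for q in b) for p in a])


def mean(values):
    return sum(values) / len(values)


def hausdorff(a, b):
    """Hausdorff distance between two point sets."""
    return max(directed(a, b, max), directed(b, a, max))


def modified_hausdorff(a, b):
    """Modified Hausdorff distance: mean instead of max in each direction."""
    return max(directed(a, b, mean), directed(b, a, mean))


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]

# A single outlying vertex dominates the Hausdorff distance
# but is averaged out in the modified version.
print(hausdorff(a, b), modified_hausdorff(a, b))
```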
References
Links
ProDom is a comprehensive set of protein domain families [Corpet]. A subset of 2604 protein domain sequences from ProDom was selected by [Roth]. These were chosen based on a high similarity to at least one sequence contained in the first four folds of the SCOP database. The pairwise structural alignments were computed by [Roth]. Each SCOP sequence belongs to a group, as labeled by the experts [Murzin]. The same four classes are assigned here.
References
Links
The protein data are provided as a 213x213 dissimilarity matrix comparing the protein sequences based on the concept of an evolutionary distance. It was used for classification in [Graepel] and for clustering in [Denoeux and Masson]. There are four classes of globins: heterogeneous globin (G), hemoglobin-A (HA), hemoglobin-B (HB) and myoglobin (M).
References
Links
This dataset of shape dissimilarities between leaves is a small part of the data collected in a study on woody plants. This particular subset has been donated by David Jacobs of University College Maryland and consists of examples of 14 species for which more than 50 leaves per class are available.
References
Links
These similarities between 2000 handwritten digits in 10 classes are based on deformable template matching. The dissimilarity measure is the result of an iterative optimization of the non-linear deformation of the grid, see the study by Jain and Zongker. The data has been made available by them to Pekalska, who used it in a slightly modified version (symmetrized dissimilarities) in several studies.
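Symmetrizing a dissimilarity matrix can be done in several ways. The sketch below averages each entry with its transposed counterpart; this is an assumption for illustration, as the page does not state which symmetrization was applied, and the function name is made up.

```python
def symmetrize(d):
    """Return the symmetric matrix (D + D^T) / 2 for a square matrix d."""
    n = len(d)
    return [[0.5 * (d[i][j] + d[j][i]) for j in range(n)] for i in range(n)]


# An asymmetric 2 x 2 example: d(0,1) = 2 but d(1,0) = 4.
print(symmetrize([[0, 2], [4, 0]]))
```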
References
Link
There are 44 dissimilarity matrices made available by Bunke et al. for the Chickenpieces binary images, based on shape distances between the contours. Every entry in a dissimilarity matrix is a weighted edit distance between two strings representing the contours of 2D blobs. Contours are approximated by vectors of lengths 5, 7, 10, 15, 20, 25, 29, 30, 31, 35 and 40. Angles between vectors are used as replacement costs. For each of these 11 lengths, the costs for insertion and deletion are taken as 45, 60, 90 and 120, which gives 11 x 4 = 44 matrices.
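The parameter grid and the kind of edit distance described here can be sketched as follows. This is not the code of Bunke et al.: it assumes that a contour is an open (non-cyclic) string of angles in degrees, that the replacement cost is the absolute angle difference, and that insertion and deletion share one cost. All names are made up.

```python
from itertools import product

LENGTHS = [5, 7, 10, 15, 20, 25, 29, 30, 31, 35, 40]
INDEL_COSTS = [45, 60, 90, 120]

# One dissimilarity matrix per (vector length, insertion/deletion cost) pair.
configs = list(product(LENGTHS, INDEL_COSTS))
print(len(configs))


def weighted_edit_distance(s, t, indel_cost):
    """Edit distance between two angle strings with weighted operations."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost          # delete all of s[:i]
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost          # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + abs(s[i - 1] - t[j - 1]),      # replacement
            )
    return d[n][m]


# One angle changed by 5 degrees versus one extra segment.
print(weighted_edit_distance([10, 20, 30], [10, 25, 30], 45))
print(weighted_edit_distance([10, 20], [10, 20, 30], 45))
```

A higher insertion and deletion cost makes the measure prefer replacing angles over changing the string length, which is why each contour length is combined with several cost values.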
Reference
H. Bunke, U. Buhler, Applications of approximate string matching to 2D shape recognition, Pattern Recognition 26 (1993) 1797-1812.
Links