Dissimilarity Datasets for Pattern Recognition

This page provides information and links for a number of dissimilarity datasets collected by the Pattern Recognition Lab of Delft University of Technology. They are stored as Matlab mat-files in the PRTools dataset format. Without PRTools, each dataset loads as a Matlab structure d with the data in d.data and a label index in d.nlab that points to the class names in d.lablist. A number of these datasets were collected with support of the FET programme within the EU FP7, under the SIMBAD project (contract 213250).
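As a rough illustration of how the three fields relate, the following Python sketch mimics the structure with a plain dictionary and resolves the class name of every object. The field values are invented for illustration, and the 1-based indexing of nlab is an assumption based on Matlab conventions.

```python
# Hypothetical stand-in for a loaded dataset structure d. The values
# below are invented for illustration and are not taken from any dataset.
d = {
    "data": [
        [0.0, 1.2, 3.4],
        [1.2, 0.0, 2.8],
        [3.4, 2.8, 0.0],
    ],                      # square dissimilarity matrix, one row per object
    "nlab": [1, 2, 1],      # label index per object (assumed 1-based, as in Matlab)
    "lablist": ["A", "V"],  # class names
}


def class_names(dataset):
    """Map every entry of nlab to its class name in lablist."""
    return [dataset["lablist"][i - 1] for i in dataset["nlab"]]


names = class_names(d)
print(names)
```

Row i of d["data"] then holds the dissimilarities of object i to all objects, and names[i] gives its class.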

Name           # matrices  # classes  # objects
Balls3D                 1          2        200
Balls50D                1          4       2000
CatCortex               1          4         65
Chickenpieces          44          5        446
CoilDelftDiff           1          4        288
CoilDelftSame           1          4        288
CoilYork                1          4        288
DelftGestures           1         20       1500
FlowCyto                4          3        612
GaussM1                 1          2       2000
GaussM02                1          2       2000
NewsGroups              1          4        600
PolyDisH57              1          2       3000
PolyDisM57              1          2       3000
ProDom                  1          4       2604
Protein                 1          4        213
WoodyPlants50           1         14        791
Zongker                 1         10       2000

Balls3D

This dataset has been generated by the DisTools command genballd([100 100], 3, [0.02 0.04]) which generates the given numbers of 3-D balls with sizes [0.02 0.04] in a 3-D hypercube. Balls do not overlap. Dissimilarities are computed as the shortest distance between two points on the surface of two balls. The intention is to study strong examples in which non-Euclidean dissimilarities are informative.
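The dissimilarity described above can be sketched in a few lines of Python. This is not the DisTools implementation: it assumes that each ball is given by a center and a radius, so that the surface-to-surface distance of two non-overlapping balls is the distance between the centers minus both radii. The three example balls and their radii are hypothetical and chosen only to make the effect visible.

```python
import math


def ball_dissimilarity(center_a, radius_a, center_b, radius_b):
    """Shortest distance between the surfaces of two non-overlapping balls."""
    gap = math.dist(center_a, center_b) - radius_a - radius_b
    if gap < 0:
        raise ValueError("balls overlap")
    return gap


# Three hypothetical balls on a line, each given as (center, radius).
a = ((0.0, 0.0, 0.0), 0.1)
b = ((1.0, 0.0, 0.0), 0.4)
c = ((2.0, 0.0, 0.0), 0.1)

d_ab = ball_dissimilarity(*a, *b)
d_bc = ball_dissimilarity(*b, *c)
d_ac = ball_dissimilarity(*a, *c)

# When the direct dissimilarity exceeds the detour via b, the triangle
# inequality fails and the values cannot be Euclidean distances.
triangle_violated = d_ac > d_ab + d_bc
print(d_ab, d_bc, d_ac, triangle_violated)
```

In this example the large middle ball shortens both of its dissimilarities, so the direct path from a to c is longer than the path via b. The size of a ball, which is the class-defining property, is therefore encoded in exactly the part of the dissimilarities that is non-Euclidean.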

References

Links:  PRTools, DisTools toolbox, Balls3D dataset

Balls50D

This dataset has been generated by the DisTools command genballd([500 500 500 500], 50, [0.01 0.02 0.04 0.08]), which generates the given numbers of 50-D balls with sizes [0.01 0.02 0.04 0.08] in a 50-D hypercube. Balls do not overlap. Dissimilarities are computed as the shortest distance between two points on the surface of two balls. The intention is to study strong examples in which non-Euclidean dissimilarities are informative.

References

Links:  PRTools, DisTools toolbox, Balls50D dataset

CatCortex

The cat-cortex data set is provided as a 65x65 dissimilarity matrix describing the connection strengths between 65 cortical areas of a cat from four regions (classes): auditory (A), frontolimbic (F), somatosensory (S) and visual (V). The data was collected by [Scannell] and used for classification [Graepel] and clustering [Denoeux and Masson]. The dissimilarity values are measured on an ordinal scale.

References

Links

CoilDelftDiff

This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. Graphs are compared by the JoEig approach in an eigenspace whose dimensionality is determined by the smallest graph in every pairwise comparison, see [Lee and Duin].

References

Links: The PRTools version of the data

CoilDelftSame

This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. Distances are obtained in a pairwise fashion in a 5D space of eigenvectors derived from the two graphs by the JoEig approach, see [Lee and Duin].

References

Link: The PRTools version of the data

CoilYork

This is a dissimilarity matrix between a set of graphs derived from four objects of the COIL database, computed by Richard Wilson. The graphs are the Delaunay triangulations derived from corner points found in these images, see [Xia and Hancock]. The distance matrix is constructed by graph matching, using the algorithm of [Gold and Rangarajan].

References

Link: The PRTools version of the data

DelftGestures

This dataset consists of the dissimilarities computed from a set of gestures in a sign-language study. They are measured by two video cameras observing the positions of the two hands in 75 repetitions of creating 20 different signs. The dissimilarities result from a dynamic time warping procedure. The experiments were performed by Gineke ten Holt, Jeroen Arendsen, Robbert Eggermont and Jeroen Lichtenauer, who also prepared the dataset.
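For readers unfamiliar with dynamic time warping, here is a minimal Python sketch of the classic algorithm. It is not the procedure used to build this dataset (the reference below describes a statistical DTW variant on hand positions); it assumes 1-D sequences and an absolute-difference local cost purely for illustration.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat the current item of seq_b
                cost[i][j - 1],      # repeat the current item of seq_a
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]


same_shape = dtw_distance([0, 1, 2], [0, 1, 1, 2])  # same gesture, slower
shifted = dtw_distance([0, 0], [1, 1])              # constant offset
print(same_shape, shifted)
```

The first pair aligns at zero cost because warping absorbs the difference in speed, which is why DTW suits repeated gestures performed at varying pace.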

Reference

Jeroen Lichtenauer, Emile A. Hendriks, Marcel J. T. Reinders: Sign Language Recognition by Combining Statistical DTW and Independent Classification. IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, 2040-2046, 2008.

Link: The PRTools version of the data

FlowCyto

This dissimilarity dataset is based on 612 FL3-A DNA flowcytometer histograms from breast cancer tissues at a resolution of 256 bins. The initial data were acquired by M. Nap and N. van Rodijnen of the Atrium Medical Center in Heerlen, The Netherlands, during 2000-2004, using tubes 3, 4, 5 and 6 of a DACO Galaxy flowcytometer. This yields 4 datasets, one per tube. Histograms are labeled in 3 classes: aneuploid (335 patients), diploid (131) and tetraploid (146). Dissimilarities between normalized histograms are computed using the L1 norm, correcting for possible different calibration factors.
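The L1 part of this measure can be sketched as follows. The sketch assumes that "normalized" means scaling each histogram to unit sum, and it deliberately omits the calibration correction, whose details are not given here.

```python
def normalize(hist):
    """Scale a histogram so that its bins sum to 1 (an assumed normalization)."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return [h / total for h in hist]


def l1_dissimilarity(hist_a, hist_b):
    """L1 distance between two normalized histograms of equal length."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    pa, pb = normalize(hist_a), normalize(hist_b)
    return sum(abs(x - y) for x, y in zip(pa, pb))


same_shape = l1_dissimilarity([1, 1], [2, 2])  # identical after normalization
disjoint = l1_dissimilarity([1, 0], [0, 1])    # no overlapping mass
print(same_shape, disjoint)
```

With unit-sum normalization the value ranges from 0 for identically shaped histograms to 2 for histograms with no overlapping mass.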

Links

GaussM1

The dissimilarity dataset consists of the L1 distances between all points of two 20-dimensional Gaussian distributed sets of 1000 points each. Variances in all directions for both sets are 1. The means of the two sets are equal, except for the first dimension, where they are at a distance of 1. The 20-dimensional set of points is generated by the PRTools command gendats([1000 1000],20).

Links

GaussM02

The dissimilarity dataset consists of the Minkowski distances with exponent p = 0.2 between all points of two 20-dimensional Gaussian distributed sets of 1000 points each. Variances in all directions for both sets are 1. The means of the two sets are equal, except for the first dimension, where they are at a distance of 1. The 20-dimensional set of points is generated by the PRTools command gendats([1000,1000],20).
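A short Python sketch of the Minkowski distance may help to see how this dataset differs from GaussM1. It is a generic formula, not the PRTools code, and the three 2-D points are hypothetical. The L1 distance of GaussM1 is the case p = 1; for p < 1 the triangle inequality can fail.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Three hypothetical points forming a corner.
x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

l1_xz = minkowski(x, z, 1.0)   # city-block distance, as in GaussM1
m_xy = minkowski(x, y, 0.2)
m_yz = minkowski(y, z, 0.2)
m_xz = minkowski(x, z, 0.2)

# For p = 0.2 the direct distance is far larger than the detour via y.
triangle_violated = m_xz > m_xy + m_yz
print(l1_xz, m_xy, m_yz, m_xz, triangle_violated)
```

Points that differ in many coordinates are penalized much more strongly than points that differ in one, which makes the p = 0.2 measure non-metric.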

Links

NewsGroups

This is a small part of the so-called 20Newsgroups data, as considered by Roweis. A non-metric correlation measure for messages from four classes of newsgroups, 'comp.*', 'rec.*', 'sci.*' and 'talk.*', is computed on the occurrence of 100 words across 16242 postings.

Reference

E. Pekalska and R.P.W. Duin, The Dissimilarity Representation for Pattern Recognition, Foundations and Applications, World Scientific, Singapore, 2005.

Links

PolyDisH57

These are the Hausdorff distances between two randomly generated sets of polygons (pentagons and heptagons), both possibly non-convex. Means are made equal and scales are normalized before the distances are computed, but the polygons are not rotated.
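As an illustration of the measure, the following Python sketch computes the Hausdorff distance between two point sets. It assumes that each polygon is represented by its vertices only, which approximates a distance defined over the full contours, and it omits the mean and scale normalization mentioned above. The two shapes are hypothetical.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )


# Two hypothetical vertex lists that differ in one vertex.
shape_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shape_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]

h = hausdorff(shape_a, shape_b)
print(h)
```

A single outlying vertex determines the result, so this measure is sensitive to local deviations of the shape.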

References

Links

PolyDisM57

These are the modified Hausdorff distances between two randomly generated sets of polygons (pentagons and heptagons), both possibly non-convex. Means are made equal and scales are normalized before the distances are computed, but the polygons are not rotated.
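The modified Hausdorff distance replaces the maximum over points by an average. The Python sketch below follows the common definition (the larger of the two mean nearest-point distances). Like any vertex-based version, it assumes polygons are given by their vertices only and leaves out the mean and scale normalization. The two shapes are hypothetical.

```python
import math


def mean_nearest_distance(points_a, points_b):
    """Average distance from each point of A to its nearest point of B."""
    nearest = [min(math.dist(a, b) for b in points_b) for a in points_a]
    return sum(nearest) / len(nearest)


def modified_hausdorff(points_a, points_b):
    """Larger of the two directed mean nearest-point distances."""
    return max(
        mean_nearest_distance(points_a, points_b),
        mean_nearest_distance(points_b, points_a),
    )


# Two hypothetical vertex lists that differ in one vertex.
shape_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shape_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]

mh = modified_hausdorff(shape_a, shape_b)
print(mh)
```

Because deviations are averaged over all vertices, one outlying vertex contributes only a fraction of its distance, which makes this variant less sensitive to local differences than the maximum-based measure.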

References

Links

ProDom

ProDom is a comprehensive set of protein domain families [Corpet]. A subset of 2604 protein domain sequences was selected from it by [Roth]. These were chosen based on a high similarity to at least one sequence contained in the first four folds of the SCOP database. The pairwise structural alignments were computed by [Roth]. Each SCOP sequence belongs to a group, as labeled by experts [Murzin]. The same four classes are assigned here.

References

Links

The PRTools version of ProDom

Protein

The protein data are provided as a 213x213 dissimilarity matrix comparing the protein sequences based on the concept of an evolutionary distance. It was used for classification in [Graepel] and for clustering in [Denoeux and Masson]. There are four classes of globins: heterogeneous globin (G), hemoglobin-A (HA), hemoglobin-B (HB) and myoglobin (M).

References

Links

WoodyPlants50

This dataset of shape dissimilarities between leaves is a small part of the data collected in a study on woody plants. This particular subset has been donated by David Jacobs of University College Maryland and consists of examples of 14 species for which more than 50 leaves per class are available.

References

Links

Zongker

These similarities between 2000 handwritten digits in 10 classes are based on deformable template matching. The dissimilarity measure is the result of an iterative optimization of the non-linear deformation of the grid, see the study by Jain and Zongker. The data has been made available by them to Pekalska, who used it in a slightly modified version (symmetrized dissimilarities) in several studies.
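How the dissimilarities were symmetrized is not stated here. One common choice, shown in the Python sketch below as an assumption only, is to average each matrix with its transpose. The small matrix is invented for illustration.

```python
def symmetrize(matrix):
    """Average a square matrix with its transpose (an assumed symmetrization)."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    return [
        [0.5 * (matrix[i][j] + matrix[j][i]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical asymmetric dissimilarities between two objects.
asymmetric = [[0.0, 2.0], [4.0, 0.0]]
symmetric = symmetrize(asymmetric)
print(symmetric)
```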

References

Link

The PRTools version of the dataset

Chickenpieces

There are 44 dissimilarity matrices made available by Bunke et al. for the Chickenpieces binary images, based on shape distances between the contours. Every entry in a dissimilarity matrix is a weighted edit distance between two strings representing the contours of 2D blobs. Contours are approximated by vectors of lengths 5, 7, 10, 15, 20, 25, 29, 30, 31, 35 and 40. Angles between vectors are used as replacement costs. For each of these 11 lengths, the costs for insertion and deletion are taken as 45, 60, 90 and 120, which gives 11 x 4 = 44 matrices.
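The idea of a weighted edit distance on contour strings can be sketched as follows. This is a generic dynamic-programming version, not the procedure of Bunke and Buhler: it assumes each contour is a list of angles in degrees, takes the absolute angle difference as replacement cost, and ignores the cyclic nature of closed contours. The example angle lists are hypothetical.

```python
def weighted_edit_distance(angles_a, angles_b, indel_cost):
    """Edit distance with angle-difference substitutions and fixed indel cost."""
    n, m = len(angles_a), len(angles_b)
    # table[i][j] is the cheapest way to turn the first i symbols of
    # angles_a into the first j symbols of angles_b.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * indel_cost
    for j in range(1, m + 1):
        table[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = table[i - 1][j - 1] + abs(angles_a[i - 1] - angles_b[j - 1])
            delete = table[i - 1][j] + indel_cost
            insert = table[i][j - 1] + indel_cost
            table[i][j] = min(substitute, delete, insert)
    return table[n][m]


small_turn = weighted_edit_distance([0], [30], 45)   # replacing is cheaper
large_turn = weighted_edit_distance([0], [100], 45)  # delete plus insert is cheaper
print(small_turn, large_turn)
```

The insertion and deletion cost caps how much any single angle mismatch can contribute, which is one way to read why the same contours are provided at four different cost settings.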

Reference

H. Bunke, U. Bühler, Applications of approximate string matching to 2D shape recognition, Pattern Recognition 26 (1993) 1797-1812.

Links
