K-means clustering of data sets with missing values using modified Euclidean distance / Emmylou H. Pulvera

By: Pulvera, Emmylou H

Material type: Text

Language: English

Publication details: 2005

Description: 63 leaves

Subject(s): Undergraduate Thesis AMAT200

Dissertation note: Thesis (BS Applied Mathematics) -- University of the Philippines Mindanao, 2005

Abstract: K-means clustering is the most extensively used clustering algorithm in the field of data analysis. One major problem in data analysis is the occurrence of missing values. Mean imputation and case deletion can produce erroneous conclusions by introducing possibly unreliable estimates and by significantly reducing the data set, respectively. To avoid these problems, the Euclidean distance function used in the allocation step was modified to compute distances between two vectors with some unknown values. The representation step, defined by the center of each cluster, was also modified to compute the mean of each feature in a cluster even when one or more of the cases were incomplete. This modification is an extension of the K-means clustering algorithm for handling missing values. To evaluate the method, different sets of data were simulated from the Iris Data Base to represent different types of missing values with different levels of degradation. The modified algorithm was compared to imputation and case deletion. Results showed that the modified algorithm has higher cluster recovery than the imputation method, while cluster recovery under case deletion was higher than that of the modified K-means. However, the latter was true only for the data points left after deletion. Thus, the modified K-means has the advantage of avoiding loss of information.
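Illustrative note: this record does not reproduce the thesis's formulas, so the following is only a minimal sketch of the general idea described in the abstract, not the author's method. It assumes a commonly used "partial distance" convention: the squared distance is summed over the observed features and rescaled by (total features / observed features), and each cluster center is the per-feature mean of the observed values only. The names `partial_distance`, `feature_means`, and `kmeans_missing` are hypothetical, and `None` is assumed to mark a missing value.

```python
import math


def partial_distance(x, center):
    """Euclidean distance using only the features observed in x.

    Assumption (not taken from the thesis): the squared distance is
    rescaled by total/observed feature counts so that cases with
    different numbers of missing values stay comparable.
    """
    pairs = [(xi, ci) for xi, ci in zip(x, center) if xi is not None]
    if not pairs:
        return math.inf  # a fully missing case gives nothing to compare
    squared = sum((xi - ci) ** 2 for xi, ci in pairs)
    return math.sqrt(squared * len(x) / len(pairs))


def feature_means(members, fallback):
    """Mean of each feature over the observed values in a cluster.

    If no member has a value for a feature, the previous center's
    value (fallback) is kept for that feature.
    """
    means = []
    for j, previous in enumerate(fallback):
        values = [row[j] for row in members if row[j] is not None]
        means.append(sum(values) / len(values) if values else previous)
    return means


def kmeans_missing(data, initial_centers, max_iterations=20):
    """K-means loop: allocation by partial distance, then center update."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(max_iterations):
        # Allocation step: assign each case to its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda k: partial_distance(x, centers[k]))
            for x in data
        ]
        # Representation step: recompute centers from observed values.
        updated = [
            feature_means([x for x, lab in zip(data, labels) if lab == k], c)
            for k, c in enumerate(centers)
        ]
        if updated == centers:
            break
        centers = updated
    return labels, centers


# Toy data with missing entries (made up for illustration, not Iris).
data = [[1.0, 1.0], [1.2, None], [None, 0.8],
        [5.0, 5.0], [5.2, None], [None, 4.9]]
labels, centers = kmeans_missing(data, [[0.0, 0.0], [6.0, 6.0]])
```

No case is deleted and no value is imputed in this sketch; incomplete cases simply contribute whatever features they have to both the allocation and the center update.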

List(s) this item appears in: BS Applied Mathematics

Holdings
Item type	Current library	Collection	Call number	Status	Date due	Barcode
Thesis	University Library Theses	Room-Use Only	LG993.5 2005 A64 P84	Not For Loan		3UPML00011327
Thesis	University Library Archives and Records	Preservation Copy	LG993.5 2005 A64 P84	Not For Loan		3UPML00022129
