K-means clustering of data sets with missing values using modified Euclidean distance / Emmylou H. Pulvera
Material type: Text
Current library | Collection | Call number | Status | Date due | Barcode |
---|---|---|---|---|---|
University Library Theses | Room-Use Only | LG993.5 2005 A64 P84 | Not For Loan | | 3UPML00011327 |
University Library Archives and Records | Preservation Copy | LG993.5 2005 A64 P84 | Not For Loan | | 3UPML00022129 |
Thesis (BS Applied Mathematics) -- University of the Philippines Mindanao, 2005
K-means clustering is the most extensively used clustering algorithm in the field of data analysis. One major problem in data analysis is the occurrence of missing values. Mean imputation and case deletion can produce erroneous conclusions by introducing possibly unreliable estimates and significantly reducing the data set, respectively. To avoid these problems, the Euclidean distance function used in the allocation step was modified to compute distances between two vectors with some unknown values. The representation step, which defines each cluster by its center, was also modified to compute the mean of each feature in a cluster even when one or more of the cases were incomplete. This modification is an extension of the K-means clustering algorithm for handling missing values. To evaluate the method, different data sets were simulated from the Iris database to represent different types of missing values with different levels of degradation. The modified algorithm was compared with imputation and case deletion. Results showed that the modified algorithm has higher cluster recovery than the imputation method, while cluster recovery under case deletion was higher than that of the modified K-means. However, the latter held only for the data points left after deletion. Thus, the modified K-means has the advantage of avoiding loss of information.
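The abstract does not state the exact form of the modified distance, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a partial-distance scheme: squared differences are summed over the features observed in a case and rescaled by the ratio of total to observed features, and each cluster center is the per-feature mean of the observed values. The names `partial_distance` and `kmeans_missing`, the rescaling factor, and the requirement that initial centers be complete are all assumptions made for illustration.

```python
import numpy as np


def partial_distance(x, center):
    """Euclidean distance over the observed features of x.

    Assumption: the sum of squares is rescaled by (total / observed)
    so that cases with different numbers of missing values are comparable.
    """
    observed = ~np.isnan(x)
    if not observed.any():
        return np.inf  # a fully missing case carries no distance information
    diff = x[observed] - center[observed]
    return np.sqrt(x.size / observed.sum() * np.sum(diff ** 2))


def kmeans_missing(X, init_centers, n_iter=100):
    """K-means on data with NaN entries; init_centers must be complete."""
    X = np.asarray(X, dtype=float)
    centers = np.array(init_centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Allocation step: assign each case to its nearest center.
        new_labels = np.array(
            [np.argmin([partial_distance(x, c) for c in centers]) for x in X]
        )
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Representation step: per-feature mean over observed values only.
        for k in range(len(centers)):
            members = X[labels == k]
            for j in range(X.shape[1]):
                values = members[:, j]
                values = values[~np.isnan(values)]
                if values.size:  # keep the old coordinate if nothing is observed
                    centers[k, j] = values.mean()
    return labels, centers
```

With this scheme no case is deleted and no value is imputed: an incomplete case still contributes to the allocation through its observed features and to the cluster mean of each feature it does have.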