Gene expression data analysis for identifying crucial gene markers and subtype classification in breast cancer
Authors
Date
2012
Type
Thesis
Fields of Research
Abstract
Breast cancer is one of the leading causes of death in women. Even with advances in early-stage breast cancer treatment, physicians still lack the ability to predict precisely which patients would benefit from adjuvant chemotherapy. Gene expression profiling studies have provided insights into the heterogeneity of breast cancer: genome-wide measures of gene expression can identify patterns of gene activity that subclassify tumours. This might provide a better means of treating patients with breast cancer than is currently available, and help physicians select the appropriate treatment (Wang, Klijn et al. 2005). This study uses a combination of three clustering methods, Hierarchical clustering, Self-organizing maps (SOM) and the Ward method, to further explore and validate the characteristics of 306 previously identified, novel intrinsic genes thought to discriminate five subtypes of breast cancer (LumA, LumB, Normal-like, Basal-like and Her2). The combination is also used to derive improved cluster characterisations for accurate subtype identification from independent gene expression data analyses. Implementation of these methods in widespread clinical practice remains limited at present. An exploratory pilot study in this research found that one or more of the few most highly active genes in one subtype can also be active in one or more other subtypes, indicating that several gene markers are essential for subtype discrimination. Nevertheless, this study identified one or two potential genes for some subtypes that may be useful as markers in their identification. In the main part of the investigation, the originally selected whole gene set was assessed for its efficacy in subtype discrimination using Hierarchical clustering and SOM, both in conjunction with the Ward method, which indicates the optimum number of clusters (subtypes).
The Hierarchical-Ward method found 6 clusters and the SOM-Ward method found 7 clusters as optimum, compared with the 5 subtypes reported by the original authors from whose work the gene set used in this study was extracted. This indicated the heterogeneity of the subtypes. In both methods, the second optimum number of clusters was 2. The clusters revealed interesting results: for example, closer examination of the 2-cluster structure from both Hierarchical-Ward and SOM-Ward indicated that 3 subtypes (LumA, LumB and Normal-like) always cluster together and the other 2 subtypes (Basal-like and Her2) make up the second cluster.
The 6 and 7 optimum clusters from the two respective methods indicated that most clusters contain patients from more than one subtype and revealed which subtypes are more likely to cluster together. These results indicate subtype overlap. Although it did not feature highly in the SOM-Ward results, the 6-cluster format from this method was explored for comparison with the Hierarchical-Ward results, and the outcomes from the two methods were identical. This gives validity to the results of the study. Interestingly, 2 of the 7 clusters from SOM consisted of only one subtype each (LumA or Basal-like), and 1 of the 6 clusters from Hierarchical-Ward also contained only one subtype (LumA); but these clusters did not contain all of the patients originally thought to belong to each particular subtype. However, these clusters containing only one subtype each may indicate the core behaviour of the respective subtype and are worth exploring further. Overall, the results point to the complexity of discriminating the subtypes due to their heterogeneity and overlap, when viewed through the selected set of 306 intrinsic genes; this study has shed light on these characteristics in a reliable and predictable way.
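The comparison of clusters against subtypes described above can be illustrated with a minimal sketch. It is not the software or data used in the thesis: it assumes SciPy's Ward-linkage hierarchical clustering as a stand-in for the Hierarchical-Ward step, uses randomly generated expression values in place of the real 306-gene data set, and the function names cluster_subtype_table and pure_clusters are hypothetical. It cuts the tree at a chosen number of clusters, counts how many patients of each subtype fall into each cluster, and lists the clusters that contain a single subtype.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_subtype_table(expression, subtypes, n_clusters):
    """Cut a Ward hierarchical tree into at most n_clusters clusters and
    count the patients of each subtype in each cluster."""
    # Ward linkage on Euclidean distances between patient expression profiles.
    tree = linkage(expression, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    table = {}
    for cluster, subtype in zip(labels, subtypes):
        counts = table.setdefault(int(cluster), {})
        counts[subtype] = counts.get(subtype, 0) + 1
    return table


def pure_clusters(table):
    """Return {cluster: subtype} for clusters made up of one subtype only."""
    return {c: next(iter(counts)) for c, counts in table.items() if len(counts) == 1}


# Synthetic stand-in data (an assumption, not the thesis data):
# 5 subtypes, 20 patients each, 306 genes.
rng = np.random.default_rng(0)
subtype_names = ["LumA", "LumB", "Normal-like", "Basal-like", "Her2"]
n_genes, patients_per_subtype = 306, 20
centres = rng.normal(0.0, 1.0, size=(len(subtype_names), n_genes))
expression = np.vstack(
    [c + rng.normal(0.0, 1.0, size=(patients_per_subtype, n_genes)) for c in centres]
)
subtypes = [name for name in subtype_names for _ in range(patients_per_subtype)]

for k in (2, 6):
    table = cluster_subtype_table(expression, subtypes, k)
    print(f"{k} clusters:")
    for cluster in sorted(table):
        print(f"  cluster {cluster}: {table[cluster]}")
    print(f"  single-subtype clusters: {pure_clusters(table)}")
```

On real data, a cluster that appears in the single-subtype listing but holds fewer patients than the subtype's total would correspond to the partial, single-subtype clusters reported above.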