Blood, Vol. 106, Issue 2, 681-689, July 15, 2005

Distinctive gene expression pattern in VH3-21 utilizing B-cell chronic lymphocytic leukemia
Supplemental materials for: Fält et al, Vol 106, Issue 2, 681-689
Supplementary information
Data analysis
We have used Affymetrix oligonucleotide arrays U95Av2, complementary to more than 12 500 sequences, to analyze the expression profiles in different sample subgroups (ie, classes) of B-cell chronic lymphocytic leukemia (B-CLL), including Ig-unmutated, Ig-mutated, and VH3-21+ B-CLL.
Quality control
Gene expression studies were performed on total RNA. On average, ~36% (4542 ± 1374) of transcripts were scored as present in the non–VH3-21 samples and ~33% (4145 ± 2050) in the VH3-21+ cases. To assess whether any sample in the study was an outlier that could skew the expression results, a multidimensional scaling plot (Manhattan plot) was constructed in R (www.bioconductor.org) to explore sample quality and the relations among samples. The plots were constructed with quantile-normalized data (using the "affy" package in R); data were also normalized in Affymetrix MicroArray Suite 5.0. No extreme outliers were found, and all samples were included in the analysis.
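The two numerical steps behind such a plot can be sketched as follows. This is a minimal illustration only, not the "affy" implementation: it assumes that "Manhattan" refers to the Manhattan (L1) distance between samples, it breaks ties in arbitrary order, and the function names and toy values are hypothetical. The resulting distance matrix is what a multidimensional scaling routine would take as input.

```python
def quantile_normalize(samples):
    """Give every sample (a list of probe values) the same value distribution."""
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(s[k] for s in sorted_samples) / len(samples)
                 for k in range(n_probes)]
    normalized = []
    for sample in samples:
        order = sorted(range(n_probes), key=lambda i: sample[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]  # ties keep their input order (assumption)
        normalized.append(out)
    return normalized


def manhattan_distance_matrix(samples):
    """Pairwise L1 distances between samples."""
    return [[sum(abs(a - b) for a, b in zip(u, v)) for v in samples]
            for u in samples]


# Toy data: 3 samples x 4 probes (made-up values).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
distances = manhattan_distance_matrix(normalized)
```

A sample whose row of distances is uniformly large relative to the others would be a candidate outlier.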
Marker analysis
Marker analysis was performed using GeneCluster 2.0¹ to identify genes correlated with particular sample class distinctions (comparison 1: VH3-21+ and non–VH3-21 B-CLL; comparison 2: Ig-unmutated, Ig-mutated, and VH3-21+ B-CLL). First, the gene expression data were subjected to a variation filter that excluded genes showing minimal variation across the samples. We used the default settings for the filtering procedure as follows: genes were excluded if they exhibited less than 3-fold (max/min) and 100 units (max – min) absolute variation across the data set after a threshold of 20 units and a ceiling of 16,000 units were applied. The ceiling of 16,000 was chosen because that is the level at which saturation of the scanner is observed; values above 16,000 cannot be measured reliably. The threshold of 20 units was set so as to avoid missing any potentially informative marker genes. The data set was normalized by standardizing each row (probe) to mean = 0 and variance = 1. To compare neighbors in the marker analysis, a class template assigning each sample to its class was supplied. The number of markers to be considered for correlation is chosen by the user; 50 or 1000 markers for each class were used for this analysis. For gene ranking, we used the signal-to-noise (S2N) method, which takes the difference of the class means scaled by the sum of the class standard deviations: (µ1 − µ2)/(σ1 + σ2), where µ1 and σ1 are the mean and standard deviation of class 1, and µ2 and σ2 are those of class 2. The marker gene most correlated with a single class receives the best S2N score. To decide how many of the marker genes should be considered for further study, we performed permutation testing: 500 random permutations of the class labels were generated, and the S2N ratio was recalculated for each gene for each class label permutation. A gene was considered a statistically significant sample-class–specific marker if the observed S2N exceeded the permuted S2N at least 99% or 95% of the time (P ≤ .01 and P ≤ .05, respectively).²
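The filtering, standardization, ranking, and permutation steps can be sketched in a few lines. This is an illustration under stated assumptions rather than the GeneCluster implementation: a gene is kept only if it passes both the fold and the absolute-difference test, standard deviations use the n − 1 denominator, the S2N score is the difference of class means over the sum of class standard deviations as the prose describes, and the P value is computed per gene. All function names and toy values are hypothetical.

```python
import random
from statistics import mean, stdev


def variation_filter(row, floor=20, ceiling=16000, min_fold=3, min_delta=100):
    """Clip to [floor, ceiling]; return the clipped row if it varies enough, else None."""
    clipped = [min(max(v, floor), ceiling) for v in row]
    hi, lo = max(clipped), min(clipped)
    # Assumption: a gene must pass BOTH tests to be kept.
    if hi / lo >= min_fold and hi - lo >= min_delta:
        return clipped
    return None


def standardize(row):
    """Rescale one probe to mean 0 and (sample) variance 1."""
    m, s = mean(row), stdev(row)
    return [(v - m) / s for v in row]


def s2n(row, labels):
    """Signal-to-noise score: (mean1 - mean0) / (sd1 + sd0)."""
    a = [v for v, c in zip(row, labels) if c == 1]
    b = [v for v, c in zip(row, labels) if c == 0]
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def permutation_p(row, labels, n_perm=500, seed=0):
    """Fraction of label permutations whose S2N reaches the observed S2N."""
    rng = random.Random(seed)
    observed = s2n(row, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if s2n(row, shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy data: 8 samples per class (made-up values).
labels = [1] * 8 + [0] * 8
marker = [900, 1100, 1000, 1200, 950, 1050, 1150, 980,
          100, 150, 120, 90, 130, 110, 140, 95]
flat = [100, 105, 98, 102, 101, 99, 103, 97,
        100, 104, 96, 102, 99, 101, 100, 98]

kept = variation_filter(marker)          # passes the filter
dropped = variation_filter(flat)         # None: too little variation
observed, p_value = permutation_p(standardize(kept), labels)
```

Because the S2N score is unchanged by shifting or rescaling a row, row standardization affects only the comparability of rows, not the ranking within a row.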
Supervised learning and weighted voting
GeneCluster 2.0 was used to build a gene model, that is, a set of genes that can be used together to distinguish between samples belonging to different sample classes. Preprocessing and filtering settings were the same as for the marker analysis described above. Feature sets ranging from 1 to 100 genes were tested to build a class discriminator, and the signal-to-noise statistic was used for feature selection. The "weighted voting" classifier algorithm was used to build the class discriminator. In weighted voting classification, each "informative gene" casts a vote for class A or class B, depending on whether the gene’s expression level in the sample is closer to µA or µB. The votes for each class are summed to obtain total votes for class A and class B, and the sample is assigned to the class with the higher vote total. To test the classifier, we used "leave-one-out" cross-validation: 1 of the 39 samples was withheld, and the remaining 38 samples were used to train a model and predict the class of the withheld sample. This process was repeated for all 39 samples. Fisher tests were performed to evaluate the significance of each class discriminator.
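A compact sketch of weighted voting with leave-one-out cross-validation is given below. It follows the scheme published by Golub et al (reference 2), in which each gene's vote is its S2N weight multiplied by the distance of the sample's expression from the midpoint of the two class means; details of the GeneCluster implementation may differ. Function names and the toy expression values are hypothetical.

```python
from statistics import mean, stdev


def gene_stats(row, labels):
    """Return (S2N weight, decision boundary) for one gene."""
    a = [v for v, c in zip(row, labels) if c == "A"]
    b = [v for v, c in zip(row, labels) if c == "B"]
    weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
    boundary = (mean(a) + mean(b)) / 2  # midpoint between the class means
    return weight, boundary


def train(rows, labels, n_genes):
    """Keep the n_genes genes with the largest absolute S2N."""
    stats = [(g, *gene_stats(row, labels)) for g, row in enumerate(rows)]
    stats.sort(key=lambda t: abs(t[1]), reverse=True)
    return stats[:n_genes]


def predict(model, sample):
    """Sum the weighted votes; the class with the larger total wins."""
    votes_a = votes_b = 0.0
    for gene, weight, boundary in model:
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            votes_a += vote
        else:
            votes_b -= vote
    return "A" if votes_a > votes_b else "B"


def leave_one_out(rows, labels, n_genes):
    """Withhold each sample in turn, retrain (including gene selection), predict."""
    predictions = []
    for i in range(len(labels)):
        train_rows = [row[:i] + row[i + 1:] for row in rows]
        train_labels = labels[:i] + labels[i + 1:]
        model = train(train_rows, train_labels, n_genes)
        predictions.append(predict(model, [row[i] for row in rows]))
    return predictions


# Toy data: 3 genes x 8 samples (made-up values).
labels = ["A"] * 4 + ["B"] * 4
rows = [
    [9.0, 8.5, 9.5, 8.8, 2.0, 2.5, 1.8, 2.2],  # high in class A
    [1.0, 1.5, 0.8, 1.2, 7.0, 6.5, 7.5, 6.8],  # high in class B
    [5.0, 4.0, 6.0, 5.5, 5.2, 4.8, 5.9, 4.1],  # uninformative
]
predictions = leave_one_out(rows, labels, n_genes=2)
```

Repeating gene selection inside each fold keeps the withheld sample from influencing which genes are chosen.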
Supervised learning: linear discriminant analysis (LDA)
LDA is an established approach for classifying samples of unknown class, based on training samples of known class. The method uses the grouping information to find the linear combination of variables that maximizes the ratio of between-group variance to within-group variance. The linear weights used by LDA depend on how well a gene separates the two groups and on how that gene correlates with the other genes.³ For comparison analysis and LDA, we used the DNA-Chip Analyzer (dChip) software.⁴ First we selected the Baseline (B) group and the Experiment (E) group; then we calculated the mean expression level of the samples in each group. When a group contains multiple samples, the group mean and standard error (SE) are computed by pooling arrays while taking measurement accuracy into account, and samples in the same group are regarded as "replicates." The first filtering criterion requires that the fold change between the group means exceed a specified threshold; a fold change above 1.2 was used for both B/E and E/B. The "Use lower 90% confidence bound" option was also selected, so that the lower confidence bound of the fold change, calculated from the group means and their SEs, was used. The second filtering criterion is the absolute difference between group means, B-E and E-B; here, absolute differences greater than 100 were used.
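The dependence of the LDA weights on both group separation and between-gene correlation can be seen in a two-gene sketch. This is a textbook Fisher discriminant, not the dChip implementation: the weights are the inverse of the pooled within-group scatter matrix applied to the difference of the group means, and the cutoff is placed at the midpoint of the two group means (an assumption). Function names and the toy data are hypothetical.

```python
from statistics import mean


def pooled_within_scatter(group_a, group_b):
    """Entries (sxx, sxy, syy) of the 2x2 pooled within-group scatter matrix."""
    sxx = sxy = syy = 0.0
    for group in (group_a, group_b):
        mx = mean(p[0] for p in group)
        my = mean(p[1] for p in group)
        for x, y in group:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)  # captures the gene-gene correlation
            syy += (y - my) ** 2
    return sxx, sxy, syy


def fisher_lda(group_a, group_b):
    """Return ((wx, wy), threshold) for a two-gene, two-group discriminant."""
    sxx, sxy, syy = pooled_within_scatter(group_a, group_b)
    ax, ay = mean(p[0] for p in group_a), mean(p[1] for p in group_a)
    bx, by = mean(p[0] for p in group_b), mean(p[1] for p in group_b)
    dx, dy = ax - bx, ay - by
    det = sxx * syy - sxy ** 2
    # w = Sw^-1 (mean_a - mean_b), using the closed-form 2x2 inverse.
    wx = (syy * dx - sxy * dy) / det
    wy = (sxx * dy - sxy * dx) / det
    # Assumption: cutoff at the projection of the midpoint between group means.
    threshold = wx * (ax + bx) / 2 + wy * (ay + by) / 2
    return (wx, wy), threshold


def classify(weights, threshold, point):
    score = weights[0] * point[0] + weights[1] * point[1]
    return "A" if score > threshold else "B"


# Toy data: (gene 1, gene 2) expression per sample (made-up values).
group_a = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.5), (3.5, 4.0)]
group_b = [(6.0, 1.0), (7.0, 2.0), (6.5, 0.5), (7.5, 1.5)]
weights, threshold = fisher_lda(group_a, group_b)
calls_a = [classify(weights, threshold, p) for p in group_a]
calls_b = [classify(weights, threshold, p) for p in group_b]
```

With more than two genes the same formula applies, but the scatter matrix must be inverted numerically and can become singular when genes outnumber samples, which is one reason genes are filtered first.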
References
1. Tamayo P, Slonim D, Mesirov J, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A. 1999;96:2907-2912.
2. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531-537.
3. Balakrishnama S, Ganapathiraju A. Linear Discriminant Analysis: A brief tutorial. http://www.isip.msstate.edu/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theorypdf.
4. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A. 2001;98:31-36.