Blood online
Blood, Vol. 109, Issue 11, 4952-4963, June 1, 2007

The gene expression profile of nodal peripheral T-cell lymphoma demonstrates a molecular link between angioimmunoblastic T-cell lymphoma (AITL) and follicular helper T (TFH) cells
Blood de Leval et al. 109: 4952

Supplemental materials for: de Leval et al, Vol 109, Issue 11, 4952-4963

Files in this Data Supplement:

  • Figure S1. Unsupervised clustering of 16 PTCL-u and 17 AITL tissue samples based on the top 2.5% most varying probe sets (JPG, 111 KB) -
    (A) Classification of the 33 tumor tissue samples using unsupervised gene selection of the top 2.5% most varying probe sets. Probe set selection was based on the following 2 criteria: (1) a P value from a variance test (described below) of less than .01, and (2) a “robust” coefficient of variation (rCV; the standard deviation divided by the mean, calculated after eliminating the highest and lowest expression values across the samples for each probe set) less than 10 and greater than a given rCV percentile. The 97.5th rCV percentile was used as the threshold, yielding the top 2.5% (644) most varying probe sets (Table S1). Hierarchical clustering of the 33 samples was performed based on these 644 probe sets using complete linkage and 1 − Pearson correlation as the distance metric (package cluster V1.9.3).
    Variance test: For each probe set (P), we tested whether its variance across samples was different from the median of the variances of the 50 406 probe sets. The statistic used was (n − 1) × Var(P)/Var_med, where n refers to the number of samples and Var_med to the median variance. This statistic was compared to a percentile of the Chi-square distribution with (n − 1) degrees of freedom (this criterion is used in the BRB ArrayTools filtering tool, described in the User’s Manual) and yielded a P value for each probe set.
    Shown are the sample and gene dendrograms. For CD30, + denotes expression in > 50% of neoplastic cells; – denotes absence of expression or a low level of expression of CD30 in neoplastic cells. The percentage of tumor cells is represented by a gray scale (light gray, 30%-50% tumor cells; medium gray, 50%-70% tumor cells; dark gray, > 70% tumor cells). The samples were divided into 3 groups that were related to AITL, CD30+ PTCL-u, or CD30– PTCL-u characterization: C1′, composed mainly of AITLs (P < .0001); C2′, composed of CD30– PTCL-u tumors (P = .005); and C3′, composed mainly of CD30+ PTCL-u tumors (P = .002). These 3 groups were robust in the face of data perturbation (stability score of 0.91 using Gaussian noise and 0.88 using permutations). The gene dendrogram was cut into 6 clusters of genes (a, b, c, d, e, and f), which were subsequently analyzed for GO term enrichment (Table S1). A few noteworthy genes are given to the right of each gene cluster. A gene cluster comprising 115 individual genes (cluster b) was generally overexpressed in AITL versus both groups of PTCL-u samples, and a gene cluster of 56 genes (cluster e) was overexpressed in CD30+ PTCL-u samples compared to the other 2 groups of samples.
    (B) To assess the intrinsic robustness of the obtained partition, we calculated a reproducibility score based on partitions obtained after data perturbation (addition of random Gaussian noise [µ = 0, σ = 1.5 × median variance calculated from the data set] to the data matrix) or resampling (random substitution of 5% of the samples by virtual samples, generated by random crossing/combinations of existing profiles). To compare 2 dendrograms, both were cut into k clusters (k = 2…18), yielding a “partition” for each dendrogram. The 2 partitions for each k were compared, and a reproducibility score was obtained by calculating the proportion of pairs of samples grouped in the same cluster that were retained in both partitions. The overall stability score of the classification was assessed by calculating a mean reproducibility score (for different values of k), using all pairs of dendrograms. Shown here are the stability scores calculated for a given partition (k = 2…7) using either the introduction of Gaussian noise (left) or resampling (right). Averaged scores after 100 iterations are shown.

  • Table S1. Top 2.5% most varying probe sets (XLS, 196 KB) -
    The first sheet (7 sheets total) in the Excel file is the list of 644 probe sets used to generate the sample dendrogram and the heatmap shown in Figure S1 (the genes are listed in the order in which they appear in the figure). For each gene, we provide the associated annotations (gene symbol, gene title, Unigene and Entrez Gene identifiers) as well as the cluster group delineated in the figure for the 6 gene cluster groups (a-f). Each subsequent sheet corresponds to the summary of the results obtained from the GO analyses that were performed for each cluster group of genes. Shown are the GO terms that were found to be significant (P < .01) and represented by a minimum of 3 genes in the respective gene lists. We used the December 2005 mapping of probe set identifiers to Entrez Gene identifiers provided by Affymetrix (http://www.affymetrix.com/support/technical) in combination with the hgu133pl2 metadata library (January 12, 2005; http://www.bioconductor.org) to map Entrez Gene identifiers to GO terms. For the summary of the GO analysis, we have supplied the GO identifiers, the total number of genes represented in the given gene list (intCounts), the gene names and probe set identifiers in the given list for that GO term, the number of genes (goCounts) on the HG-U133plus2.0 Affymetrix array that were attributed to that GO term, and the P value resulting from the hypergeometric test. For each GO term, we also provide its corresponding GO category (gocat), which is coded as follows: biologic process (BP), molecular function (MF), and cellular component (CC).

  • Table S2. Comparison of the gene expression profile of the 33 tumor tissue samples (17 AITLs and 16 PTCLs-u) according to the AITL versus PTCL-u distinction (XLS, 256 KB) -
    The first sheet (4 sheets total) in the Excel file is the list of 832 probe sets that were found to be significantly differentially expressed (P < .002; max FDR = 10.6%) between AITL and PTCL-u with associated annotations (gene symbol and gene title), the geometric mean value for each group, the fold change in the AITL group relative to the PTCL-u group, and the P value derived from the t test. For the 545 probe sets overexpressed in AITL that had an AITL/PTCL-u fold change (FC) greater than 1.0, we have appended the geometric mean of the 2 sorted AITL tumor cell samples and the AITL cell/tissue FC. Each subsequent sheet corresponds to the summary of the results obtained from the GO and pathway analyses for all of the genes, and the genes stratified by relative expression in AITL compared to PTCL-u (“up” or “down”) and the specific gene lists. Shown are the GO terms and KEGG pathways that were found to be significant (P < .01) and represented by a minimum of 3 genes in the respective gene lists. For the summary of the GO analysis, we have supplied the same information described in the legend for Table S1. For the pathway analysis, we show the KEGG pathway meeting the above criteria, the number of genes in the list for that pathway, and the P value derived from the Fisher exact test that has been adjusted with a Bonferroni multiple testing correction.

  • Table S3. Gene-set enrichment analysis (GSEA) results (XLS, 25.5 KB) -
    Part A of this table summarizes the GSEA results obtained from the 14 different gene sets (GSs) presented in Table 1, considering all samples, and part B summarizes the GSEA results obtained from GS1-GS5 after exclusion of PTCL samples S24, S25, S26, and S28 from the analysis. The following comparisons were analyzed: AITL versus PTCL-u, CD30+ PTCL-u versus CD30– PTCL-u, and AITL versus CD30– PTCL-u. For the latter comparison, only 10 AITLs were included in the test in condition B. Two different statistics were used to rank the genes: signal-to-noise (SNR) or a classical t test. Shown are the number of genes (size), the raw and normalized enrichment scores (ES and NES, respectively), the P values obtained using the SNR statistic to rank the genes, and the group of samples to which the given signature was attributed. Gene sets that were significant using the SNR for a given group were also found to yield roughly equivalent P values using the t test statistic to rank the genes (data not shown). Yellow indicates gene sets that yielded a P value < .05.

  • Table S4. Comparison of the gene-expression profile of 15 PTCL-u samples (6 CD30+ and 9 CD30–) according to the CD30+ versus CD30– distinction (XLS, 71.5 KB) -
    The first sheet (7 sheets total) in the Excel file is the list of 241 probe sets that were found to be significantly differentially expressed (P < .002; max FDR = 24.8%) between CD30– and CD30+ PTCLs-u with associated annotations (gene symbol and gene title), the geometric mean value for each group, the fold change in the CD30– group relative to the CD30+ PTCL-u group, and the P value derived from the t test. Each subsequent sheet corresponds to the summary of the results obtained from the GO and pathway analyses for all of the genes, and the genes stratified by relative expression in CD30– compared to CD30+ PTCL-u (“up” or “down”) and the specific gene lists.

  • Table S5. Prediction analysis based on a training group (XLS, 24.5 KB) -
    Part A shows the genes in the top 4 predictors. For the initial gene selection, univariate t tests using BRB ArrayTools (v3.5 beta1) on the S1 group of samples yielded 284 significant probe sets (P < .001; a random variance model was used, and a geometric mean intensity value higher than 10 was required in either the 8 AITL or the 8 PTCL-u samples). For building the multigene (1-10 genes) predictors and classification of the S2 group, R packages were used and are indicated below. For the bottom-up step approach, each of the 284 genes was individually used as a predictor, and the success of classification of the S1 group was calculated. The gene yielding the highest success rate of correct classification was chosen (if more than one gene yielded the same result, the first in the list was chosen). All 2-gene combinations including the chosen gene and each of the remaining genes were then used to construct a 2-gene predictor. The 2-gene predictor yielding the highest success rate was chosen, and so on until the best 10-gene predictor was identified (a total of 2795 gene combinations were tested on the S1 group). Any of the resulting 2- to 10-gene predictors having a success rate lower than 80% were eliminated. The remaining predictors were then applied to the S2 group. Shown are the success rate of S2 prediction, prediction algorithm for this classifier (pred-algorithm), probe set identifier, gene symbol, gene title, geometric means for both groups of samples (Geom mean), ratio or fold change of geometric means (FC AITL/PTCL-u), parametric P value, and false discovery rate based on 1000 permutations (FDR). Part B shows the results of the classification of the S2 group by each classifier, where diagnosis represents true class membership. Cases misclassified by the predictors are highlighted in light blue.

  • Table S6. Prediction analysis based on leave-one-out cross validation (LOOCV) approach (XLS, 68.5 KB) -
    The first sheet (2 sheets total) is the list of 220 probe sets that were used for the class prediction. Provided are the Probe Set ID, gene symbol, gene title, rank t value (ranking of the probe sets based on the descending order of the t statistic), parametric P value, t value (t statistic), percent cross-validation support (% CV support, which is the percentage of samples for which the gene remains significant even if the sample is removed from the analysis, and therefore characterizes the number of times this gene is used for classification in the leave-one-out cross-validation procedure), geometric means for both groups of samples (Geom mean), and ratio of geometric means. The second sheet is a table showing the results from the LOOCV analysis. “YES” and “NO” indicate a sample that was correctly classified and misclassified, respectively, by the indicated algorithm. All methods for this approach are detailed at ftp://linus.nci.nih.gov/pub/techreport/Manual_v3_5_beta1.pdf. Briefly, all of the samples minus 1 are used to select the genes (P < .001; a random variance model was used, with fold change > 2.0 or < 0.5), build a predictor using a prediction algorithm, and classify the remaining sample. The procedure is repeated until all samples have been left out and classified by the predictor (ie, 33 times). The average success rate for each prediction algorithm (compound covariate CC, diagonal linear discriminant analysis DLDA, 1-nearest neighbor KNN, 3-nearest neighbors KNN3, nearest centroid NC, support vector machines SVMs, or Bayesian compound covariate BCC) is presented as the overall success rate based on the indicated number of genes and includes the results of the classification for each of the 33 predictors.
    For the BCC, after selecting the differentially expressed genes for distinguishing 2 classes in a cross-validated training set, the compound covariate is computed, which is the weighted average of the log expression values of the selected genes, with the weights being the t statistics of differential expression in that training set. After computing the posterior probability of the class of an omitted observation, we assign the observation to the class with the larger posterior probability. Equal class prevalence is used in the Bayesian compound covariate predictor. Based on 100 random permutations, the CC predictor has a P value < .01, the DLDA classifier has a P value < .01, the KNN classifier has a P value < .01, the KNN3 classifier has a P value < .01, the NC classifier has a P value of .02, the SVM classifier has a P value < .01, and the BCC classifier has a P value < .01. The predicted probability of each sample belonging to the class AITL during cross-validation from the BCC is given to the right of the prediction results.
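The probe set filter described in the legend for Figure S1 combines a variance test with the robust coefficient of variation. The following is a minimal Python sketch of those two quantities, not the BRB ArrayTools implementation. It assumes an upper-tail Chi-square test, the sample (n − 1) standard deviation, and plain lists of expression values; all function names are hypothetical.

```python
import math
from statistics import mean, median, stdev, variance


def chi2_sf(x, df):
    """Upper-tail probability of a Chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return total


def robust_cv(values):
    """SD / mean after dropping the single highest and lowest value."""
    trimmed = sorted(values)[1:-1]
    return stdev(trimmed) / mean(trimmed)


def variance_test_pvalues(matrix):
    """matrix: one list of expression values per probe set (same n each)."""
    n = len(matrix[0])
    variances = [variance(row) for row in matrix]
    var_med = median(variances)
    # Statistic from the legend: (n - 1) * Var(P) / Var_med, df = n - 1.
    # Assumption: the P value is the upper tail (more variable than median).
    return [chi2_sf((n - 1) * v / var_med, n - 1) for v in variances]
```

A probe set would then be retained when its P value is below .01 and its rCV lies between the chosen percentile threshold and 10.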

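The reproducibility score in Figure S1B compares two partitions of the same samples. The wording of the legend admits more than one reading; the sketch below assumes the score is the fraction of sample pairs co-clustered in a reference partition that remain co-clustered in the perturbed partition, and that the stability score is the mean over k and over perturbed runs. Function names and the input layout are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of sample-index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def reproducibility_score(reference, perturbed):
    """Fraction of pairs co-clustered in `reference` that stay together
    in `perturbed` (one possible reading of the legend)."""
    ref_pairs = co_clustered_pairs(reference)
    if not ref_pairs:
        return 1.0  # every sample is a singleton; nothing to preserve
    kept = ref_pairs & co_clustered_pairs(perturbed)
    return len(kept) / len(ref_pairs)


def stability_score(reference_by_k, perturbed_runs_by_k):
    """Mean reproducibility over all k and all perturbed runs.

    reference_by_k: {k: cluster labels of the original dendrogram cut at k}
    perturbed_runs_by_k: {k: [labels from each perturbed dendrogram cut at k]}
    """
    scores = [
        reproducibility_score(reference_by_k[k], labels)
        for k, runs in perturbed_runs_by_k.items()
        for labels in runs
    ]
    return sum(scores) / len(scores)
```

Cluster labels only need to be consistent within a partition, so relabeled but identical partitions score 1.0.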

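The GO summaries in Tables S1, S2, and S4 report a P value from a hypergeometric test. As an illustration, the upper-tail probability can be computed directly from binomial coefficients. The names `int_count` and `go_count` mirror the intCounts and goCounts columns; `list_size` and `universe_size` are hypothetical names for the number of annotated genes in the gene list and on the array, and the choice of gene universe is an assumption. Requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_upper_tail(int_count, list_size, go_count, universe_size):
    """P(X >= int_count) when list_size genes are drawn without replacement
    from universe_size genes, go_count of which carry the GO term."""
    max_hits = min(list_size, go_count)
    total = comb(universe_size, list_size)
    hits = sum(
        comb(go_count, x) * comb(universe_size - go_count, list_size - x)
        for x in range(int_count, max_hits + 1)
    )
    return hits / total
```

A term would be reported when this probability is below .01 and `int_count` is at least 3, the two criteria stated in the legends.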

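Table S3 ranks genes by a signal-to-noise statistic before computing enrichment scores. A minimal sketch of that ranking step follows, using the usual definition (difference of class means divided by the sum of class standard deviations). The GSEA software applies additional safeguards for very small standard deviations, which are omitted here; the function names and the dictionary layout are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(group_a, group_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene."""
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))


def rank_genes(expr_a, expr_b):
    """expr_a, expr_b: dicts mapping gene -> list of values in each group.

    Returns gene names sorted from most A-associated to most B-associated.
    """
    scores = {g: signal_to_noise(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)
```

For the AITL versus PTCL-u comparison, `expr_a` would hold the AITL values and `expr_b` the PTCL-u values, so that genes at the top of the list are those most associated with AITL.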

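The bottom-up step approach in the legend for Table S5 is a greedy forward selection: 284 single-gene predictors are tested, then 283 two-gene predictors, and so on down to 275 ten-gene predictors, which gives 284 + 283 + … + 275 = 2795 combinations. The sketch below reproduces that search. The nearest-centroid scoring rule is a stand-in chosen for brevity, not the classifiers used in the study, and scoring on the training samples themselves is an assumption; all names are hypothetical.

```python
def nearest_centroid_accuracy(samples, labels, genes):
    """Accuracy of a nearest-centroid rule, scored on the training samples.

    samples: list of dicts mapping gene -> expression value.
    """
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [s for s, l in zip(samples, labels) if l == c]
        centroids[c] = [sum(m[g] for m in members) / len(members) for g in genes]
    correct = 0
    for s, l in zip(samples, labels):
        dist = {
            c: sum((s[g] - mu) ** 2 for g, mu in zip(genes, centroids[c]))
            for c in classes
        }
        correct += min(dist, key=dist.get) == l
    return correct / len(samples)


def forward_selection(samples, labels, candidates, max_genes=10,
                      score=nearest_centroid_accuracy):
    """Greedy bottom-up search.

    Returns ([(genes, success_rate), ...] for sizes 1..max_genes,
             number of gene combinations tested).
    """
    chosen, remaining, path, tested = [], list(candidates), [], 0
    while remaining and len(chosen) < max_genes:
        best_gene, best_rate = None, -1.0
        for g in remaining:
            tested += 1
            rate = score(samples, labels, chosen + [g])
            if rate > best_rate:  # strict ">" keeps the first gene on ties
                best_gene, best_rate = g, rate
        chosen.append(best_gene)
        remaining.remove(best_gene)
        path.append((list(chosen), best_rate))
    return path, tested
```

The 80% cutoff described in the legend would then be applied as `[p for p in path if p[1] >= 0.8]` before the surviving predictors are evaluated on the S2 group.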

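The leave-one-out procedure in the legend for Table S6 reselects genes and rebuilds the predictor each time a sample is held out. The sketch below illustrates this with a plain compound covariate predictor. It is a simplification under stated assumptions: gene selection uses a pooled-variance t statistic with an arbitrary cutoff in place of the random variance model and fold-change filter, the classification threshold is the midpoint of the two class means of the compound covariate, and all names are hypothetical.

```python
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic (0.0 if there is no variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def train_cc(X, y, t_cutoff):
    """X: list of samples (lists of log expression values); y: 0/1 labels.

    Returns a function that classifies one sample as 0 or 1.
    """
    weights = {}
    for g in range(len(X[0])):
        a = [x[g] for x, l in zip(X, y) if l == 1]
        b = [x[g] for x, l in zip(X, y) if l == 0]
        t = t_statistic(a, b)
        if abs(t) >= t_cutoff:  # simplified gene selection (assumption)
            weights[g] = t

    def cc(x):
        # Compound covariate: t-weighted sum of the selected genes.
        return sum(w * x[g] for g, w in weights.items())

    m1 = mean(cc(x) for x, l in zip(X, y) if l == 1)
    m0 = mean(cc(x) for x, l in zip(X, y) if l == 0)
    threshold = (m1 + m0) / 2
    return lambda x: int(cc(x) > threshold)


def loocv_accuracy(X, y, t_cutoff=3.0):
    """Hold out each sample in turn, retrain, and classify the held-out one."""
    hits = 0
    for i in range(len(X)):
        predict = train_cc(X[:i] + X[i + 1:], y[:i] + y[i + 1:], t_cutoff)
        hits += predict(X[i]) == y[i]
    return hits / len(X)
```

Because gene selection is repeated inside every fold, the held-out sample never influences which genes enter its own predictor, which is the point of the procedure described in the legend.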
  Copyright © 2009 by American Society of Hematology         Online ISSN: 1528-0020