Multivariate Data Analysis of Radiotherapy Data

project
Author

Samya

Published

April 15, 2023

About The Project

This was a joint course project for the Multivariate Analysis course, done alongside Debepsita Mukherjee and Rupayan Mandal and supervised by Dr. Swagata Nandi. In this project we analyze radiotherapy data: average ratings, over the course of treatment, of variables such as the number of symptoms, appetite and sleep for patients undergoing radiotherapy, along with a categorical variable for skin reaction. The analysis includes EDA, normality checking and transformations, MANOVA, LDA, PCA and factor analysis. By studying the correlation structure of the data and applying Kendall’s tau test, we conclude that one of the variables is independent of the others and consequently drop it from the analysis. We then check whether the observations in the different skin-reaction categories come from the same population. Both EDA and MANOVA indicate that they do, so discriminant analysis is not meaningful; this is supported by the APER calculated using leave-one-out cross-validation. From factor analysis we conclude that one variable behaves very differently from the rest, and that among the other four there are two underlying factors: one affecting appetite and food consumed, possibly due to the treatment’s effect on the mouth/digestive system, and another affecting the number of symptoms and sleep.

This blog post mostly consists of plots and outputs with very little explanation. For a complete report with proper explanations, kindly go through our project report, which is available on my GitHub page.

INTRODUCTION

In this project we are interested in checking whether we can identify the different factors influencing aspects such as a patient’s appetite and sleep throughout their treatment. We are also interested in checking whether we can classify patients based on their skin reaction to the treatment.

Radiotherapy and its side effects

Radiation therapy or radiotherapy is a therapy using ionizing radiation, generally provided as part of cancer treatment to control or kill malignant cells. Radiation therapy is in itself painless, although a short-term pain flare-up can be experienced in the days following treatment. Side effects from radiation are usually limited to the area of the patient’s body that is under treatment. Some of the potential side effects are as follows:

1. Nausea and vomiting.

2. (Skin reaction) Epithelial surfaces may sustain damage from radiation therapy. This may include the skin, oral mucosa, etc.

3. If the head and neck area is treated, temporary soreness and ulceration commonly occur in the mouth and throat. If severe, this can affect swallowing.

4. Cognitive decline.


Data Description:

Average ratings over the course of treatment for patients undergoing radiotherapy.

1. Col. 1: (V1) = number of symptoms

2. Col. 2: (V2) = amount of activity (1-5 scale)

3. Col. 3: (V3) = amount of sleep (1-5 scale)

4. Col. 4: (V4) = amount of food consumed (1-3 scale)

5. Col. 5: (V5) = appetite (1-5 scale)

6. Col. 6: (V6) = skin reaction (0, 1, 2 or 3)

As the ratings are averages taken over the course of treatment, we assume that \(V_{1},\dots,V_{5}\) are continuous variables.

In this report, Vi and \(V_{i}\) will be used interchangeably for the following (\(i=1,\dots,6\)):

• V1 : AVERAGE NUMBER OF SYMPTOMS THROUGHOUT THE TREATMENT

• V2: AVERAGE AMOUNT OF ACTIVITY THROUGHOUT THE TREATMENT

• V3: AVERAGE AMOUNT OF SLEEP THROUGHOUT THE TREATMENT

• V4: AVERAGE AMOUNT OF FOOD CONSUMED THROUGHOUT THE TREATMENT

• V5: AVERAGE AMOUNT OF APPETITE THROUGHOUT THE TREATMENT

• V6: SKIN REACTION AFTER THE TREATMENT

Exploratory Data Analysis

Histograms

Histograms (Groupwise)

Boxplots

Bivariate Plots

Trivariate Plots

Here we take all three-variable combinations and plot them. Here too, the groups seem quite well mixed and do not appear to come from different populations. You may try rotating these plots in different directions yourself.

  • Plot of \((X1,X2,X3)\)
  • Plot of \((X4,X2,X3)\)
  • Plot of \((X4,X5,X3)\)
  • Plot of \((X4,X5,X1)\)

OBSERVATIONS:

  • For every variable, the means of the different groups do not differ much.

  • There are no significant outliers in the data.

  • From both the scatter plots and the 3D plots we observe that there is not much separation in the data. In fact, the groups are so similar that they are mostly indistinguishable.

  • Histograms suggest non-normality for some of the variables.

In Search of Normality

Q-Q Plots

Test of (Univariate) Normality:

Shapiro-Wilk Test

  • Testing Normality for \(V1\)

    
        Shapiro-Wilk normality test
    
    data:  D$V1
    W = 0.96989, p-value = 0.02494
  • Testing Normality for \(V2\)

    
        Shapiro-Wilk normality test
    
    data:  D$V2
    W = 0.89111, p-value = 7.643e-07
  • Testing Normality for \(V3\)

    
        Shapiro-Wilk normality test
    
    data:  D$V3
    W = 0.97874, p-value = 0.1171
  • Testing Normality for \(V4\)

    
        Shapiro-Wilk normality test
    
    data:  D$V4
    W = 0.96068, p-value = 0.005362
  • Testing Normality for \(V5\)

    
        Shapiro-Wilk normality test
    
    data:  D$V5
    W = 0.97884, p-value = 0.119
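The tests above were produced with R’s `shapiro.test`. As a sketch of the same check outside R, here is an equivalent Shapiro-Wilk test in Python with scipy, run on hypothetical positively skewed data standing in for a variable like V1 (the gamma draw is an assumption, not the project data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for a positively skewed rating variable such as V1
v1 = rng.gamma(shape=2.0, scale=1.5, size=97)

# Shapiro-Wilk test of univariate normality
w, p = stats.shapiro(v1)
print(f"W = {w:.5f}, p-value = {p:.4g}")

# A small p-value (e.g. < 0.05) leads us to reject normality,
# as happened for V1, V2 and V4 in the output above.
```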

Transformation to convert Non-Normal Variables to Normal

Square Root transformation

As the data are positively skewed (and the variables are positive), we apply the square-root transformation, a standard choice for such cases, and check whether it works using the Shapiro-Wilk test. The results are as follows:

  • \(V1\) (Average number of Symptoms throughout the treatment)

    Shapiro-Wilk normality test

data:  sqrt(D$V1)
W = 0.98719, p-value = 0.473
  • \(V4\) (Average amount of food consumed throughout the treatment)

    Shapiro-Wilk normality test

data:  sqrt(D$V4)
W = 0.966, p-value = 0.01287
  • \(V2\) (Average amount of “activity” throughout the treatment)

    Shapiro-Wilk normality test

data:  sqrt(D$V2)
W = 0.91419, p-value = 9.342e-06

Box-Cox Transformation on \(V2\)

As the square-root transformation does not work, we apply the Box-Cox transformation and check how it performs.
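The Box-Cox step can be sketched in Python with scipy, which picks the transformation parameter \(\lambda\) by maximum likelihood and then lets us re-test normality. The data below is a hypothetical skewed stand-in for V2, not the project data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical strictly positive, skewed stand-in for V2
v2 = rng.gamma(shape=1.5, scale=1.0, size=97) + 0.1

# boxcox requires strictly positive data; lambda is chosen by maximum likelihood
v2_bc, lam = stats.boxcox(v2)
print(f"estimated lambda = {lam:.3f}")

# Re-test normality after the transformation
w, p = stats.shapiro(v2_bc)
print(f"W = {w:.5f}, p-value = {p:.4g}")
```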

Johnson Transformation on \(V2\)

As the Box-Cox transformation does not work, we apply the Johnson transformation and check how it performs.

Testing the normality of V2 after the transformation


    Shapiro-Wilk normality test

data:  X_2
W = 0.91949, p-value = 1.746e-05

Test of Multivariate Normality

So far in our analysis, we have transformed V1 and V4, and we drop V2 for the reasons discussed in the section on dropping the V2 variable below. So now, let the variables be as follows:

1. \(Y_{1}=\sqrt{V_{1}}\)

2. \(Y_{2}=V_{3}\)

3. \(Y_{3}=\sqrt{V_{4}}\)

4. \(Y_{4}=V_{5}\)

Therefore, we now have the multivariate vector \(X=(Y_{1},Y_{2},Y_{3},Y_{4})\), and we are interested in testing the multivariate normality of \(X\). There are several ways, both graphical and formal, to check for multivariate normality. We apply the following:

• Graphical Methods:

1. Contour Plots (for bivariate normality).

2. Perspective Plots (for trivariate normality).

3. Gamma Plots.

• Tests:

1. Royston’s test

2. Henze-Zirkler’s test

3. Characterization of Multivariate Normality and Bonferroni’s correction.
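The idea behind the gamma plot is that, under multivariate normality, the squared Mahalanobis distances behave approximately like a chi-square with \(p\) degrees of freedom, so their ordered values should track the chi-square quantiles. A minimal numpy/scipy sketch on hypothetical multivariate normal data (not the project data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, n = 4, 97
# Hypothetical MVN data standing in for (Y1, Y2, Y3, Y4)
X = rng.multivariate_normal(mean=np.zeros(p), cov=np.eye(p), size=n)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - xbar
# Squared Mahalanobis distance of each observation from the sample mean
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# Chi-square(p) quantiles at the plotting positions (i - 0.5) / n
q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)

# For MVN data the ordered d2 track q closely (correlation near 1);
# plotting sorted d2 against q gives the gamma plot.
print(np.corrcoef(np.sort(d2), q)[0, 1])
```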

$multivariateNormality
     Test        H    p value MVN
1 Royston 11.54976 0.02086054  NO

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling    Y1        0.3769    0.4036    YES   
2 Anderson-Darling    Y2        0.5210    0.1809    YES   
3 Anderson-Darling    Y3        1.2746    0.0025    NO    
4 Anderson-Darling    Y4        0.4418    0.2832    YES   

$Descriptives
    n     Mean   Std.Dev   Median      Min      Max     25th     75th
Y1 97 1.783927 0.6255731 1.855802 0.000000 3.234347 1.386723 2.276401
Y2 97 2.143608 0.7575019 2.188000 0.666000 4.000000 1.571000 2.714000
Y3 97 1.482043 0.1119985 1.457738 1.134019 1.713768 1.414214 1.563330
Y4 97 2.581320 0.9311169 2.500000 1.000000 5.000000 1.917000 3.272000
          Skew    Kurtosis
Y1 -0.37431107 -0.02862228
Y2  0.06609072 -0.73988814
Y3  0.12535763 -0.15684912
Y4  0.22779079 -0.63983977

$multivariateOutliers
   Observation Mahalanobis Distance Outlier
12          12               19.125    TRUE
67          67               14.054    TRUE
37          37               13.576    TRUE
76          76               11.816    TRUE
14          14               11.655    TRUE
43          43               11.586    TRUE
55          55               11.534    TRUE

Checking Multivariate Normality (Groupwise): Group 0

$multivariateNormality
     Test        H   p value MVN
1 Royston 5.391579 0.2610785 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling    V1        0.4359    0.2672    YES   
2 Anderson-Darling    V3        0.2726    0.6282    YES   
3 Anderson-Darling    V4        0.7416    0.0441    NO    
4 Anderson-Darling    V5        0.2254    0.7900    YES   

$Descriptives
    n     Mean   Std.Dev Median   Min    Max  25th   75th       Skew
V1 19 3.638211 2.6657918  3.714 0.000 10.461 1.553 5.1835 0.48003593
V3 19 1.937684 0.9031670  1.941 0.666  4.000 1.091 2.5715 0.40585598
V4 19 2.136474 0.2908127  2.000 1.667  2.727 2.000 2.3305 0.56750134
V5 19 2.545158 0.9224974  2.444 1.000  4.091 2.000 3.3245 0.03166006
      Kurtosis
V1  0.04125762
V3 -0.74223631
V4 -0.74205578
V5 -1.11051003

$multivariateOutliers
   Observation Mahalanobis Distance Outlier
43          43              151.416    TRUE
76          76              141.502    TRUE
17          17               62.964    TRUE
33          33               60.030    TRUE
74          74               59.116    TRUE
36          36               37.902    TRUE

Checking Multivariate Normality (Groupwise): Group 1

$multivariateNormality
     Test        H   p value MVN
1 Royston 7.507276 0.1156166 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling    V1        0.4697    0.2362    YES   
2 Anderson-Darling    V3        0.4960    0.2031    YES   
3 Anderson-Darling    V4        0.8551    0.0257    NO    
4 Anderson-Darling    V5        0.2234    0.8148    YES   

$Descriptives
    n     Mean   Std.Dev Median   Min   Max  25th  75th      Skew   Kurtosis
V1 45 3.363600 1.9258955  3.333 0.222 7.571 1.750 4.571 0.3569522 -0.8447893
V3 45 2.161333 0.7349420  2.000 0.950 3.689 1.636 2.706 0.1487980 -1.0646483
V4 45 2.217067 0.3282640  2.154 1.611 2.937 2.000 2.500 0.4413692 -0.8256445
V5 45 2.602244 0.9406222  2.539 1.000 5.000 1.917 3.182 0.1836190 -0.4312384

$multivariateOutliers
   Observation Mahalanobis Distance Outlier
7            7               24.488    TRUE
81          81               21.385    TRUE
67          67               20.020    TRUE
86          86               16.591    TRUE
5            5               13.563    TRUE
58          58               12.898    TRUE
71          71               12.190    TRUE

Checking Multivariate Normality (Groupwise): Group 2

$multivariateNormality
     Test        H    p value MVN
1 Royston 8.987751 0.06570336 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling    V1        0.6174    0.0930    YES   
2 Anderson-Darling    V3        0.6189    0.0922    YES   
3 Anderson-Darling    V4        0.7714    0.0373    NO    
4 Anderson-Darling    V5        0.3093    0.5274    YES   

$Descriptives
    n    Mean   Std.Dev Median   Min   Max    25th    75th       Skew
V1 20 3.57180 1.8605152  2.909 0.800 7.231 2.41625 5.23650  0.4093321
V3 20 2.12470 0.6744109  2.225 0.999 3.154 1.83175 2.51025 -0.4160068
V4 20 2.27230 0.3131959  2.267 1.800 2.889 2.03375 2.34600  0.6508381
V5 20 2.64175 0.9236534  2.506 1.000 4.384 2.04650 3.34275  0.2544680
     Kurtosis
V1 -1.0930361
V3 -1.0324954
V4 -0.7429230
V5 -0.9765935

$multivariateOutliers
   Observation Mahalanobis Distance Outlier
46          46               77.388    TRUE
55          55               60.778    TRUE
54          54               35.592    TRUE
78          78               24.780    TRUE

Checking Multivariate Normality (Groupwise): Group 3

$multivariateNormality
     Test        H   p value MVN
1 Royston 2.079578 0.6621113 YES

$univariateNormality
              Test  Variable Statistic   p value Normality
1 Anderson-Darling    V1        0.4793    0.1935    YES   
2 Anderson-Darling    V3        0.2026    0.8433    YES   
3 Anderson-Darling    V4        0.3140    0.5057    YES   
4 Anderson-Darling    V5        0.3467    0.4226    YES   

$Descriptives
    n     Mean   Std.Dev Median   Min   Max  25th  75th       Skew   Kurtosis
V1 13 4.179769 2.5928553  3.667 0.750 9.454 2.444 5.250  0.6451586 -0.8688624
V3 13 2.412308 0.7204971  2.375 1.223 3.818 2.111 2.812  0.0388006 -0.8344643
V4 13 2.188692 0.4466342  2.125 1.286 2.929 1.933 2.455 -0.1631988 -0.7828205
V5 13 2.468769 1.0202204  2.125 1.252 4.428 1.776 3.142  0.5062718 -1.1342029

$multivariateOutliers
   Observation Mahalanobis Distance Outlier
45          45              159.076    TRUE
59          59              133.498    TRUE
12          12              110.191    TRUE
90          90               80.259    TRUE

Dropping V2 Variable

In our study of MANOVA and discriminant analysis, we have decided to drop the \(V_{2}\) variable, which corresponds to the average amount of “activity” of a patient throughout their treatment. We have already noted that the meaning of activity is ambiguous. What exactly is it measuring? We do not know.

We had assumed that activity means physical activity, measured on a 1-5 scale “by some means”. However, that interpretation does not hold up, for the following reasons.

• If by activity we mean physical activity, it should have a strong association with the average amount of sleep. Let us calculate their correlation:

[1] 0.1839977

This value is surprisingly low.

• If by activity we mean physical activity, it certainly should not be independent of the average amount of sleep. We apply Kendall’s \(\tau\) test. Being non-parametric, it does not require the assumption of normality; it does, however, require the variables to be continuous, which we have already assumed.

Kendall’s tau test:


    Kendall's rank correlation tau

data:  D$V2 and D$V3
z = 2.01, p-value = 0.9778
alternative hypothesis: true tau is less than 0
sample estimates:
      tau 
0.1409736 

We observe that the test gives no evidence against the hypothesis of independence.
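Kendall’s tau test can be reproduced outside R with scipy, including the one-sided alternative used in the output above. The two series below are hypothetical, weakly related stand-ins for V2 and V3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical stand-ins for activity (V2) and sleep (V3), weakly related
activity = rng.uniform(1, 5, 97)
sleep = 0.1 * activity + 0.9 * rng.uniform(1, 5, 97)

# One-sided test, as in the R output (alternative: true tau is less than 0)
tau, p = stats.kendalltau(activity, sleep, alternative="less")
print(f"tau = {tau:.4f}, p-value = {p:.4f}")

# A large p-value means no evidence that tau < 0; together with a small
# sample estimate of tau, there is no sign of a monotone association.
```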

• We have already seen that this variable deviates strongly from normality. The square-root, Box-Cox and Johnson transformations were applied, but none worked. Hence we CANNOT apply Box’s M test, MANOVA or linear discriminant analysis to it. In principle we could try Fisher’s linear discriminants, as they do not require normality; however, we would still need to check the assumption of homoscedasticity using Box’s M test, which in turn requires normality.

• We will later see that factor analysis (discussed in part X) suggests that a single factor influences only this variable, independent of another factor affecting all the others. We shall discuss this in detail in part X.

Testing Equality of Population Covariance Matrices and the Means

Box’s M Test


    Box's M-test for Homogeneity of Covariance Matrices

data:  cbind(D$V1, D$V3, D$V4, D$V5)
Chi-Sq (approx.) = 31.842, df = 30, p-value = 0.3749
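Box’s M statistic compares each group covariance matrix with the pooled one via log-determinants, and a correction factor gives the chi-square approximation reported above. A numpy sketch on hypothetical data with the same group sizes and dimension as ours (so the degrees of freedom, \(p(p+1)(g-1)/2 = 30\), match):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
p = 4
# Hypothetical groups with the same sizes as skin-reaction groups 0-3
groups = [rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
          for n in (19, 45, 20, 13)]

ns = np.array([len(x) for x in groups])
g = len(groups)
Ss = [np.cov(x, rowvar=False) for x in groups]
Sp = sum((n - 1) * S for n, S in zip(ns, Ss)) / (ns.sum() - g)  # pooled cov

# Box's M: pooled log-determinant versus the groupwise ones
M = (ns.sum() - g) * np.log(np.linalg.det(Sp)) \
    - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, Ss))

# Box's correction factor and the chi-square approximation
c = (sum(1 / (n - 1) for n in ns) - 1 / (ns.sum() - g)) \
    * (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
chi2_stat = (1 - c) * M
df = p * (p + 1) * (g - 1) // 2
pval = stats.chi2.sf(chi2_stat, df)
print(f"Chi-Sq (approx.) = {chi2_stat:.3f}, df = {df}, p-value = {pval:.4f}")
```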

MANOVA

          Df  Pillai approx F num Df den Df Pr(>F)
V6         3 0.10225  0.81155     12    276 0.6385
Residuals 93                                      
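The Pillai statistic in the MANOVA table is \(\mathrm{tr}\,[H(H+E)^{-1}]\), where \(H\) and \(E\) are the between-group and within-group sums of squares and cross-products matrices. A minimal numpy sketch on hypothetical data with our group sizes; because the hypothetical group means are equal, Pillai comes out small, as in our table:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
sizes = (19, 45, 20, 13)
# Hypothetical groups with identical population means
groups = [rng.multivariate_normal(np.zeros(p), np.eye(p), size=n) for n in sizes]
X = np.vstack(groups)
grand = X.mean(axis=0)

# Between-group (H) and within-group (E) SSCP matrices
H = sum(len(x) * np.outer(x.mean(0) - grand, x.mean(0) - grand) for x in groups)
E = sum((x - x.mean(0)).T @ (x - x.mean(0)) for x in groups)

# Pillai's trace: small values indicate little separation between group means
pillai = np.trace(H @ np.linalg.inv(H + E))
print(f"Pillai = {pillai:.5f}")
```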

Linear Discriminant Analysis

Box’s M test indicates that the observations in the different categories come from homoscedastic multivariate normal populations, so the assumptions required for LDA are satisfied. Although MANOVA suggests that the group means do not differ, we apply LDA anyway to check how well (or poorly) the observations can be discriminated.

We apply leave-one-out cross-validation to assess how well LDA performs.

APER (Estimate) and Confusion Matrix
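Our analysis used R, but the same leave-one-out estimate of the APER can be sketched with scikit-learn. The features and labels below are random, hypothetical stand-ins; with labels unrelated to the features, the APER is high, mirroring what we found:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(97, 4))        # hypothetical features (Y1..Y4)
y = rng.integers(0, 4, size=97)     # hypothetical skin-reaction groups 0-3

# Leave-one-out CV: fit LDA on n-1 observations, classify the held-out one
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                      cv=LeaveOneOut()).mean()
print(f"estimated APER = {1 - acc:.3f}")  # high when groups are indistinguishable
```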

Principal Component Analysis (PCA)

[1] "sdev"     "rotation" "center"   "scale"    "x"       
Importance of components:
                          PC1    PC2    PC3     PC4
Standard deviation     1.5622 0.8261 0.7889 0.50471
Proportion of Variance 0.6101 0.1706 0.1556 0.06368
Cumulative Proportion  0.6101 0.7807 0.9363 1.00000
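The importance table above comes from R’s `prcomp`. The same summary (component standard deviations and variance proportions) can be reconstructed from a singular value decomposition; a numpy sketch on hypothetical correlated data, standardized as `prcomp(..., scale = TRUE)` would do:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical correlated data standing in for (Y1..Y4)
X = rng.normal(size=(97, 4)) @ rng.normal(size=(4, 4))

# Standardize, then take the SVD: PCA on the correlation scale
Z = (X - X.mean(0)) / X.std(0, ddof=1)
_, s, _ = np.linalg.svd(Z, full_matrices=False)

sdev = s / np.sqrt(len(Z) - 1)       # component standard deviations
prop = sdev**2 / np.sum(sdev**2)     # proportion of variance
print("Standard deviation    ", np.round(sdev, 4))
print("Proportion of Variance", np.round(prop, 4))
print("Cumulative Proportion ", np.round(np.cumsum(prop), 4))
```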

Scree Plot

Factor Analysis

Bartlett’s test

$chisq
[1] 173.0565

$p.value
[1] 6.456468e-32

$df
[1] 10
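Bartlett’s test of sphericity asks whether the correlation matrix is the identity (in which case factor analysis is pointless); the statistic is \(-(n-1-(2p+5)/6)\ln|R|\) with \(p(p-1)/2\) degrees of freedom, which is 10 for our five variables. A sketch in Python on hypothetical correlated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 97, 5
# Hypothetical correlated data standing in for V1..V5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Bartlett's test of sphericity: H0 says the correlation matrix is the identity
R = np.corrcoef(X, rowvar=False)
chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2
pval = stats.chi2.sf(chi2, df)
print(f"chisq = {chi2:.4f}, df = {df}, p = {pval:.3g}")
```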

Checking for 1 factor

Factor Analysis using method =  pa
Call: fa(r = D[, -6], nfactors = 1, rotate = "varimax", scores = "regression", 
    max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
    PA1   h2   u2 com
V1 0.65 0.42 0.58   1
V2 0.63 0.40 0.60   1
V3 0.48 0.23 0.77   1
V4 0.72 0.52 0.48   1
V5 0.91 0.83 0.17   1

                PA1
SS loadings    2.40
Proportion Var 0.48

Mean item complexity =  1
Test of the hypothesis that 1 factor is sufficient.

df null model =  10  with the objective function =  1.85 with Chi Square =  173.06
df of  the model are 5  and the objective function was  0.19 

The root mean square of the residuals (RMSR) is  0.07 
The df corrected root mean square of the residuals is  0.1 

The harmonic n.obs is  97 with the empirical chi square  10.18  with prob <  0.07 
The total n.obs was  97  with Likelihood Chi Square =  17.48  with prob <  0.0037 

Tucker Lewis Index of factoring reliability =  0.846
RMSEA index =  0.16  and the 90 % confidence intervals are  0.083 0.247
BIC =  -5.39
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                   PA1
Correlation of (regression) scores with factors   0.94
Multiple R square of scores with factors          0.89
Minimum correlation of possible factor scores     0.77

Checking for 2 factors

Factor Analysis using method =  pa
Call: fa(r = D[, -6], nfactors = 2, rotate = "varimax", scores = "regression", 
    max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
    PA1  PA2   h2     u2 com
V1 0.49 0.38 0.38  0.618 1.9
V2 0.24 1.12 1.31 -0.309 1.1
V3 0.54 0.06 0.29  0.707 1.0
V4 0.66 0.26 0.50  0.497 1.3
V5 0.92 0.28 0.92  0.081 1.2

                       PA1  PA2
SS loadings           1.86 1.55
Proportion Var        0.37 0.31
Cumulative Var        0.37 0.68
Proportion Explained  0.55 0.45
Cumulative Proportion 0.55 1.00

Mean item complexity =  1.3
Test of the hypothesis that 2 factors are sufficient.

df null model =  10  with the objective function =  1.85 with Chi Square =  173.06
df of  the model are 1  and the objective function was  0.03 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.09 

The harmonic n.obs is  97 with the empirical chi square  1.59  with prob <  0.21 
The total n.obs was  97  with Likelihood Chi Square =  2.85  with prob <  0.092 

Tucker Lewis Index of factoring reliability =  0.885
RMSEA index =  0.138  and the 90 % confidence intervals are  0 0.34
BIC =  -1.73
Fit based upon off diagonal values = 1

Model after Dropping V2 (Principal Axis Method): 1 factor

Factor Analysis using method =  pa
Call: fa(r = data.frame(Y1, Y2, Y3, Y4), nfactors = 1, rotate = "varimax", 
    scores = "regression", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
    PA1   h2    u2 com
Y1 0.58 0.34 0.661   1
Y2 0.53 0.28 0.723   1
Y3 0.70 0.49 0.506   1
Y4 0.97 0.93 0.069   1

                PA1
SS loadings    2.04
Proportion Var 0.51

Mean item complexity =  1
Test of the hypothesis that 1 factor is sufficient.

df null model =  6  with the objective function =  1.33 with Chi Square =  124.96
df of  the model are 2  and the objective function was  0.02 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.05 

The harmonic n.obs is  97 with the empirical chi square  0.9  with prob <  0.64 
The total n.obs was  97  with Likelihood Chi Square =  1.47  with prob <  0.48 

Tucker Lewis Index of factoring reliability =  1.013
RMSEA index =  0  and the 90 % confidence intervals are  0 0.185
BIC =  -7.68
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1
Correlation of (regression) scores with factors   0.97
Multiple R square of scores with factors          0.94
Minimum correlation of possible factor scores     0.88

Model after Dropping V2 (Principal Axis Method): 2 factors

Factor Analysis using method =  pa
Call: fa(r = data.frame(Y1, Y2, Y3, Y4), nfactors = 2, rotate = "varimax", 
    scores = "regression", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
    PA1  PA2   h2   u2 com
Y1 0.29 0.56 0.39 0.61 1.5
Y2 0.25 0.51 0.33 0.67 1.4
Y3 0.71 0.33 0.62 0.38 1.4
Y4 0.70 0.62 0.87 0.13 2.0

                       PA1  PA2
SS loadings           1.14 1.07
Proportion Var        0.29 0.27
Cumulative Var        0.29 0.55
Proportion Explained  0.52 0.48
Cumulative Proportion 0.52 1.00

Mean item complexity =  1.6
Test of the hypothesis that 2 factors are sufficient.

df null model =  6  with the objective function =  1.33 with Chi Square =  124.96
df of  the model are -1  and the objective function was  0 

The root mean square of the residuals (RMSR) is  0 
The df corrected root mean square of the residuals is  NA 

The harmonic n.obs is  97 with the empirical chi square  0  with prob <  NA 
The total n.obs was  97  with Likelihood Chi Square =  0  with prob <  NA 

Tucker Lewis Index of factoring reliability =  1.051
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA2
Correlation of (regression) scores with factors   0.78 0.72
Multiple R square of scores with factors          0.61 0.52
Minimum correlation of possible factor scores     0.22 0.04

Model after Dropping V2 (Maximum Likelihood Method): 1 factor


Call:
factanal(x = D.Y, factors = 1, scores = c("regression"), rotation = "varimax")

Uniquenesses:
   Y1    Y2    Y3    Y4 
0.682 0.742 0.482 0.051 

Loadings:
   Factor1
Y1 0.564  
Y2 0.508  
Y3 0.720  
Y4 0.974  

               Factor1
SS loadings      2.043
Proportion Var   0.511

Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 1.12 on 2 degrees of freedom.
The p-value is 0.57 
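The maximum likelihood fit above comes from R’s `factanal`. As a rough cross-check outside R, scikit-learn’s `FactorAnalysis` also fits a factor model by maximum likelihood; the sketch below generates hypothetical data from a one-factor model with loadings in the spirit of the output above and recovers loadings and uniquenesses (the loading values are assumptions for illustration, not our estimates):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(9)
n = 97
f = rng.normal(size=(n, 1))                   # one hypothetical common factor
load = np.array([[0.6, 0.5, 0.7, 0.9]])       # hypothetical loadings for Y1..Y4
X = f @ load + 0.5 * rng.normal(size=(n, 4))  # observed variables

fa = FactorAnalysis(n_components=1).fit(X)
print("loadings:    ", np.round(fa.components_.ravel(), 3))
print("uniquenesses:", np.round(fa.noise_variance_, 3))
```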

Conclusion

We conclude the following :

1. Neither Box’s M test nor MANOVA rejects its null hypothesis. This implies that the groups come from the same distribution, which was also evident from the exploratory data analysis. Therefore, classification does not make much sense for this data; every classification method will perform poorly on this data set.

2. There is a high chance that the “activity” measured in this study was not physical activity. One of the side effects of radiotherapy is cognitive decline; if activity in fact reflects something cognitive, it makes sense that it is independent of sleep and driven by a factor separate from the others.

3. Factor analysis has yielded that there is one factor which only affects activity and as discussed above this may be the “cognitive decline” factor. Among the other variables, one factor is the “Digestive” factor ( one of the side effects of radion is ulceration and stomach sores). There is another factor affecting sleep and number of symptoms. This may be the “patient” factor - how each patient is responding to treatment.