About The Project
This was a (joint) course project in the Multivariate course alongside Debepsita Mukherjee and Rupayan Mandal. It was supervised by Dr. Swagata Nandi. In this project we have analyzed radiotherapy data. The set contains data on people undergoing radiotherapy. Variables like number of symptoms, average appetite and average sleep (etc) throughout their treatment were considered. The dataset contained a categorical variable for skin reaction. The analysis includes EDA, normality checking and transformations, MANOVA, LDA, PCA and factor analysis. By studying the correlation structure of the data and Kendall’s tau test we conclude that one of the variables is independent from others and consequently we drop that from our analysis. We checked whether the observations in different categories belong to the same population. From EDA and MANOVA we conclude that the observations indeed come from the same population. Consequently, discriminant analysis is not meaningful. This fact is supported by APER calculated using one-out cross validation method. From factor analysis we conclude that one variable is significantly different compared to others. Also among the other four, there are two factors affecting the variables. One factor is affecting appetite and food consumed. This factor could be due to the treatment’s effect on mouth/digestive system. The other is affecting the number of symptoms and sleep.
This blog mostly consists of plots and outputs with as very little explanations. For a complete report with proper explanations kindly go through our project report which is available on my GitHub page.
INTRODUCTION
In this project we are interested to check whether we can identify different factors influencing different aspects like appetite and sleep of a patient throughout their treatment. We are also interested in checking whether we can classify different patients based on their skin reaction ( to the treatment).
Radiotherapy and its side effects
Radiation therapy or radiotherapy is a therapy using ionizing radiation, generally provided as part of cancer treatment to control or kill malignant cells. Radiation therapy is in itself painless. It causes minimal or no side effects, although short-term pain flare-up can be experienced in the days following treatment. Side effects from radiation are usually limited to the area of the patient’s body that is under treatment. Some of the potential side effects are as follows :
1. Nausea and vomiting.
2. (Skin Reaction) Epithelial surfaces may sustain damage from radiation therapy.This may include the skin, oral mucosa, etc.
3. If the head and neck area is treated, temporary soreness and ulceration commonly occur in the mouth and throat. If severe, this can affect swallowing
4. Cognitive decline.
In this project we are interested to check whether we can identify different factors influencing different aspects like appetite and sleep of a patient throughout their treatment. We are also interested in checking whether we can classify different patients based on their skin reaction ( to the treatment).
Data Description :
Average ratings over the course of treatment for patients undergoing radiotherapy.
1. Col. 1: (V1) = number of symptoms
2. Col. 2: (V2) = amount of activity (1-5 scale)
3. Col. 3: (V3) = amount of sleep (1-5 scale)
4. Col. 4: (V4) = amount of food consumed (1-3 scale)
5. Col. 5: (V5) = appetite (1-5 scale) )
6. Col. 6: (V6) = skin reaction (0, 1, 2 or 3)
As they have taken average of ratings, we have assumed that V_{1},..,V_{5} are continuous variables.
In this report Vi or V_{i} will be interchangeably used for the following (i=1,..,6) :
• V1 : AVERAGE NUMBER OF SYMPTOMS THROUGHOUT THE TREATMENT
• V2: AVERAGE AMOUNT OF ACTIVITY THROUGHOUT THE TREATMENT
• V3: AVERAGE AMOUNT OF SLEEP THROUGHOUT THE TREATMENT
• V4: AVERAGE AMOUNT OF FOOD CONSUMED THROUGHOUT THE TREATMENT
• V5 : V2: AVERAGE AMOUNT OF APPETITE THROUGHOUT THE TREATMENT
• V2: SKIN REACTION AFTER THE TREATMENT
Exploratory Data Analysis
Histograms
Histograms (Groupwise)
Boxplots
Bivariate Plots
Trivariate Plots
Here we take all 3 - variable combinations and plot them. Here also, they seem to be quite well mixed. They do not seem to come from different populations. You may try rotating these plots in different directions yourself.
- Plot of \((X1,X2,X3)\)
- Plot of \((X4,X2,X3)\)
- Plot of \((X4,X5,X3)\)
- Plot of \((X4,X5,X1)\)
OBSERVATIONS :
For every variable , the mean of different groups do not differ much.
There are not any significant outliers in the data.
From both scatter plots and 3D plots we observe that there is not much separation in the data. In fact, the populations resemble to the point they are mostly indistinguishable.
Histograms suggest non-normality for some of the variables.
In Search of Normality
Q-Q Plots
Test of (Univariate) Normality :
Shapiro- Wilk’s Test
Testing Normality for \(V1\)
Shapiro-Wilk normality test data: D$V1 W = 0.96989, p-value = 0.02494
Testing Normality for \(V2\)
Shapiro-Wilk normality test data: D$V2 W = 0.89111, p-value = 7.643e-07
Testing Normality for \(V3\)
Shapiro-Wilk normality test data: D$V3 W = 0.97874, p-value = 0.1171
Testing Normality for \(V4\)
Shapiro-Wilk normality test data: D$V4 W = 0.96068, p-value = 0.005362
Testing Normality for \(V5\)
Shapiro-Wilk normality test data: D$V5 W = 0.97884, p-value = 0.119
Transformation to convert Non-Normal Variables to Normal
Square Root transformation
As the data is positively skewed ( of a positive variable), we apply the square root transformation to check whether it works. This is because the square root transformation is a standard transformation for such cases. We apply the shapiro Wilk’s test. The result is as follows :
- \(V1\) (Average number of Symptoms throughout the treatment)
Shapiro-Wilk normality test
data: sqrt(D$V1)
W = 0.98719, p-value = 0.473
- \(V4\) (Average amount of food consumed throughout the treatment))
Shapiro-Wilk normality test
data: sqrt(D$V4)
W = 0.966, p-value = 0.01287
- \(V2\) (Average amount of “activity” throughout the treatment)
Shapiro-Wilk normality test
data: sqrt(D$V2)
W = 0.91419, p-value = 9.342e-06
Box-Cox Transformation on \(V2\)
As the square-root transformation does not work, we apply the Box-Cox transformation and check how it performs.
Johnson Transformation on \(V2\)
As the Box-Cox transformation does not work, we apply the Johnson transformation and check how it performs.
Testing the normality of V2 after the transformation
Shapiro-Wilk normality test
data: X_2
W = 0.91949, p-value = 1.746e-05
Test of Multivariate Normality
Till now in our analysis, we have tranformed V1 and V4. We have also dropped V2 because of the above mentioned reasons. So now, let the variables be as follows :
1. \(Y_{1}=\sqrt{V_{1}}\)
2. \(Y_{2}=V_{3}\)
3. \(Y_{3}=\sqrt{V_{4}}\)
4. \(Y_{4}=V_{5}\)
Therefore, now we have multivariate vectors \(X=(Y_{1},Y_{2},Y_{3},Y_{4})\). Now we are interested in testing the multivariate normality of X. There are several ways ( both graphical and theoretical ) to check for multivariate normality. We apply the following :
• Graphical Methods :
1. Contour Plots ( For Bivariate Normality ).
2. Perspective Plots ( For Trivariate Normality ).
3. Gamma Plots.
• Tests :
1. Royston’s test
2. Henze-Zirkler’s test
3. Characterization of Multivariate Normality and Bonferonni’s correction.
$multivariateNormality
Test H p value MVN
1 Royston 11.54976 0.02086054 NO
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling Y1 0.3769 0.4036 YES
2 Anderson-Darling Y2 0.5210 0.1809 YES
3 Anderson-Darling Y3 1.2746 0.0025 NO
4 Anderson-Darling Y4 0.4418 0.2832 YES
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th
Y1 97 1.783927 0.6255731 1.855802 0.000000 3.234347 1.386723 2.276401
Y2 97 2.143608 0.7575019 2.188000 0.666000 4.000000 1.571000 2.714000
Y3 97 1.482043 0.1119985 1.457738 1.134019 1.713768 1.414214 1.563330
Y4 97 2.581320 0.9311169 2.500000 1.000000 5.000000 1.917000 3.272000
Skew Kurtosis
Y1 -0.37431107 -0.02862228
Y2 0.06609072 -0.73988814
Y3 0.12535763 -0.15684912
Y4 0.22779079 -0.63983977
$multivariateOutliers
Observation Mahalanobis Distance Outlier
12 12 19.125 TRUE
67 67 14.054 TRUE
37 37 13.576 TRUE
76 76 11.816 TRUE
14 14 11.655 TRUE
43 43 11.586 TRUE
55 55 11.534 TRUE
Checking Multivariate Normality (Groupwise): Group 0
$multivariateNormality
Test H p value MVN
1 Royston 5.391579 0.2610785 YES
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling V1 0.4359 0.2672 YES
2 Anderson-Darling V3 0.2726 0.6282 YES
3 Anderson-Darling V4 0.7416 0.0441 NO
4 Anderson-Darling V5 0.2254 0.7900 YES
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew
V1 19 3.638211 2.6657918 3.714 0.000 10.461 1.553 5.1835 0.48003593
V3 19 1.937684 0.9031670 1.941 0.666 4.000 1.091 2.5715 0.40585598
V4 19 2.136474 0.2908127 2.000 1.667 2.727 2.000 2.3305 0.56750134
V5 19 2.545158 0.9224974 2.444 1.000 4.091 2.000 3.3245 0.03166006
Kurtosis
V1 0.04125762
V3 -0.74223631
V4 -0.74205578
V5 -1.11051003
$multivariateOutliers
Observation Mahalanobis Distance Outlier
43 43 151.416 TRUE
76 76 141.502 TRUE
17 17 62.964 TRUE
33 33 60.030 TRUE
74 74 59.116 TRUE
36 36 37.902 TRUE
Checking Multivariate Normality (Groupwise): Group 1
$multivariateNormality
Test H p value MVN
1 Royston 7.507276 0.1156166 YES
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling V1 0.4697 0.2362 YES
2 Anderson-Darling V3 0.4960 0.2031 YES
3 Anderson-Darling V4 0.8551 0.0257 NO
4 Anderson-Darling V5 0.2234 0.8148 YES
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
V1 45 3.363600 1.9258955 3.333 0.222 7.571 1.750 4.571 0.3569522 -0.8447893
V3 45 2.161333 0.7349420 2.000 0.950 3.689 1.636 2.706 0.1487980 -1.0646483
V4 45 2.217067 0.3282640 2.154 1.611 2.937 2.000 2.500 0.4413692 -0.8256445
V5 45 2.602244 0.9406222 2.539 1.000 5.000 1.917 3.182 0.1836190 -0.4312384
$multivariateOutliers
Observation Mahalanobis Distance Outlier
7 7 24.488 TRUE
81 81 21.385 TRUE
67 67 20.020 TRUE
86 86 16.591 TRUE
5 5 13.563 TRUE
58 58 12.898 TRUE
71 71 12.190 TRUE
Checking Multivariate Normality (Groupwise): Group 2
$multivariateNormality
Test H p value MVN
1 Royston 8.987751 0.06570336 YES
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling V1 0.6174 0.0930 YES
2 Anderson-Darling V3 0.6189 0.0922 YES
3 Anderson-Darling V4 0.7714 0.0373 NO
4 Anderson-Darling V5 0.3093 0.5274 YES
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew
V1 20 3.57180 1.8605152 2.909 0.800 7.231 2.41625 5.23650 0.4093321
V3 20 2.12470 0.6744109 2.225 0.999 3.154 1.83175 2.51025 -0.4160068
V4 20 2.27230 0.3131959 2.267 1.800 2.889 2.03375 2.34600 0.6508381
V5 20 2.64175 0.9236534 2.506 1.000 4.384 2.04650 3.34275 0.2544680
Kurtosis
V1 -1.0930361
V3 -1.0324954
V4 -0.7429230
V5 -0.9765935
$multivariateOutliers
Observation Mahalanobis Distance Outlier
46 46 77.388 TRUE
55 55 60.778 TRUE
54 54 35.592 TRUE
78 78 24.780 TRUE
Checking Multivariate Normality (Groupwise): Group3
$multivariateNormality
Test H p value MVN
1 Royston 2.079578 0.6621113 YES
$univariateNormality
Test Variable Statistic p value Normality
1 Anderson-Darling V1 0.4793 0.1935 YES
2 Anderson-Darling V3 0.2026 0.8433 YES
3 Anderson-Darling V4 0.3140 0.5057 YES
4 Anderson-Darling V5 0.3467 0.4226 YES
$Descriptives
n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
V1 13 4.179769 2.5928553 3.667 0.750 9.454 2.444 5.250 0.6451586 -0.8688624
V3 13 2.412308 0.7204971 2.375 1.223 3.818 2.111 2.812 0.0388006 -0.8344643
V4 13 2.188692 0.4466342 2.125 1.286 2.929 1.933 2.455 -0.1631988 -0.7828205
V5 13 2.468769 1.0202204 2.125 1.252 4.428 1.776 3.142 0.5062718 -1.1342029
$multivariateOutliers
Observation Mahalanobis Distance Outlier
45 45 159.076 TRUE
59 59 133.498 TRUE
12 12 110.191 TRUE
90 90 80.259 TRUE
Dropping V2 Variable
In our study of MANOVA and discriminant analysis, we have decided to drop the V_{2}variable. This variable corresponds to the average amount of “activity” of a patient throughout their treatment. We have already noted that there is ambiguity in the meaning of activity. What is it measuring ? We do not know.
We had assumed that activity means physical activity and it is measured on a scale of 1-5 “by some means”. However, that interpretation is not working due to the following reasons.
• If by activity we mean physical activity, it should have strong association with the average amount of sleep. Let us calculate their correlation :
[1] 0.1839977
This value is surprisingly low.
• If by activity we mean physical activity, it certainly should not be independent from average amount of sleep. We apply Kendall’s \tau test. We note that as it is a non-parametric test, we do not need the assumption of normality. We do however need these variables to be continuous, which we have already assumed.
Kendel’s Tau test:
Kendall's rank correlation tau
data: D$V2 and D$V3
z = 2.01, p-value = 0.9778
alternative hypothesis: true tau is less than 0
sample estimates:
tau
0.1409736
We observe that the test strongly accepts the assumption of indepence.
• We have already seen that this variable strongly deviates from normality. Square-root transformation, Box-cox transformation and Johnson’s transformation were applied but none worked. Hence, we CANNOT apply Box’s M test, MANOVA or linear discriminant analysis. In principle we may try to apply Fisher’s linear discriminants as normality assumption is not required. However, we need to check the assumption of homoscedasticity . We do that using Box’s M test. However, that in turn requires the assumption of normality.
• We will later see that Factor Analysis ( discussed in part X ) suggests that there is one single factor affecting/influencing only this variable, independent of another factor affecting all the others. We shall discuss this in great detail in part X.
Testing Equality of Population Covariance Matrices and the Means
Box’s M Test
Box's M-test for Homogeneity of Covariance Matrices
data: cbind(D$V1, D$V3, D$V4, D$V5)
Chi-Sq (approx.) = 31.842, df = 30, p-value = 0.3749
MANOVA
Df Pillai approx F num Df den Df Pr(>F)
V6 3 0.10225 0.81155 12 276 0.6385
Residuals 93
Linear Discriminant Analysis
Here we observed that observations belonging to different catagories are coming homoscedastic multivariate populations. So the assumptions required to apply LDA are satisfied . Hence we are applying LDA to discriminate the observations which are coming from different parent population.
We apply the method of ( One - Out ) Cross-Validation to assess how well LDA performs.
AER (Estimate) and Confusion Matrix
Principal Component Analysis (PCA)
[1] "sdev" "rotation" "center" "scale" "x"
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5622 0.8261 0.7889 0.50471
Proportion of Variance 0.6101 0.1706 0.1556 0.06368
Cumulative Proportion 0.6101 0.7807 0.9363 1.00000
Scree Plot
Factor Analysis
Bartlett’s test
$chisq
[1] 173.0565
$p.value
[1] 6.456468e-32
$df
[1] 10
Checking for 1 factor
Factor Analysis using method = pa
Call: fa(r = D[, -6], nfactors = 1, rotate = "varimax", scores = "regression",
max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 h2 u2 com
V1 0.65 0.42 0.58 1
V2 0.63 0.40 0.60 1
V3 0.48 0.23 0.77 1
V4 0.72 0.52 0.48 1
V5 0.91 0.83 0.17 1
PA1
SS loadings 2.40
Proportion Var 0.48
Mean item complexity = 1
Test of the hypothesis that 1 factor is sufficient.
df null model = 10 with the objective function = 1.85 with Chi Square = 173.06
df of the model are 5 and the objective function was 0.19
The root mean square of the residuals (RMSR) is 0.07
The df corrected root mean square of the residuals is 0.1
The harmonic n.obs is 97 with the empirical chi square 10.18 with prob < 0.07
The total n.obs was 97 with Likelihood Chi Square = 17.48 with prob < 0.0037
Tucker Lewis Index of factoring reliability = 0.846
RMSEA index = 0.16 and the 90 % confidence intervals are 0.083 0.247
BIC = -5.39
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
PA1
Correlation of (regression) scores with factors 0.94
Multiple R square of scores with factors 0.89
Minimum correlation of possible factor scores 0.77
Checking for 2 factor
Factor Analysis using method = pa
Call: fa(r = D[, -6], nfactors = 2, rotate = "varimax", scores = "regression",
max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
V1 0.49 0.38 0.38 0.618 1.9
V2 0.24 1.12 1.31 -0.309 1.1
V3 0.54 0.06 0.29 0.707 1.0
V4 0.66 0.26 0.50 0.497 1.3
V5 0.92 0.28 0.92 0.081 1.2
PA1 PA2
SS loadings 1.86 1.55
Proportion Var 0.37 0.31
Cumulative Var 0.37 0.68
Proportion Explained 0.55 0.45
Cumulative Proportion 0.55 1.00
Mean item complexity = 1.3
Test of the hypothesis that 2 factors are sufficient.
df null model = 10 with the objective function = 1.85 with Chi Square = 173.06
df of the model are 1 and the objective function was 0.03
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.09
The harmonic n.obs is 97 with the empirical chi square 1.59 with prob < 0.21
The total n.obs was 97 with Likelihood Chi Square = 2.85 with prob < 0.092
Tucker Lewis Index of factoring reliability = 0.885
RMSEA index = 0.138 and the 90 % confidence intervals are 0 0.34
BIC = -1.73
Fit based upon off diagonal values = 1
Model after Dropping V2 (PCA Method): 1 factor
Factor Analysis using method = pa
Call: fa(r = data.frame(Y1, Y2, Y3, Y4), nfactors = 1, rotate = "varimax",
scores = "regression", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 h2 u2 com
Y1 0.58 0.34 0.661 1
Y2 0.53 0.28 0.723 1
Y3 0.70 0.49 0.506 1
Y4 0.97 0.93 0.069 1
PA1
SS loadings 2.04
Proportion Var 0.51
Mean item complexity = 1
Test of the hypothesis that 1 factor is sufficient.
df null model = 6 with the objective function = 1.33 with Chi Square = 124.96
df of the model are 2 and the objective function was 0.02
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.05
The harmonic n.obs is 97 with the empirical chi square 0.9 with prob < 0.64
The total n.obs was 97 with Likelihood Chi Square = 1.47 with prob < 0.48
Tucker Lewis Index of factoring reliability = 1.013
RMSEA index = 0 and the 90 % confidence intervals are 0 0.185
BIC = -7.68
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1
Correlation of (regression) scores with factors 0.97
Multiple R square of scores with factors 0.94
Minimum correlation of possible factor scores 0.88
Model after Dropping V2 (PCA Method): 2 factor
Factor Analysis using method = pa
Call: fa(r = data.frame(Y1, Y2, Y3, Y4), nfactors = 2, rotate = "varimax",
scores = "regression", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
Y1 0.29 0.56 0.39 0.61 1.5
Y2 0.25 0.51 0.33 0.67 1.4
Y3 0.71 0.33 0.62 0.38 1.4
Y4 0.70 0.62 0.87 0.13 2.0
PA1 PA2
SS loadings 1.14 1.07
Proportion Var 0.29 0.27
Cumulative Var 0.29 0.55
Proportion Explained 0.52 0.48
Cumulative Proportion 0.52 1.00
Mean item complexity = 1.6
Test of the hypothesis that 2 factors are sufficient.
df null model = 6 with the objective function = 1.33 with Chi Square = 124.96
df of the model are -1 and the objective function was 0
The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is NA
The harmonic n.obs is 97 with the empirical chi square 0 with prob < NA
The total n.obs was 97 with Likelihood Chi Square = 0 with prob < NA
Tucker Lewis Index of factoring reliability = 1.051
Fit based upon off diagonal values = 1
Measures of factor score adequacy
PA1 PA2
Correlation of (regression) scores with factors 0.78 0.72
Multiple R square of scores with factors 0.61 0.52
Minimum correlation of possible factor scores 0.22 0.04
Model after Dropping V2 (Maximum Likelihood Method):1 factor
Call:
factanal(x = D.Y, factors = 1, scores = c("regression"), rotation = "varimax")
Uniquenesses:
Y1 Y2 Y3 Y4
0.682 0.742 0.482 0.051
Loadings:
Factor1
Y1 0.564
Y2 0.508
Y3 0.720
Y4 0.974
Factor1
SS loadings 2.043
Proportion Var 0.511
Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 1.12 on 2 degrees of freedom.
The p-value is 0.57
Conclusion
We conclude the following :
1. Both Box’s M test and MANOVA are being accepted. This implies that groupwise they belong to the same distribution. This was also evident from exploratory data analysis. Therefore, classification does not make much sense for his data. Every type of classification method will perform poorly in this data set.
2. There is a high chance that “activity” which was being measured in this study was not physical activity. One of the side effects of radiotherapy is cognitive decline. Then it will make sense why it is independent of sleep and why another factor is affecting it compared to others.
3. Factor analysis has yielded that there is one factor which only affects activity and as discussed above this may be the “cognitive decline” factor. Among the other variables, one factor is the “Digestive” factor ( one of the side effects of radion is ulceration and stomach sores). There is another factor affecting sleep and number of symptoms. This may be the “patient” factor - how each patient is responding to treatment.