FactorAssumptions

Set of Assumptions for Factor and Principal Component Analysis

Description: Tests for the Kaiser-Meyer-Olkin (KMO) measure and communalities in a dataset. It produces a final sample by removing variables iteratively while keeping track of the variables removed at each step.

What are KMO and Communalities?

Factor Analysis and Principal Component Analysis (PCA) rest on assumptions that should be checked before interpreting the results (Hair et al. 2018).

The first one is the KMO (Kaiser-Meyer-Olkin) measure, which estimates the proportion of variance among the variables that may be common variance, also called systematic variance. KMO ranges between 0 and 1. Low values (close to 0) indicate that the partial correlations are large in comparison to the sum of the correlations, that is, the correlations among the variables are of a kind that is problematic for factor/principal component analysis. Hair et al. (2018) suggest that variables with an individual KMO below 0.5 be removed from the factor/principal component analysis. Once every remaining variable has an individual KMO above 0.5, the overall KMO of the remaining variables is also above 0.5.
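To make the definition concrete, here is a minimal base R sketch of how overall and individual KMO values can be computed from a correlation matrix. It is not the code used by FactorAssumptions; the function name kmo_sketch and the choice of mtcars columns are assumptions made for illustration only.

```r
# Sketch: overall and per-variable KMO from a data frame (base R only).
# kmo_sketch is a hypothetical helper, not part of FactorAssumptions.
kmo_sketch <- function(x) {
  r <- cor(x)                          # observed correlations
  r_inv <- solve(r)                    # inverse of the correlation matrix
  # Partial correlations of each pair, controlling for all other variables
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))
  diag(r) <- 0                         # keep off-diagonal elements only
  diag(p) <- 0
  r2 <- r^2
  p2 <- p^2
  list(
    overall = sum(r2) / (sum(r2) + sum(p2)),
    individual = colSums(r2) / (colSums(r2) + colSums(p2))
  )
}

# Example on a few numeric columns of the built-in mtcars data
kmo_mtcars <- kmo_sketch(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
kmo_mtcars$overall
round(kmo_mtcars$individual, 2)
```

Squared partial correlations that are large relative to the squared correlations push both ratios toward 0, which is the situation the 0.5 cutoff is meant to flag.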

The second assumption of a valid factor analysis or PCA concerns the communalities of the variables in the rotated solution. The communalities indicate the variance that each variable shares with the extracted factors/components. A greater communality indicates that a greater amount of the variable's variance was extracted by the factor/principal component solution. For a better measurement, communalities should be 0.5 or greater (Hair et al. 2018).
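As a rough illustration of what a communality is, the sketch below computes it as the sum of squared loadings of each variable across the retained components, using base R only. The dataset (mtcars), the number of components (k = 2) and all object names are assumptions chosen for the example; the package itself relies on psych.

```r
# Sketch: communalities from a principal component solution (base R only).
x <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
k <- 2                                  # number of components kept (arbitrary here)
e <- eigen(cor(x))

# Loadings = eigenvectors scaled by the square root of their eigenvalues
load_unrot <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(load_unrot) <- colnames(x)

# Varimax is an orthogonal rotation, so it leaves communalities unchanged
load_rot <- unclass(varimax(load_unrot)$loadings)

comm <- rowSums(load_rot^2)             # communality = sum of squared loadings
round(comm, 2)
names(comm)[comm < 0.5]                 # candidates for removal under the 0.5 rule
```

Because the rotation is orthogonal, the same communalities are obtained from the unrotated loadings; rotation only redistributes the variance among the components.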

Loading an example dataset

First we will load the package FactorAssumptions and the example dataset bfi from psych.

library(FactorAssumptions, quietly = TRUE, verbose = FALSE)
bfi_data <- bfi
# Keep only complete cases (drop rows with missing values)
bfi_data <- bfi_data[complete.cases(bfi_data), ]
head(bfi_data)

##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61623  6  6  5  6  5  6  6  6  1  3  2  1  6  5  6  3  5  2  2  3  4  3  5  6
## 61629  4  3  1  5  1  3  2  4  2  4  3  6  4  2  1  6  3  2  6  4  3  2  4  5
## 61634  4  4  5  6  5  4  3  5  3  2  1  3  2  5  4  3  3  4  2  3  5  3  5  6
## 61640  4  5  2  2  1  5  5  5  2  2  3  4  3  6  5  2  4  2  2  3  5  2  5  5
## 61661  1  5  6  5  6  4  3  2  4  5  2  1  2  5  2  2  2  2  2  2  6  1  5  5
## 61664  2  6  5  6  5  3  5  6  3  6  2  2  4  6  6  4  4  4  6  6  6  1  5  6
##       O5 gender education age
## 61623  1      2         3  21
## 61629  3      1         2  19
## 61634  3      1         1  21
## 61640  5      1         1  17
## 61661  2      1         5  68
## 61664  1      2         2  27

Performing the KMO Assumptions

First we will check the KMO > 0.5 assumption for every individual variable in the dataset with the kmo_optimal_solution function.

kmo_bfi <- kmo_optimal_solution(bfi_data, squared = FALSE)

## Final Solution Achieved!

Note that kmo_optimal_solution outputs a list:

The final solution as df
Removed variables with individual KMO < 0.5 as removed
Anti-image covariance matrix as AIS
Anti-image correlation matrix as AIR

In our case, none of the variables were removed due to low individual KMO values.

kmo_bfi$removed

## NULL
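The idea of removing variables one at a time can be sketched in base R as follows. This is only an illustration of the general approach under stated assumptions (drop the variable with the lowest individual KMO, recompute, repeat); it is not the implementation of kmo_optimal_solution, and the helper names individual_kmo and drop_low_kmo as well as the simulated data are hypothetical.

```r
# Hypothetical helper: per-variable KMO from a data frame (base R only).
individual_kmo <- function(x) {
  r <- cor(x)
  r_inv <- solve(r)
  p <- -r_inv / sqrt(outer(diag(r_inv), diag(r_inv)))  # partial correlations
  diag(r) <- 0
  diag(p) <- 0
  colSums(r^2) / (colSums(r^2) + colSums(p^2))
}

# Drop the variable with the lowest individual KMO until all reach the cutoff.
# Stops at two variables, where KMO is no longer informative.
drop_low_kmo <- function(x, cutoff = 0.5) {
  removed <- character(0)
  repeat {
    kmo <- individual_kmo(x)
    if (min(kmo) >= cutoff || ncol(x) <= 2) break
    worst <- names(which.min(kmo))
    removed <- c(removed, worst)         # record the order of removal
    x <- x[, setdiff(colnames(x), worst), drop = FALSE]
  }
  list(df = x, removed = removed)
}

# Simulated data: three variables sharing a common source plus one unrelated
set.seed(1)
common <- rnorm(200)
toy <- data.frame(
  a = common + rnorm(200, sd = 0.5),
  b = common + rnorm(200, sd = 0.5),
  c = common + rnorm(200, sd = 0.5),
  noise = rnorm(200)
)
res <- drop_low_kmo(toy)
res$removed
```

Recomputing the KMO values after each removal matters because dropping one variable changes the partial correlations of all the others.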

Performing the Communalities Assumptions

A parallel analysis of the bfi data suggests seven factors. We will therefore check the assumption that all individual communalities are > 0.5 with the argument nfactors set to 7.

For the type argument we can use either principal or fa, which correspond to the functions of the same name in the psych package:

principal will perform a Principal Component Analysis (PCA)
fa will perform a Factor Analysis

Note: we are using the df generated by the kmo_optimal_solution function.

Note 2: the default rotation used by communalities_optimal_solution is varimax. You can change it if you want.

comm_bfi <- communalities_optimal_solution(kmo_bfi$df, type = "principal", nfactors = 7, squared = FALSE)

## There is still an individual communality value below 0.5: A4 - 0.423382853387628

## There is still an individual communality value below 0.5: O4 - 0.4739445052555

## There is still an individual communality value below 0.5: C1 - 0.494613330049184

Note that communalities_optimal_solution outputs a list:

The final solution as df
Removed variables with individual communalities < 0.5 as removed
A table with the communalities and loadings of the variables from the final iteration as loadings
Results of the final iteration of either the principal or fa function from the psych package as results

In our case, 3 variables were removed iteratively due to low individual communality values. They are listed in the order in which they were removed, starting with the lowest communality, until an optimal solution was reached.

comm_bfi$removed

## [1] "A4" "O4" "C1"

Finally, we arrive at the rotated matrix of our final principal component analysis. You can export it as a CSV with write.csv or write.csv2.

comm_bfi$results

## Principal Components Analysis
## Call: principal(r = df, nfactors = nfactors, scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##             RC2   RC1   RC5   RC4   RC3   RC6   RC7   h2   u2 com
## A1         0.13  0.13 -0.51  0.09  0.22  0.46 -0.22 0.60 0.40 3.2
## A2         0.03  0.15  0.69  0.14 -0.05 -0.21  0.08 0.57 0.43 1.4
## A3        -0.02  0.21  0.75  0.11  0.01  0.00 -0.02 0.62 0.38 1.2
## A5        -0.17  0.30  0.67  0.07  0.05  0.09  0.02 0.59 0.41 1.6
## C2         0.11 -0.04  0.17  0.72 -0.08  0.14 -0.05 0.59 0.41 1.3
## C3        -0.02 -0.01  0.13  0.72  0.08  0.06  0.09 0.55 0.45 1.1
## C4         0.22 -0.13  0.04 -0.70  0.22  0.22 -0.06 0.66 0.34 1.8
## C5         0.29 -0.21  0.01 -0.67  0.03  0.12  0.06 0.59 0.41 1.7
## E1        -0.01 -0.74 -0.12  0.10  0.13  0.22  0.00 0.63 0.37 1.3
## E2         0.22 -0.75 -0.16 -0.07  0.06  0.02 -0.06 0.65 0.35 1.3
## E3         0.03  0.51  0.43  0.09 -0.18  0.31 -0.09 0.59 0.41 3.1
## E4        -0.15  0.64  0.41  0.07  0.15  0.07 -0.10 0.64 0.36 2.1
## E5         0.09  0.57  0.17  0.34 -0.14  0.16  0.15 0.55 0.45 2.5
## N1         0.83  0.09 -0.18 -0.05  0.08  0.03  0.01 0.73 0.27 1.1
## N2         0.82  0.07 -0.17 -0.03  0.00 -0.04  0.00 0.71 0.29 1.1
## N3         0.79 -0.07  0.01 -0.07  0.01 -0.03 -0.07 0.65 0.35 1.1
## N4         0.63 -0.42  0.04 -0.18 -0.03  0.09  0.06 0.62 0.38 2.0
## N5         0.61 -0.20  0.15 -0.03  0.17 -0.20 -0.12 0.52 0.48 1.9
## O1         0.01  0.15  0.19  0.12 -0.47  0.49  0.03 0.54 0.46 2.6
## O2         0.15 -0.02  0.12 -0.08  0.72  0.04 -0.03 0.57 0.43 1.2
## O3         0.06  0.26  0.27  0.04 -0.57  0.34  0.02 0.59 0.41 2.7
## O5         0.04 -0.01 -0.02 -0.02  0.76  0.05 -0.03 0.59 0.41 1.0
## gender     0.19  0.16  0.19  0.11  0.01 -0.66 -0.02 0.55 0.45 1.5
## education  0.00 -0.03  0.04 -0.01 -0.07  0.06  0.77 0.60 0.40 1.0
## age       -0.08  0.06  0.03  0.06 -0.01 -0.08  0.77 0.61 0.39 1.1
## 
##                        RC2  RC1  RC5  RC4  RC3  RC6  RC7
## SS loadings           3.09 2.69 2.47 2.23 1.90 1.38 1.32
## Proportion Var        0.12 0.11 0.10 0.09 0.08 0.06 0.05
## Cumulative Var        0.12 0.23 0.33 0.42 0.50 0.55 0.60
## Proportion Explained  0.20 0.18 0.16 0.15 0.13 0.09 0.09
## Cumulative Proportion 0.20 0.38 0.55 0.69 0.82 0.91 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 7 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.06 
##  with the empirical chi square  4194.08  with prob <  0 
## 
## Fit based upon off diagonal values = 0.92
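The CSV export mentioned above might look like the sketch below. To keep it runnable on its own, it uses a small stand-in table with made-up values; with the objects from this document one would presumably pass comm_bfi$loadings instead, assuming that element is a plain table that write.csv accepts. The file name is also an assumption.

```r
# Stand-in for comm_bfi$loadings; the values are illustrative, not real results.
loadings_table <- data.frame(
  RC1 = c(0.80, 0.10),
  RC2 = c(0.05, 0.75),
  row.names = c("item1", "item2")
)

out_file <- file.path(tempdir(), "loadings.csv")
write.csv(loadings_table, file = out_file)      # "," separator, "." decimal mark
# write.csv2(loadings_table, file = out_file)   # ";" separator, "," decimal mark

# Read the file back to check what was written
read.csv(out_file, row.names = 1)
```

write.csv2 is the variant to prefer in locales where the comma is the decimal mark, since spreadsheet software there usually expects semicolon-separated files.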

References

Hair, Joseph F., William C. Black, Barry J. Babin, and Rolph E. Anderson. 2018. Multivariate Data Analysis. 8th ed. Cengage Learning.
