Factorial validity and measurement invariance across gender groups of the German version of the Interpersonal Reactivity Index

The Interpersonal Reactivity Index (IRI) is the most widely used measure of empathy, but its factorial validity has been questioned. The present research investigates the factorial validity of the German adaptation of the IRI, the “Saarbrücker Persönlichkeitsfragebogen SPF-IRI”. Confirmatory factor analysis (CFA) and exploratory structural equation modeling (ESEM) were used to test the theoretically predicted four-factor model. Across two subsamples, ESEM outperformed CFA. Substantial cross-loadings were evident in ESEM. Measurement invariance (MI) across gender groups was tested using ESEM in the combined sample. Strict MI (invariant factor loadings, intercepts, residuals) could be established, and variances and covariances were also equal. Differences in latent means were evident: women scored higher on fantasy, empathic concern, and personal distress, whereas no significant differences were found for perspective taking. Mean differences were due to real differences on the latent variables and not a result of measurement bias. Results support the factorial validity of the German SPF-IRI. The heterogeneity of empathy and the unclear differentiation between cognitive and emotional aspects may explain the imperfect differentiation among the scales.

PD indicates a susceptibility to social stress and thus has been associated with social anxiety, shyness, loneliness, and difficulties in social interactions (Carmel and Glick, 1996; Cliffordson, 2002; Davis, 1983). Empathy deficits have been documented for patients with schizophrenia (Abramowitz, Ginger, Gollan, and Smith, 2014; Smith et al. 2012). A recent meta-analysis showed that patients scored lower on EC, PT, and FT, but higher on PD compared to healthy controls (Bonfils, Lysaker, Minor, and Salyers, 2017).
Empathy has also been related to the Big Five traits (Lee, 2009; Mooradian, Davis, and Matzler, 2011). PT and EC have shown small negative correlations with neuroticism, whereas FT has shown small positive correlations with neuroticism. The strongest associations were found between PD and neuroticism. Associations between empathy and the Big Five traits were also confirmed to be cross-culturally valid (Melchers et al. 2016).
The IRI is the most widely used measure of empathy, yet criticism remains. Several studies have confirmed a four-dimensional structure across various languages, e.g., French (Gilet et al. 2013), Spanish (Fernández et al. 2011; Garcia-Barrera, Karr, Trujillo-Orrego, Trujillo-Orrego, and Pineda, 2017), and Dutch (De Corte et al. 2007; Hawk et al. 2013). In many cases, however, confirmatory factor analysis (CFA) has revealed problems: researchers had to drop items from the scale or use item parceling, and in most cases acceptable model fit following conventional cut-offs could not be achieved.
A German adaptation of the IRI was provided by Paulus (2009) and named "Saarbrücker Persönlichkeitsfragebogen SPF (IRI)"; it will be called SPF-IRI from here on. It is a 16-item version of the original 28-item IRI, and a four-factor structure was established using exploratory factor analysis (EFA). Koller and Lamm (2015) conducted an analysis of the SPF-IRI on the basis of item response theory. Results indicated that only the subscale empathic concern conformed to the assumptions of a partial credit model. The personal distress subscale, especially, was evaluated critically. To date, no extensive evaluation of the factor structure of the SPF-IRI has been published. Hence, there is a need to show that the adapted and shortened German version of the IRI still complies with a similar factor structure when compared to the international versions.

Measurement and factor analysis
Knowledge of the internal structure of a measure provides a basic understanding of the quality of measurement. Factor analysis is at the heart of current methodological approaches to investigating internal structure. Aggregation to sum scores, estimation of reliability, and finally associations with other constructs can only be meaningfully interpreted if the internal structure of a test has been determined (Brown, 2006). Traditionally, multi-dimensional constructs (such as empathy) are expected to comply with a simple structure (Thurstone, 1934). In a nutshell, simple structure assumes that an item shows strong correlations with other items belonging to the same factor, yet ideally zero correlations with items belonging to other factors. Simple structure has often been considered a fundamental principle for interpreting the results of factor analyses (Kline, 1994). Contrasting these long-standing heuristics, more recent research has shown that constructs in the domain of personality can be more complex and that imposing simple structure (i.e., by removing non-conforming items) may degrade test information and increase standard errors (Pettersson and Turkheimer, 2014). CFA imposes simple structure, as items are commonly associated with a single factor only (unless explicitly specified otherwise); cross-loadings on other factors are set to zero. The SPF-IRI has shown acceptable, yet less than ideal, internal consistencies ranging from alpha = .66 to .74 (Paulus, 2009). During the development of the SPF-IRI, Paulus (2009) documented considerable cross-loadings for several items. Due to the lack of available data, one can only speculate that this could be a reason why the factor structure of the SPF-IRI has not been confirmed using CFA. Hence, it is probable that the SPF-IRI does not perfectly comply with the assumption of a simple structure.
Exploratory Structural Equation Modeling (ESEM) has been introduced as an alternative method to evaluate the factor structure of scales (Marsh, Morin, Parker, and Kaur, 2014). ESEM aims to combine the advantages of both EFA and CFA. It is less restrictive and freely estimates item loadings on all factors. ESEM is thought to offer a more realistic approach to common personality constructs, following the assumption that CFA is often too restrictive (Marsh et al. 2014). However, ESEM does not necessarily provide more insight. For example, good model fit in ESEM could also be achieved when different item-to-factor associations emerge. Thus, one has to carefully examine the patterns of factor loadings to assure that results are comparable to a theoretically predicted model. Still, ESEM has considerable advantages over CFA when substantial cross-loadings are to be expected. Simulation studies have shown that even small cross-loadings, as small as .10, should be taken into account to prevent biased estimates (Asparouhov, Muthén, and Morin, 2015). Lucas-Molina et al. (2017) attempted an ESEM analysis of the Spanish version of the IRI. ESEM proved to be superior to CFA, even though model fit was only barely acceptable.
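To make the .10 criterion concrete, the following sketch (a hypothetical illustration in Python, not part of the reported analyses; all loading values are invented) flags items whose secondary loadings reach that threshold:

```python
# Flag items whose cross-loadings reach a threshold (here .10, following
# Asparouhov et al., 2015). Each row holds one item's loadings on all
# factors; `target` names the factor each item is supposed to measure.
# The loading values below are invented for illustration only.

def items_with_cross_loadings(loadings, target, threshold=0.10):
    """Return indices of items with any non-target |loading| >= threshold."""
    flagged = []
    for i, row in enumerate(loadings):
        secondary = [l for j, l in enumerate(row) if j != target[i]]
        if any(abs(l) >= threshold for l in secondary):
            flagged.append(i)
    return flagged

# Four hypothetical items loading on two factors:
loadings = [
    [0.70, 0.05],   # clean simple structure
    [0.65, 0.12],   # noteworthy cross-loading
    [0.08, 0.60],   # clean
    [-0.15, 0.55],  # noteworthy cross-loading
]
target = [0, 0, 1, 1]
```

In this toy matrix, two of the four items would count as having noteworthy cross-loadings, illustrating how quickly the criterion is met even with seemingly small secondary loadings.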

Measurement invariance
Measurement invariance (MI) maintains that a valid measurement model has to hold across different samples, in order to compare scores across contexts, times, or groups of participants (Vandenberg and Lance, 2000). MI ensures that scores reflect the same latent construct to the same degree. Many analyses we commonly take for granted, such as correlations or comparisons of mean scores across groups, are only valid as far as MI holds (Chen, 2008). Otherwise, unequal measurement might obscure or bias true associations or differences. One needs to ascertain that any differences in scale means (or latent means) are due to true differences, not different item utilization (different loadings) or item bias (item difficulty). In the case of the IRI, gender differences have emerged in many studies. As MI is commonly tested in a CFA framework and due to the lack of an accepted factor model based on CFA, MI for groups of women and men has not been tested for the SPF-IRI.
MI is commonly tested using a series of increasingly restrictive CFA models (Brown, 2006; Meredith, 1993; Vandenberg and Lance, 2000). Marsh et al. (2009) provided an extensive taxonomy to test MI using 13 partially nested ESEM models, yet essentially the invariance of five groups of parameters is tested in various combinations. I will thus comply with the more traditional approach (Vandenberg and Lance, 2000). Four stages of MI are commonly tested: Configural MI (M1) indicates equal construct dimensionality and item-to-factor associations across groups. Factor loadings, item intercepts, and residuals can still differ. Due to the nature of ESEM, where all factors are associated with all items, this step is largely trivial for ESEM, yet it provides a reference model for later tests. Metric MI (M2) indicates that all factor loadings are equal across groups; in the case of ESEM, this includes all loadings of an item on all factors. Scalar MI (M3) assumes that all item intercepts are equal. Strict MI (M4) also requires equal item residuals. Additionally, one can test the equality of structural parameters, such as factor variances and covariances (M5), and factor means (M6). Different levels of MI have different implications. Metric MI indicates that the same psychological meaning is captured and is commonly considered to allow for comparisons of (latent) variance/covariance structures, such as correlations (van de Schoot, Lugtig, and Hox, 2012). Strictly speaking, metric MI only licenses comparisons of unstandardized covariances; comparing correlations, i.e., standardized coefficients, additionally requires equal factor variances. Scalar MI allows for a comparison of (latent) factor means. Strict MI indicates that differences reflect true differences on the latent variables, rather than random measurement error; this assures equal reliability, and one can directly compare scale means.
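The sequence of models above is strictly nested: each step carries over all equality constraints of the previous step and adds one more parameter group. A minimal Python sketch (illustrative only, not Mplus syntax) encodes this ladder:

```python
# Illustrative encoding of the nested MI model sequence (M1-M6) described
# above: each step constrains one additional group of parameters to
# equality across groups, on top of all constraints of the previous step.

MI_STEPS = [
    ("M1 configural", set()),
    ("M2 metric", {"loadings"}),
    ("M3 scalar", {"loadings", "intercepts"}),
    ("M4 strict", {"loadings", "intercepts", "residuals"}),
    ("M5 structural", {"loadings", "intercepts", "residuals", "(co)variances"}),
    ("M6 means", {"loadings", "intercepts", "residuals", "(co)variances", "latent means"}),
]

def is_nested_sequence(steps):
    """True if every model constrains a superset of the previous model's parameters."""
    return all(prev[1] <= curr[1] for prev, curr in zip(steps, steps[1:]))
```

The nesting is what makes the χ² difference tests in the next section meaningful: each comparison pits a model against the strictly less constrained model one rung below it.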

Study overview
Due to the lack of a proper investigation of the internal structure of the SPF-IRI, the present research aims to investigate two research questions: RQ1: Establishing a suitable factor model using CFA or ESEM. I will compare CFA and ESEM models in two subsamples. I hypothesize that CFAs will fail to show acceptable fit, whereas ESEM will be superior. In order to cross-validate the factor model, I will use two subsamples from two separate studies.
RQ2: Once a valid factor model has been established, I will then investigate MI across gender groups in the full, combined sample. The combined sample will be used at this stage, in order to maximize available participants, given the comparatively lower number of men vs. women that is all too common in psychological studies.

Procedure and participants
The data analyzed in the present study came from two subsamples. All participants completed online studies. In both studies, data were collected in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants and participation was completely voluntary. Subsample 1 included N = 1033 (75.2% female) participants with a mean age of 41.83 years (SD = 14.14; range = 13-83). Subsample 2 included N = 1842 participants (85.5% female; M age = 28.11; SD age = 9.22; range = 15-77). The full sample thus included N = 2875 participants (81.8% female; M age = 33.05; SD age = 13.03). Descriptives are presented in Table 1. Men were significantly older than women in subsample 1 (M men = 44.20 vs. M women = 41.05; SD men = 14.40 vs. SD women = 13.97; t = 3.05; df = 424.20; p = .002; d = .22), but not in subsample 2 (M men = 28.94 vs. M women = 27.97; SD men = 10.03 vs. SD women = 9.08; t = 1.58; df = 1836; p = .12; d = .10). Participants in subsample 2 had diverse educational backgrounds (18.3% basic schooling; 45.6% high school; 36.1% university level degree). Subsample 1 included more participants from a higher educational background (8.4% basic schooling; 32.1% high school; 59.4% university level degree). Participants were recruited via social media sites (i.e., Facebook) and e-mail lists from the local university. In both studies participants could participate in a lottery for compensation. The survey software reminded participants to respond in case of missing values, so there were no missing data. Participants were instructed that they could drop out of the study at any time and only data provided by participants who completed the entire survey were analyzed.
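The reported age comparison can be reproduced from the summary statistics alone. Below is a Python sketch; the group sizes (256 men, 777 women) are approximated from the reported 75.2% female share of N = 1033, so the last digits may differ slightly from the reported values:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d using a pooled standard deviation."""
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# Age comparison in subsample 1 (men vs. women); group sizes approximated
# from the reported 75.2% female share of N = 1033.
t, df = welch_t(44.20, 14.40, 256, 41.05, 13.97, 777)
d = cohens_d(44.20, 14.40, 256, 41.05, 13.97, 777)
```

This reproduces t ≈ 3.05, df ≈ 424, and d ≈ .22, matching the reported values up to rounding of the input statistics.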

Measures
The German SPF-IRI includes 16 items measuring four dimensions of empathy with four items each. Answers were given on 5-point scales (1 = never to 5 = always). Reliabilities (Cronbach's alpha and McDonald's ordinal omega) were computed for the full sample, as well as for subsamples and subgroups of men and women, and are depicted in Table 1. Items were aggregated to scale sum scores, as there were no missing values.
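Cronbach's alpha was computed with standard software; for reference, a minimal Python sketch of the formula, applied here to toy data rather than the study data, looks like this:

```python
def cronbach_alpha(items):
    """Cronbach's alpha from a list of item score vectors (one list per item)."""
    def var(x):  # sample variance
        m = sum(x) / len(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-person sum scores
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Toy data: three items answered by four persons. Responding is perfectly
# consistent across items, so alpha equals 1 exactly.
toy = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
```

The formula makes clear why alpha depends on inter-item covariance: the smaller the item variances are relative to the variance of the sum score, the higher the estimate.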

Statistical analysis
I used SPSS 22 for descriptives, and Mplus 7.4 (Muthén and Muthén, 1998-2012) for confirmatory factor analyses (CFA) and exploratory structural equation modeling (ESEM). Model fit was evaluated using χ² tests (Bentler and Bonett, 1980), the comparative fit index (CFI) and Tucker-Lewis index (TLI) with values > .90 indicating acceptable model fit (Bentler, 1990; Hu and Bentler, 1999), the root mean square error of approximation (RMSEA) with values < .08 indicating acceptable model fit (Browne and Cudeck, 1993), and the standardized root mean square residual (SRMR) with values < .08 (Hu and Bentler, 1999) or < .05 (Schumacker and Lomax, 2010) reflecting good fit. If models are based on the same data and variables, they can be compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Lower scores indicate better model fit (Akaike, 1987), and differences greater than 10 are considered meaningful (Raftery, 1995). AIC emphasizes model accuracy, whereas BIC provides a trade-off between accuracy and parsimony. Robust Maximum Likelihood (MLR) was used for parameter estimation. I will also provide Raykov's composite reliability (CR; Raykov, 1997) as an SEM-based reliability estimate. McDonald's ordinal omega was computed using the "psych" package (Revelle, 2019) for the statistical software R (R Foundation for Statistical Computing, 2020). Ordinal omega can be computed based on polychoric correlations and should be better suited to estimate reliability for coarse ordinal scales (Gadermann, Guhn, and Zumbo, 2012; Viladrich, Angulo-Brunet, and Doval, 2017).
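For readers who want to verify reported values, the fit indices and information criteria can be computed from the χ² statistics and log-likelihood. The following Python sketch uses one common formulation of each index (Mplus reports them directly; the numbers in the example are invented, not from the present study):

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (one common formulation)."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

def cfi(chi2, df, chi2_null, df_null):
    """Comparative fit index, relative to the independence (null) model."""
    return 1 - max(chi2 - df, 0) / max(chi2_null - df_null, chi2 - df, 0)

def tli(chi2, df, chi2_null, df_null):
    """Tucker-Lewis index; penalizes model complexity via the df ratios."""
    return (chi2_null / df_null - chi2 / df) / (chi2_null / df_null - 1)

def aic(loglik, k):
    """Akaike Information Criterion: each free parameter costs 2 points."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: the parameter penalty grows with log(n)."""
    return k * math.log(n) - 2 * loglik
```

For example, an invented model with χ² = 150 on df = 100 and N = 500 yields RMSEA ≈ .032, i.e., well within the < .08 criterion cited above.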
For measurement invariance (MI), models are compared from one step to the next using χ² difference tests. MLR yields scaled χ² values, and therefore Satorra-Bentler scaled χ² difference tests have to be used (Satorra, 2000; Satorra and Bentler, 2001). Because χ² tests are greatly influenced by sample size, model fit indices are predominantly used to judge model fit for MI. From one step to the next, a drop in CFI or TLI of no more than .010 is conventionally considered acceptable, unless there is a concurrent increase in RMSEA larger than .015 (Chen, 2007; Cheung and Rensvold, 2002). For ESEM, researchers are encouraged to focus on RMSEA and TLI, because they include a stronger penalty for model complexity (Marsh, Nagengast, and Morin, 2013).
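The scaled difference statistic is not a simple difference of the two scaled χ² values. Following Satorra and Bentler (2001) and the procedure documented for Mplus, the scaled values are first rescaled via their correction factors, and the difference is then divided by a difference-test scaling correction. A Python sketch with invented numbers:

```python
def sb_scaled_diff(T_r, c_r, df_r, T_f, c_f, df_f):
    """Satorra-Bentler scaled chi-square difference test.

    T_r, c_r, df_r: scaled chi-square, scaling correction factor, and df
                    of the more restrictive (nested) model.
    T_f, c_f, df_f: the same for the less restrictive (comparison) model.
    Returns the scaled difference statistic and its degrees of freedom.
    """
    # Difference-test scaling correction (Satorra and Bentler, 2001):
    cd = (df_r * c_r - df_f * c_f) / (df_r - df_f)
    # Multiplying each scaled value by its correction factor recovers the
    # regular ML chi-square; their difference is then rescaled by cd.
    trd = (T_r * c_r - T_f * c_f) / cd
    return trd, df_r - df_f

# Invented example values (not from the present study):
trd, ddf = sb_scaled_diff(T_r=200, c_r=1.2, df_r=110, T_f=150, c_f=1.1, df_f=100)
```

Note that when the correction factors differ strongly between models, the naive difference of the scaled χ² values can be substantially misleading, which is why the rescaling step matters.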

Results
Descriptive data analysis of SPF-IRI scores
SPF-IRI descriptives, correlations, and reliability estimates are presented in Table 1. As shown in Table 2, ESEM reproduced the theoretically predicted item-to-factor patterns, yet substantial cross-loadings were evident.

Measurement invariance
Results of the MI tests are depicted in Table 3. The initial configural MI model (M1) fit the data well. Constraining factor loadings to be equal resulted in an improvement in model fit (metric invariance: M2). The comparably large difference can also be attributed to the large gain in df, as all factor loadings in ESEM are now set to be equal across groups. Curiously, the χ² value for the overall model slightly decreased at this stage. Constraining intercepts (scalar invariance: M3) and residuals (strict invariance: M4) showed a slight decrease in model fit that was within acceptable levels. An investigation of structural parameters confirmed that variances and covariances (M5) were also equal. The last model (M6) added equal factor means. In light of the initially shown scale mean differences, equality of latent means could not be expected. Indeed, model fit decreased markedly. On the basis of M5, latent mean differences could be estimated (unstandardized; female group with means fixed to zero). Men had lower means than women on FT, EC, and PD. The difference for PT (p = .07; d = 0.08) barely failed to reach significance, with a slight tendency for men to score higher on PT than women. Taken together, MI testing indicated that measurement was comparable across gender groups.

Discussion
The present research investigated factorial validity and measurement invariance (MI) of the German version of the Interpersonal Reactivity Index (SPF-IRI) across gender groups. For the first time, a four-factor structure could be replicated for the SPF-IRI, in line with the conception of empathy by Davis (1983). Across two subsamples, ESEM was superior to CFA. Paulus (2009) documented cross-loadings for some items of the SPF-IRI during the scale construction; he named items 9, 11, and 14 as having strong cross-loadings. In both samples, I could not exactly reproduce the cross-loadings for these specific items.
In the present study, cross-loadings were evident for several more items. Asparouhov et al. (2015) showed that cross-loadings ≥ .10 can bias estimates. Following this criterion, nine out of 16 items in subsample 1 and ten out of 16 items in subsample 2 had noteworthy cross-loadings. Given that the majority of items showed substantial cross-loadings, ESEM appears to be clearly superior to CFA for modeling the SPF-IRI. Notably, these results could be cross-validated in two separate samples. The overall item-to-factor patterns were found to be in accordance with the official structure. To further test the validity of the ESEM approach, I next examined MI across gender groups. Strict MI (factor loadings, intercepts, residuals) could be established. Additionally, all variances and covariances were equal. Reliability of the SPF-IRI was investigated using Cronbach's alpha, McDonald's ordinal omega, and Raykov's composite reliability. The reliability of the SPF-IRI was acceptable, but less than desirable, in line with prior investigations (Koller and Lamm, 2015; Paulus, 2009). Contrasting the original data by Paulus (2009), but mirroring the data presented by Koller and Lamm (2015), the Empathic Concern (EC) subscale emerged as the least reliable subscale. The present research is the first documented attempt to use CFA or ESEM with the German SPF-IRI, so results are harder to put into context. For other languages, CFA and even ESEM could not produce acceptable model fit in almost all cases. Most often, items had to be dropped (Garcia-Barrera et al. 2017) or item parceling was used (Hawk et al. 2013). I suggest that other researchers also try ESEM to investigate the factor structure of the IRI. Koller and Lamm (2015) conducted an analysis of the German SPF-IRI based on item response theory and found considerable misfit.
They concluded that only the empathic concern (EC) subscale had acceptable validity and basically dismissed the personal distress (PD) subscale as "not very informative or reliable." The present research offers another, more positive, picture of the SPF-IRI. There is still an ongoing discussion as to whether empathy should be considered a multidimensional concept of correlated factors. Some researchers argue that a hierarchical model could be more appropriate (e.g., Cliffordson, 2001, 2002). So far, empirical results have been inconclusive. Fernández et al. (2011) tested a second-order factor model and a model of four correlated factors using the Spanish version of the IRI. Results showed a very slight advantage of the four-factor model, even though all models clearly failed conventional cut-offs for model fit. Further investigations of the structure of empathy may help to better understand the concept on a theoretical level. Cognitive and emotional aspects may still be too intertwined in the IRI scales. Based on the present study, it is apparent that the SPF-IRI does not fully comply with a simple structure. The empathy concept by Davis (1980, 1983) is organized based on social contexts, rather than psychological processes. IRI results thus present an inherent confound of cognitive and emotional processes across contexts that may be the source of the cross-loadings. Finally, differences emerged for the latent means of men and women. Women scored higher on all dimensions of empathy, except for perspective taking (PT). These findings are in line with prior research (Fernández et al. 2011; Gilet et al. 2013; Lucas-Molina et al. 2017).

Limitations
Despite the large sample, there was a substantial imbalance between genders. Nonetheless, these data included a sufficient number of men to conduct the CFAs and ESEMs. A core issue for cross-national comparisons can be found in the special German language version that includes only 16 items (Paulus, 2009). Ideally, measurement invariance should be tested across different language versions. Given that the German SPF-IRI does not retain all original 28 items, such an investigation is not easily possible. Future research should provide a newer empathy measure that addresses basic psychometric disadvantages of the current IRI and also allows for easier cross-national comparisons (Steenkamp and Baumgartner, 1998). I observed a slight drop in χ² values after putting equality constraints on factor loadings. It is generally unlikely that a model with more restrictions shows better absolute fit than a model with fewer restrictions. Due to its exploratory nature, ESEM could adapt to slight changes in model parameters, as all factor loadings and all cross-loadings are set to be equal at this stage. At the level of factor extraction and rotation, a new factor solution could technically be found each time. ESEM has commonly been accepted for testing measurement invariance (Marsh et al., 2013) and researchers have been advised to use ESEM if a more traditional multigroup-CFA approach fails (Greiff and Scherer, 2018). Some argue that current methods for testing measurement invariance are overly strict (Davidov, Muthen, and Schmidt, 2018). Future research might look into the possibility that ESEM is not strict enough for testing metric invariance. This question, however, requires a more substantiated methodological investigation that goes beyond the scope of the present research.
In the present case, I still consider ESEM to be appropriate and the MI results reliable, because ESEM provided much better fit compared to CFA, and MI even beyond metric MI could be supported. Still, users should be aware that ESEM has its disadvantages, including less clear-cut interpretation of factors, an increased number of model parameters, and thus larger required sample sizes.

Conclusions
The theoretically predicted four-factor structure could be replicated. ESEM was superior to CFA for modeling the SPF-IRI due to the existence of cross-loadings. Strict measurement invariance could be established across gender groups, and measurement was comparable for women and men. The factorial validity of the SPF-IRI could be supported. The heterogeneity of empathy and the unclear differentiation between cognitive and emotional aspects may explain the imperfect differentiation among the scales.