The Psychometric Properties of an Intrinsic Motivation Scale in Conducting Research: The Application of the Rasch Measurement Model

Research culture, a system that places importance on conducting and communicating scholarly research, is highly expected of academic staff in higher learning institutions. Skilled academics with high levels of (intrinsic) motivation to conduct research are needed to achieve the desired research culture. A research project was therefore conducted to examine the possibility of understanding and promoting research culture at a public university in Malaysia, and distinctive instruments were developed for this purpose. This paper thus aims to examine the psychometric properties of a 44-item survey instrument developed to investigate academic staff's intrinsic motivation in conducting research. The survey was administered to 326 academic staff from various faculties in a public university. The Rasch Measurement Model, which provides evidence on the fundamental measurement requirements of a research instrument, was used to analyse the psychometric properties of the survey instrument using the Winsteps software program. The results of the Rasch analysis, combined with qualitative investigation, showed that the survey measured two distinct subscales or sub-dimensions of intrinsic motivation (positive and negative), which should be analysed separately. The resulting two subscales met the measurement requirements, as evidenced by the individual Rasch Model analyses. Three misfit items were deleted from the second (positive) subscale. Further items could be added to the two subscales to target respondents with high ability measures. Recommendations were also given to revise the 5-point Likert scale used in surveys in other related studies.


Motivation and Intrinsic Motivation
To be motivated means to be moved to do something (Ryan and Deci, 2000). A motivated person feels stimulated, inspired and energized to do something and to work towards an end. In theory, the more attractive the performance, the greater the effort an individual will expend: a performance is most attractive when it is very difficult, and least attractive when it is easy to achieve (Costin, 2014). Thus, motivation has been recognized as one of the major factors in improving individuals' performance (Guilloteaux and Dornyei, 2008).
Motivation can be divided into two parts, namely intrinsic and extrinsic motivation. Extrinsic motivation refers to performing a task in order to receive observable and tangible rewards, recognition, respect and appreciation. In theory, extrinsic motivation constitutes four elements: 1) desire for affiliation; 2) fear of consequences; 3) ambitions; and 4) normative trend (Moldovan, 2014). The author also suggests the possibility of transforming extrinsic motivation into intrinsic motivation through exposure to and stimulation of the subjects' environment. Intrinsic motivation is the doing of an activity for its inherent satisfaction and enjoyment, for reasons that lie within the activity itself rather than in its consequences (Ryan and Deci, 2000; Wigfield and Eccles, 1992). Cerasoli et al. (2014) and Pinder (2011) stress the importance of intrinsic motivation: rather than being instrumental towards some other object of value, it motivates behaviours that are enjoyable and purposive and provides sufficient reason for an individual to persist. In an academic situation, intrinsic motivation leads to deeper processing, greater mastery, and better implementation of strategies (Covington, 2000). Thus, intrinsic motivation is a very significant drive for performing achievement-related activities, because learning comes as a by-product of engaging in an enjoyed task and of one's self-determination. Evidently, intrinsic motivation remains a strong predictor of performance regardless of the availability of incentives (Cerasoli et al., 2014). Hannam and Narayan (2015) assert that intrinsic motivation helps individuals to have a positive perception of the nature of their work and more creative and imaginative minds in solving problems. According to Shalley et al. (1987), individuals demonstrate high intrinsic motivation when they attain difficult goals and anticipate no external evaluation.
This shows that an individual's intrinsic motivation depends on the difficulty of the goals rather than on expectations.

Objective of the Study
The purpose of this paper is to provide a psychometric analysis, using the Rasch measurement model, of a scale used to measure intrinsic motivation. The scale consists of 44 items related to intrinsic motivation, rated on a five-point Likert-type scale ranging from Strongly Disagree (1), Disagree (2), Neutral (3), Agree (4), to Strongly Agree (5), and was completed by university academic staff.

Research Method
A survey research design was utilized in this study (Creswell, 2013). A 44-item survey was individually administered to 326 academic staff randomly selected from various faculties in a public university in Malaysia to collect data on their intrinsic motivation in conducting research. The survey was developed based on a thorough literature review and interviews with academic staff in psychology and education. Several items were added and others were modified in line with the feedback given by the academic staff during the content and face validity process, giving a total of 44 items in the final survey instrument. Regarding sample size, when using the Rasch Model it is recommended that the most reliable interpretation comes from a sample of around 100 or more (Bond and Fox, 2015; Green and Frantom, 2002; Linacre, 1994; Wright and Stone, 1979). In this study, one-parameter Rasch modeling for polytomous data was used to examine the psychometric properties of the instrument. As mentioned earlier, Rasch modeling converts the ordinal raw data collected through surveys into interval measures (Bond and Fox, 2015; Curtis and Boman, 2007). It provides necessary diagnostic information about the instrument at the macro level (e.g. item and scale reliability indices), the meso level (e.g. location parameters and item fit indices), and the micro level (e.g. individual item thresholds through their locations and standard errors) (Curtis and Boman, 2007). The Rasch analysis was conducted using the Winsteps software, version 3.72.1 (Linacre, 2011). The results are depicted in the Tables and Figures. It is important to highlight that the negative items were recoded in the first run; however, many of the items showed misfit values and negative correlations. Below is the analysis for all the survey items without recoding, followed by the analyses for the resulting two subscales (negative, 20 items; positive, 24 items).
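As an illustration of the one-parameter polytomous Rasch (rating scale) model used here, the minimal sketch below computes the response-category probabilities for a single person-item encounter. This is not the Winsteps implementation; the function name is hypothetical, and the threshold values are borrowed from Table 5 purely for illustration.

```python
import numpy as np

def rating_scale_probs(theta, delta, taus):
    """Andrich rating-scale model: probabilities of the m+1 response
    categories for a person of ability `theta` (in logits) on an item of
    difficulty `delta`, given m category thresholds `taus` shared by all
    items of the scale."""
    steps = theta - delta - np.asarray(taus, dtype=float)
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    expv = np.exp(logits - logits.max())  # shift by max for numerical stability
    return expv / expv.sum()

# An average person (theta = 0) answering an average item (delta = 0) on a
# 5-category Likert scale; thresholds taken from Table 5 for illustration
probs = rating_scale_probs(theta=0.0, delta=0.0,
                           taus=[-1.66, -0.22, 0.23, 1.65])
```

With these illustrative values, the middle (neutral) category is the most probable response for an average person, consistent with the heavy use of the neutral category noted later in the analysis.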

Screening and Cleaning Data
Each survey was examined to ensure the integrity of the data collected, and incomplete surveys were excluded from the analysis. The individual responses and demographic information were then keyed into an SPSS file (version 16) and checked for data entry errors (Pallant, 2013). The final SPSS data set was imported into the Winsteps software program, version 3.72.1, for Rasch modelling (Linacre, 2011).

Adequacy of the Research Instrument - Overall
The initial analyses of the collected data were conducted to check the adequacy of the scale developed to collect data on intrinsic motivation in conducting research among academic staff in a public university. Item and person reliability indices were examined, followed by determining the validity of items through three indicators: Item Polarity, Item Fit, and Unidimensionality (Bond and Fox, 2015). If any sub-dimension is identified, a separate measure for that dimension should be created (Bond and Fox, 2015). It is important to mention that the negative items were recoded, but no improvement in the measures was observed. Below is the analysis for all items, followed by the resulting two subscales.
Based on the summary statistics (after deleting the most misfitting persons), Table 1 shows that the reliability of the item difficulty measures is very high (1.00). This suggests that the ordering of item difficulty is highly replicable with another comparable sample. The item separation index was > 2. Table 1 also reveals that the person reliability is fair (.74), suggesting that it is highly likely that the ordering of respondents can be replicated with other items of the same difficulty. The person separation index is 1.69, indicating that the items can divide the respondents into two levels. Table 1 also displays item polarity and item fit statistics.
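The separation indices reported alongside the reliabilities follow from a standard conversion; a minimal sketch of that arithmetic (not tied to Winsteps output) is:

```python
import math

def separation(reliability):
    """Rasch separation index G = sqrt(R / (1 - R)): the ratio of the
    'true' spread of the measures to their average measurement error."""
    return math.sqrt(reliability / (1.0 - reliability))

def strata(g):
    """Number of statistically distinct levels the measures support,
    using the conventional formula (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0

g_person = separation(0.74)  # person reliability reported in Table 1
# g_person is about 1.69, matching the reported person separation index
```

A person reliability of .74 thus corresponds to a separation of roughly 1.69 and between two and three distinguishable strata, which is why the text speaks of the items dividing respondents into two levels.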
Item polarity (i.e. the point-measure correlation coefficient) indicates the extent to which the scale items work in the same direction to define the measured construct. Negative and zero values indicate that items or examinees are working in the wrong direction; relatively high positive values are desired (Linacre, 2010). Table 1 shows that the point-measure correlations (PTMEA CORR.) for the 44 items are positive, but some items are below 0.3 (between 0.12 and 0.29).
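The point-measure correlation is simply the Pearson correlation between an item's observed ratings and the respondents' Rasch measures. A minimal sketch with made-up data (the function name and values are hypothetical):

```python
import numpy as np

def point_measure_corr(item_responses, person_measures):
    """PTMEA CORR.: Pearson correlation between one item's observed
    ratings and the respondents' Rasch ability measures. Values near or
    below zero flag an item working against the intended direction."""
    return float(np.corrcoef(item_responses, person_measures)[0, 1])

# Hypothetical data: ratings that rise with ability correlate positively;
# a reverse-keyed (negative) item correlates negatively until recoded
measures = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
aligned = np.array([1, 2, 3, 4, 5])
reversed_item = aligned[::-1]
```

This is also why, as noted in the Research Method section, the negatively worded items showed negative correlations in the first run before the scale was split.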
Fit statistics are used to ensure that the items contribute meaningfully to the measurement of the variable or construct as expected by the model (Bond and Fox, 2015). The two major fit statistics used are the infit and outfit mean-square statistics, which indicate the amount of "distortion of the measurement system" (Linacre, 2010). The recommended fit-statistic range for scale items is 0.5-1.5. Items within this range are considered productive or meaningful for measurement; values below the range indicate overfitting items, while values above it indicate misfitting items (Bond and Fox, 2015). Table 1 shows that the infit and outfit mean-squares of the individual items are within the specified range (0.5-1.5), indicating that they were working as expected by the model. In the Rasch model, the data must fit the model usefully and the items must work together to measure a single unidimensional construct. The principal component analysis (PCA) of residuals is used to test unidimensionality. Table 2 shows that unidimensionality is violated: although the variance explained by the measures is 51% (acceptable), the factor analysis of the residuals indicates high unexplained variance in the first contrast (10.9 units, or 12.1%). This means that the largest factor extracted from the residuals was equivalent to 10.9 units, i.e. the strength of about 11 items (more than the 3 items needed to be considered a second factor) (Linacre, 2010). The PCA therefore likely identified two subscales or sub-dimensions of the scale, as shown in Table 2.
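The two mean-square statistics can be sketched as follows: both are built from squared residuals between observed and model-expected ratings, with infit weighting each observation by its model variance. This is an illustrative simplification, not the Winsteps implementation, and the data values are hypothetical.

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Illustrative infit/outfit mean-squares for one item across persons.
    `observed` are the ratings, `expected` the model-expected scores, and
    `variance` the model variance of each observation. Outfit is the plain
    mean of squared standardized residuals; infit weights each squared
    residual by its information (variance) before averaging."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    resid_sq = (observed - expected) ** 2
    outfit = float(np.mean(resid_sq / variance))
    infit = float(resid_sq.sum() / variance.sum())
    return infit, outfit
```

A value of 1.0 means the observed noise matches what the model predicts; values above 1.5 (as for the three items deleted later) signal more noise than the model expects.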
Remedial solutions were attempted, such as deleting misfit persons and a few items and recoding the negative items, but no improvements were observed. Therefore, the recommendation was to analyze the two subscales separately. Further qualitative investigation carried out by experts in psychology highlighted that this scale has two subscales or sub-dimensions: the first, with 24 items, represents the positive items, and the other, with 20 items, represents the negative items.
The category statistics of the scale showed that all categories were used by the respondents more than the recommended minimum of 10 responses (Bond and Fox, 2015). The average measures increase monotonically across the categories. The outfit mean-squares are less than the recommended value of 2 (Bond and Fox, 2015), indicating no noise in the measurement process. However, the distances between adjacent thresholds are less than 1.4 (Bond and Fox, 2015), which needs more investigation. It is also noticed that the mid-category (neutral) is used very often; it might attract respondents of many different ability levels.
From the analysis above, it could be concluded that the scale items are related to intrinsic motivation and contribute meaningfully to the construct. Nevertheless, the PCA analyses showed a violation of scale unidimensionality: the scale could be measuring two sub-dimensions (positive and negative). Other solutions were performed to examine the unidimensionality by deleting items and misfit persons, rescoring items, and collapsing categories, but no improvement was observed. Qualitative investigation by experts in psychology asserted that the items measure two subscales (positive and negative). Since the initial Rasch analysis of the overall scale showed that the scale consists of two dimensions, it is recommended to analyze each subscale individually for better measurement (Bond and Fox, 2015). Below are the analyses of the two subscales (negative, with 20 items, and positive, with 24 items, respectively).

First Subscale (Negative Items)
The Rasch Model analyses of the 20-item subscale were conducted to check its psychometric measurement properties. Item and person reliability indices were examined, followed by determining the validity of items through three indicators: Item Polarity, Item Fit, and Unidimensionality (Bond and Fox, 2015). Table 3 shows that the reliability of the item difficulty measures is very high (.99). This suggests that the ordering of item difficulty is highly replicable with another comparable sample. The item separation index is > 2. Table 3 also reveals that the person reliability is .83, suggesting that it is highly likely that the ordering of respondents can be replicated with other items of the same difficulty. The person separation index is 2.23, indicating that the items can divide the respondents into two levels. Table 3 also displays item polarity and item fit statistics: the point-measure correlations (PTMEA CORR.) for the 20 items are positive, ranging from .28 to .72, indicating that the items work in the same direction to define the measured construct, and the infit and outfit mean-squares of the individual items are within the specified range (0.5-1.5), indicating that they work as expected by the model. One item, however, has an outfit mean-square of 1.54, attributable to misfit persons. The analysis ensures that the items contribute meaningfully to the measurement of the construct as expected by the model (Bond and Fox, 2015). The principal component analysis of residuals is used to test unidimensionality, i.e. to ensure that the data fit the model usefully and the items work together to measure a single unidimensional construct. Table 4 shows that unidimensionality is not violated: the variance explained by the measures is 44.2%, and the largest factor extracted from the residuals is equivalent to 2.6 units, i.e. the strength of about 3 items (Linacre, 2010).
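The "units" quoted for the first contrast come from a PCA of standardized Rasch residuals: the largest eigenvalue of the item-by-item residual correlations, expressed in item-strength units. The sketch below (simplified relative to Winsteps, with synthetic data and a hypothetical function name) shows the idea.

```python
import numpy as np

def first_contrast_strength(std_residuals):
    """Size of the first contrast: the largest eigenvalue (in 'item'
    units) of the item-by-item correlation matrix of standardized Rasch
    residuals (persons x items)."""
    corr = np.corrcoef(std_residuals, rowvar=False)
    return float(np.linalg.eigvalsh(corr)[-1])  # eigvalsh sorts ascending

# Synthetic residuals: six items, three of which share a secondary dimension
rng = np.random.default_rng(0)
resid = rng.normal(size=(200, 6))
resid[:, :3] += 3.0 * rng.normal(size=(200, 1))
strength = first_contrast_strength(resid)
```

In this toy example the three correlated items yield a first contrast with the strength of roughly three items; in the overall 44-item scale the corresponding value was 10.9 units, which is what flagged the second dimension.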
Figure 1 (Item-Person Map) shows the distribution of all items and persons on one logit scale. The item difficulty measures spanned from -1.50 to 1.10 logits, while the person ability measures spanned from -2.71 to 2.12 logits. There were no significant visible gaps in the item distribution, except for two gaps at the upper and lower ends of the scale. Looking at the overlapping items, it was found that those items do not measure the same aspects of the measured construct. It is recommended to add a few more items at the top and bottom ends of the scale for more precise person estimates. Overall, the items targeted this sample's ability levels well. The hierarchy of the items is almost as expected, as confirmed by qualitative investigation with experts in psychology, i.e. the order of the items makes sense in this context. For the category functions, the category use statistics (i.e. category frequencies and average measures) for each option were examined. The former indicates how many respondents actually chose a particular response category; a recommended minimum number of responses per category is 10, and Table 5 shows that each category met this criterion. The average measures are defined as the average of the ability estimates for all persons in the sample who chose that particular response category, calculated across all observations in that category, as cited in Bond and Fox (2015). They further elaborate that these average measures should increase monotonically. Table 5 shows that the average measures for this scale increase monotonically (-1.72 < -0.86 < -0.25 < 0.40). The thresholds also increase monotonically (-1.66 < -0.22 < 0.23 < 1.65). However, the distance between adjacent thresholds, mainly between categories 3 and 4, is less than 1.4 (Bond and Fox, 2015). These categories could be collapsed for improvement (Bond and Fox, 2015; Green and Frantom, 2002; Linacre, 2002).
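The threshold diagnostics used here (monotonic advance, adjacent gaps of roughly 1.4 to 5 logits) can be checked mechanically. The sketch below applies that check to the Table 5 values; the function name is the author's own.

```python
def check_thresholds(taus, min_gap=1.4, max_gap=5.0):
    """Diagnose Andrich thresholds: they should advance monotonically,
    and adjacent thresholds should be separated by about 1.4 to 5 logits
    for a well-functioning rating scale."""
    gaps = [b - a for a, b in zip(taus, taus[1:])]
    monotonic = all(g > 0 for g in gaps)
    gaps_ok = all(min_gap <= g <= max_gap for g in gaps)
    return monotonic, gaps, gaps_ok

# Thresholds reported in Table 5 for the negative subscale
mono, gaps, ok = check_thresholds([-1.66, -0.22, 0.23, 1.65])
# Monotonic, but the 0.45-logit middle gap falls short of 1.4
```

The middle gap of about 0.45 logits is exactly the shortfall the text identifies between categories 3 and 4, motivating the suggestion to collapse them.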
Fit statistics provide another criterion for assessing the quality of rating scales. Table 5 shows that the fit statistics of each rating scale category are less than 2, indicating that there is no noise in the measurement process.
The same analysis process was used to examine the items defining the second subscale (positive items).

Second Subscale (Positive Items)
The Rasch Model analyses of the 24-item (positive) subscale were conducted to check its psychometric measurement properties. Item and person reliability indices were examined, followed by determining the validity of items through three indicators: Item Polarity, Item Fit, and Unidimensionality.
It is essential to first highlight that the analysis for this subscale revealed a problem in the category functions: the respondents did not appear to use the categories as intended, as shown in Tables 6 and 7. Table 6 shows that although the category frequencies exceed the 10 responses recommended by Bond and Fox (2015), the average measures do not increase monotonically (0.97 > 0.24 < 0.60 < 1.38 < 2.47 across the rating scale response categories 1-5). In addition, the outfit mean-square for category 1 is greater than 2, indicating that this category introduces more noise than meaning into the measurement (Bond and Fox, 2015). According to Bond and Fox (2015), it is recommended to collapse adjacent categories (i.e. 1 and 2) to improve the variable interpretation. Collapsing categories 1 and 2 improved the rating scale diagnostics, as shown in Table 7. Having deleted the three misfit items (see below), the reliability of the item difficulty measures is still very high (.99), as shown in Table 8. This suggests that the ordering of item difficulty is highly replicable with another comparable sample. The item separation index is > 2. Table 8 also reveals that the person reliability is .87, suggesting that it is highly likely that the ordering of respondents can be replicated with other items of the same difficulty. The person separation index is 2.59, indicating that the items can divide the respondents into two levels. Table 8 also displays item polarity and item fit statistics: the point-measure correlations (PTMEA CORR.) for the items are positive, ranging from .41 to .73, indicating that the items work in the same direction to define the measured construct, and the infit and outfit mean-squares of 21 individual items are within the specified range (0.5-1.5), indicating that they work as expected by the model.
However, three items, namely 18, 34, and 36, had infit and outfit mean-squares above 1.5 and were deleted from the analysis. Qualitative investigation of these items showed that item 18 (did not feel nervous) contains a double negation, which might have made it confusing or difficult for the participants to endorse; item 34 (benefits in monetary) might relate to extrinsic rather than intrinsic motivation; and for item 36 (value research than publication), the participants might have been confused about the difference between publication and research, or they might value both research and publications.
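The category collapsing applied to this subscale (merging categories 1 and 2 before re-running the Rasch analysis) amounts to a simple recode of the raw ratings; a minimal sketch, with a hypothetical function name:

```python
# Recode map merging Likert categories 1 and 2 (12345 -> 11234)
CAT_MAP = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}

def collapse_categories(ratings, mapping=CAT_MAP):
    """Collapse adjacent rating-scale categories prior to re-estimating
    the Rasch model, as recommended when the lowest category misfits."""
    return [mapping[r] for r in ratings]

recoded = collapse_categories([1, 2, 3, 4, 5, 2])
```

The recoded data are then re-analysed as a four-category scale, which is what restored monotonically increasing average measures in Table 7.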
The principal component analysis of residuals is used to test unidimensionality, i.e. to ensure that the data fit the model usefully and the items work together to measure a single unidimensional construct. Table 9 shows that unidimensionality is not violated. When persons with negative correlations were deleted, a slight improvement was observed, as shown in Table 9: the variance explained by the measures is 43.9%, and the largest factor extracted from the residuals is equivalent to 2.8 units, i.e. the strength of about 3 items (Linacre, 2010). Figure 2 (Item-Person Map) shows the distribution of all items and persons on one logit scale. The item difficulty measures spanned from -1.09 to 1.76 logits, while the person ability measures spanned from -1.25 to 6.7 logits. There was a significant visible gap at the upper end of the scale. Looking at the overlapping items, it was found that those items do not measure the same aspects of the measured construct. It is recommended to add a few more items at the top end of the scale for more precise person estimates. Overall, the hierarchy of the items is almost as expected, as confirmed by qualitative investigation with experts in psychology.
The analyses also showed that there were respondents with negative correlations and with infit and outfit mean-squares above the recommended value of 1.5; they were deleted for better measurement. As a result, a slight improvement was observed, mainly in the variance explained by the measures.

Implications and Conclusions
For a research culture to succeed, researchers must be willing to commit themselves to the mission of their higher education institutions. Intrinsic motivation is a good indicator of the level of commitment of researchers and of whether a research culture will be nurtured and established within institutions of higher learning. Academic staff who possess research skills and high levels of (intrinsic) motivation to conduct research are needed to promote a research culture at universities, and valid and reliable instruments to measure staff intrinsic motivation are in high demand to provide more accurate and reliable results. The initial Rasch Model analysis showed that the survey instrument used to measure intrinsic motivation among academic staff was measuring two subscales or sub-dimensions of intrinsic motivation (positive and negative). The individual Rasch Model analyses of the two subscales met the fundamental measurement requirements in terms of person and item reliability and separation indices, item correlation, fit statistics, unidimensionality, and targeting. However, three misfit items were deleted from the subscale with positive items. Overall, the hierarchical order of the items that define the two subscales makes sense and sounds reasonable. It is recommended to add further items, mainly at the top and bottom ends of the scales, to target respondents with high and low ability measures. Recommendations were also given to revise the 5-point Likert scale used in the surveys of other related studies.