Health and Wellbeing: Past Events/Training

The 8th annual UK Rasch User Group Day, 21st March 2014, at the University of York

Introduced by Tim Croudace (Health Sciences and HYMS)

Morning Session: 11am – 1pm

Chair: Jan R. Böhnke, Health Sciences and HYMS

See abstracts below.

Foreign Language Classroom Anxiety Inventory (FLCAI): a new scale for measuring FLCA

Panayiotis Panayides; Miranda Jane Walker, Secondary Mathematics Education, Cyprus

This study is divided into two parts. In the first part the psychometric properties of the Foreign Language Classroom Anxiety Scale (FLCAS) were investigated for Cypriot senior high school EFL students. The FLCAS (Horwitz, Horwitz & Cope, 1986) is a well-established, widely used scale. Results showed that after removing five items which fitted the Rasch Rating Scale model badly, the remaining 28 items formed a unidimensional scale. The degree of reliability was, in the opinion of the researchers, questionably high. Semantic analysis of the items revealed that one of the reasons was the inclusion of many parallel items. Further analyses revealed that a second reason was the narrow coverage of the construct by the items. Finally, the 5-point Likert scale was shown to be marginally optimal.

In the second part a new scale for measuring FLCA was constructed. It began with the creation of an extended item pool (39 items) generated using qualitative methods. Subsequent Rasch and semantic analyses led to the final 18-item Foreign Language Classroom Anxiety Inventory (FLCAI).

In comparison with the FLCAS, the FLCAI demonstrated more convincing evidence of unidimensionality, and the optimal 5-point Likert scale functioned better. The FLCAI, at 55% of the length of the FLCAS, is more practical to administer, maintains its psychometric properties and covers a wider range of the construct continuum, thus improving the validity of the instrument.

References

Horwitz, E.K., Horwitz, M.B., and Cope, J. (1986). Foreign language classroom anxiety. The Modern Language Journal, 70(2), 125-132.


Rasch analysis of the Spanish version of the Mindful Attention Awareness Scale (MAAS) in a clinical sample

Felix Inchausti, Andrew Bateman, Joe Mole & Miren Barrainkua, Universidad de Salamanca, Spain & Oliver Zangwill Centre for Neuropsychological Rehabilitation, Ely, UK

The use of mindfulness in clinical practice is becoming increasingly popular and the Mindful Attention Awareness Scale (MAAS) is one of the most frequently used tools for measuring it. The aim of this study was to test the effectiveness of mindfulness training and to analyse the psychometric properties of the MAAS in a clinical sample, using the Rasch model.

Methods: 199 participants with clinical symptoms of mood and anxiety disorders were recruited. The experimental group (N = 103) received 12 weeks of mindfulness training and the control group (N = 96) received conventional outpatient treatment of the same duration. The MAAS scores were analysed before and after both interventions to test the effectiveness of mindfulness training, the psychometric properties of the MAAS scores and differential item functioning (DIF) using the Rating Scale Model (RSM).

Results: Misfit in items 9 and 12, DIF in item 9 and problems with the Spanish translation of items 5, 9 and 12 were observed. The analysis was repeated with these items excluded, and a short version, the MAAS-12, yielded appropriate dimensionality, fit and reliability.

Conclusions: In contrast to previous studies, the MAAS was found to be sensitive to mindfulness training. However, the commonly used MAAS has some limitations associated with poor Spanish translation and its psychometric properties, and should be revised. The MAAS-12 was found to be a more accurate scale than the MAAS but suffered from construct under-representation. In order to measure all key facets of mindfulness appropriately, it will be necessary to construct a tool based on a coherent theoretical perspective.

Keywords: Mindfulness; Mindful Attention Awareness Scale; Meditation; Rasch model; item response theory (IRT).


Clinical Use of the EuroQol EQ5D-5L in Community Rehabilitation and Musculoskeletal Physiotherapy services: Item ordering, Item Bias and Disordered Thresholds identified using Rasch Analyses

Authors: Andrew Bateman (1,2), Felix Inchausti (1,3), Ian Moyes (1), Karen Fechter (1), Stephanie O’Connell (1)

(1) Cambridgeshire Community Services NHS Trust; (2) Dept of Psychiatry, University of Cambridge; (3) University Hospital of Badajoz

Introduction: Health-related quality of life (HRQoL) is often reported using single-digit indices derived from preference-based instruments such as the EQ5D. This widely used generic HRQoL instrument presents 5 health items (indicators): Mobility (MO), Self-care (SC), Usual activities (UA), Pain/discomfort (PD) and Anxiety/depression (AD). The EQ5D was recently revised from 3 to 5 response categories (hence 5L). Rasch analyses were selected to investigate the properties of the tool and assist in describing the rehabilitation service users.

Method: Electronic Health Record database entries were interrogated for EQ5D information and analysed using standard Rasch protocols. Data were partitioned by gender and into three age groups (<60; 60-80; 80+).

Results: For Community Rehabilitation, data were available from 1906 patients (mean age 74 +/- 17 years, range 17-102, 39% male) over 10 months from July 2012. The ordering of item difficulty was UA, MO, SC, PD, AD (logit range -1 to +0.8), and inspection of item-threshold plots indicates impressive targeting (all but 4% of the sample). Three of the five questions showed disordered thresholds (SC, UA, AD) that merited re-scoring (collapsing levels 4 and 5). Significant Differential Item Functioning (DIF) by age was noted for MO, UA and AD (F > 8.5, d.f. = 2, p < 0.0002). DIF by gender was noted for PD and AD (F > 11, d.f. = 1, p < 0.0008).

Further assessments of validity and measurement properties were undertaken by comparing this data set with an analysis of 426 patients attending musculoskeletal physiotherapy appointments. In this group Pain/Discomfort is most readily endorsed, followed in order by UA, MO, AD and SC. Disordered thresholds were similarly observed in SC, UA and AD.

Discussion: The order of item difficulties reflects the priorities of community rehabilitation, i.e. people seeking support to return to Usual Activities and improve Mobility. Likewise, MSK physiotherapy services are focussed on the treatment of pain, and it is appropriate that this patient group endorses this question most readily. Sources of misfit in the tool are important to recognise and should guide further analysis. Work in progress includes opportunities to examine the costs of interventions and the change in health status over the time patients are seen by our services. Use of RMSEA as an approach to examining fit is planned, given the relatively large n available to us.

Conclusion: The EQ5D-5L has merit as a generic patient-reported outcome measure, but analysis of data captured using it can benefit from this modern approach to analysis.


A Rasch analysis of the EQ-5D from the Hospital Episode Statistics: Is the instrument fit for purpose?

Adam B. Smith, York Health Economics Consortium Ltd., Market Square, University of York YO10 5NH

Background & Aim: The EuroQol-5D (EQ-5D) is a generic patient-reported outcome measure (PROM) allowing comparisons to be made across different diseases and conditions. The instrument has been used in the UK’s National Health Service (NHS) since 2008 to collect data from patients to assess the effectiveness of a number of surgical interventions and is used by the National Institute for Health and Care Excellence (NICE) for cost-effectiveness calculations. However, despite the widespread use of the EQ-5D, it has not been subjected to rigorous scrutiny to determine whether its psychometric properties hold across different patient populations. The aim of this study was to investigate the psychometric properties of the EQ-5D using the Rasch model.

Methods: The data were derived from published Hospital Episode Statistics (HES) for March 2013. The EQ-5D had been completed by patients undergoing four surgical procedures: groin hernia repair (GH, N=21831), hip replacement (HR, N=37800), knee replacement (KR, N=40429) and varicose vein repair (VV, N=4681). The partial credit model (Masters, 1982) was applied to the individual datasets. Category disordering, fit statistics (infit mean square (MNSQ) < 1.30), unidimensionality (principal components analysis, eigenvalue of the 1st contrast < 2.0), and person separation and reliability (criteria > 2 and > 0.8 respectively) were determined. A random sample (N=~5000) was drawn from each dataset to assess differential item functioning (DIF) across the four medical conditions (criterion: difference > 0.5 logits).

Results: No category disordering was observed. Misfit was observed for the Anxiety/Depression domain for the GH (MNSQ = 1.39) and VV repair (MNSQ = 1.38) groups. The eigenvalues for the 1st contrast were 1.5 (GH), 1.4 (HR), 1.3 (KR) and 1.4 (VV), indicating that there was no further dimensionality present. Person separation / reliability was 0.73 / 0.35 for GH, 1.35 / 0.63 for HR, 1.06 / 0.53 for KR and 0.91 / 0.45 for VV, suggesting that the EQ-5D is not sensitive enough to differentiate between different levels of the latent trait. There was significant DIF between the 4 treatment groups: 50% of all possible contrasts showed DIF. The only domain not affected by DIF was Pain/Discomfort. DIF was present in 2/3 of the contrasts for Anxiety/Depression, Mobility and Self-care and in 50% of the contrasts for the Usual Activities domain.

Conclusion: The most important result of this study is the finding that the EQ-5D performs differentially depending on the patient group, meaning that, at least for the four patient groups investigated here, the instrument should not be used to draw comparisons across different surgical interventions. This has potentially significant ramifications for the use of the instrument as a measure of efficacy in the NHS and for cost-effectiveness calculations by NICE.
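
The paired criteria for person separation (> 2) and reliability (> 0.8) quoted above can be related with a short sketch. It assumes the abstract uses the conventional Rasch definition in which reliability R and separation G satisfy R = G^2 / (1 + G^2); the function names are illustrative and do not come from the study.

```python
import math


def reliability_from_separation(separation):
    """Reliability implied by a separation index G, assuming R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability):
    """Inverse relation: G = sqrt(R / (1 - R)), valid for 0 <= R < 1."""
    return math.sqrt(reliability / (1 - reliability))


# The two criteria in the abstract describe the same cut-off under this assumption.
criterion_reliability = reliability_from_separation(2.0)

# Separation reported for the groin hernia (GH) group.
gh_reliability = reliability_from_separation(0.73)
```

Under this assumption a separation of 2 corresponds to a reliability of 0.8, which is why the two criteria are quoted together. The values in the abstract are rounded, so figures recomputed this way may differ slightly in the second decimal place.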

Afternoon Session: 2pm – 4.30pm

Chair: Adam B Smith, York Health Economics Consortium Ltd

Blatant violation of independence: Should statistical tests be conducted on comparative judgement data?

Ian Jones, Matthew Inglis, David Sirl, Mathematics Education Centre, Loughborough University

Comparative judgement approaches to measuring achievement and monitoring standards are gaining popularity in education. Comparative judgement involves no marking; instead, assessors decide which is the “better” of presented pairs of scripts. The pairwise judgement decisions are fitted to a logistic model to produce a parameter estimate representing relative quality for each script. These parameter estimates are commonly used for routine assessment purposes such as statistical tests of the difference in performance between two forms of an examination. But is this warranted? Comparative judgement blatantly violates the assumption of independence because every decision made is an outcome of the relative “true score” and assessor error of two scripts. We might therefore expect the comparative judgement approach to educational assessment to increase the likelihood of Type I errors. In this presentation we will present the outcomes of a simulation study in which we compared statistical tests conducted on true scores with tests conducted on comparative judgement parameter estimates. We conclude that any inflated differences due to the violation of independence are negligible and of no practical concern.
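
As a rough illustration of how pairwise decisions can be turned into one quality estimate per script, the sketch below fits a Bradley-Terry model with the standard MM (Zermelo) update. The abstract does not say which logistic model or software the authors used, so the model choice, the function name and the judgement data are all assumptions made for illustration.

```python
import math


def fit_bradley_terry(n_scripts, judgements, iterations=200):
    """Estimate a quality parameter (in logits) for each script.

    judgements is a list of (winner, loser) index pairs. The sketch assumes
    every script wins and loses at least once, so all estimates are finite.
    """
    wins = [0] * n_scripts
    pair_counts = {}
    for winner, loser in judgements:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strength = [1.0] * n_scripts
    for _ in range(iterations):
        updated = []
        for i in range(n_scripts):
            # Sum over every pairing that involved script i.
            denominator = sum(
                count / (strength[a] + strength[b])
                for (a, b), count in pair_counts.items()
                if i in (a, b)
            )
            updated.append(wins[i] / denominator)
        # Fix the scale: geometric mean 1, so the logits sum to zero.
        log_mean = sum(math.log(s) for s in updated) / n_scripts
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


# Hypothetical decisions for three scripts (0, 1, 2), four judgements per pair.
judgements = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 3 + [(2, 0)]
)
estimates = fit_bradley_terry(3, judgements)
```

Each judgement contributes to the estimates of both scripts in the pair, which is the dependence the abstract refers to.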


Comparability of examination papers using paired comparisons

Malcolm Hayes, Pearson Qualification Services

In summer 2013 Edexcel introduced a set of alternative papers for International GCE and GCSE subjects in response to the risks associated with the security and integrity of examination papers being taken across multiple time zones. These alternative papers posed a challenge with regard to awarding due to the lack of prior attainment data for international candidates. Clearly, our aim is to ensure that there is no advantage to having sat the main or the alternative paper. To achieve this aim we undertook comparability studies to inform the award of our international time zone papers.

The awarding for the UK paper used prior attainment data and was conducted in accordance with both the Code of Practice and other regulatory requirements applicable to the series. Provisional grade boundaries were determined for the alternative papers based on a combination of professional opinion and available performance data.

The comparability studies involved sets of expert judges being asked to rank-order sets of ten scripts, five from each of the main and alternative papers and covering a range of marks either side of the appropriate grade boundary.

These ranks were converted to pairwise comparisons and analysed using FACETS©. The resulting measures were then compared with the distance of the script mark from the grade boundary, and the results were used to align the boundary on the alternative paper with that on the main paper. The recommendations were then used to inform the awarding on the alternative paper.

This presentation will discuss some practical considerations, illustrate how the process worked and outline areas for further investigation.
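
The conversion of a rank order into pairwise comparisons can be sketched as follows. This is only an illustration of the general idea, not the procedure or data format used in the study; the function name and script labels are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_scripts):
    """Turn one judge's rank order (best first) into implied (winner, loser) pairs.

    combinations() keeps the input order, so the first element of each pair
    is always the script the judge ranked higher.
    """
    return list(combinations(ranked_scripts, 2))


# Hypothetical ranking of three scripts: M = main paper, A = alternative paper.
example_pairs = ranking_to_pairs(["M3", "A1", "M7"])

# A full set of ten ranked scripts implies 10 * 9 / 2 pairwise comparisons.
pairs_from_ten = ranking_to_pairs(list(range(10)))
```

Pairs derived from a single ranking are necessarily consistent with one another, unlike judgements made one pair at a time, which is worth bearing in mind when interpreting the resulting measures.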


Maximum Likelihood Estimation of Rasch model parameters

Alex McKee, Pearson Qualification Services

In this presentation I outline the general procedure for conducting a Maximum Likelihood Estimation (MLE). I identify in detail the stages involved in an MLE procedure, using simple examples to illustrate the ideas. Thereafter I detail the procedure for using Maximum Likelihood Estimation to estimate the item parameters of the Rasch (logistic) model. Finally, from scratch, I use selected item-level data and go through the stages of MLE using the Rasch model to obtain the item parameters manually. I compare the result with that obtained from selected software, then conclude.
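
One way the stages described above might look in code is sketched below, using joint maximum likelihood (JMLE) with Newton-Raphson steps for a dichotomous Rasch model. The abstract does not state which maximum likelihood variant (joint, conditional or marginal) was presented, so the choice of JMLE, the function names and the response data are assumptions for illustration only.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def clamp(step, limit=1.0):
    """Limit the size of a Newton step to keep early iterations stable."""
    return max(-limit, min(limit, step))


def fit_rasch_jmle(responses, iterations=200):
    """Joint ML estimates (abilities, difficulties) from a 0/1 response matrix.

    Assumes no person and no item has a zero or perfect raw score.
    """
    n_persons, n_items = len(responses), len(responses[0])
    person_scores = [sum(row) for row in responses]
    item_scores = [sum(row[i] for row in responses) for i in range(n_items)]
    theta = [0.0] * n_persons
    b = [0.0] * n_items
    for _ in range(iterations):
        for n in range(n_persons):
            p = [rasch_probability(theta[n], b[i]) for i in range(n_items)]
            gradient = person_scores[n] - sum(p)        # observed minus expected
            information = sum(q * (1 - q) for q in p)
            theta[n] += clamp(gradient / information)
        for i in range(n_items):
            p = [rasch_probability(theta[n], b[i]) for n in range(n_persons)]
            gradient = sum(p) - item_scores[i]          # expected minus observed
            information = sum(q * (1 - q) for q in p)
            b[i] += clamp(gradient / information)
        # Identify the scale by centring item difficulties at zero.
        mean_b = sum(b) / n_items
        b = [value - mean_b for value in b]
    return theta, b


# Hypothetical data: six persons (rows) by four items (columns).
responses = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]
abilities, difficulties = fit_rasch_jmle(responses)
```

In this toy data set the items with higher raw scores receive lower difficulty estimates. JMLE estimates are known to be biased for short tests, which is one reason dedicated software often uses conditional or marginal maximum likelihood instead.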


Statistical procedures for detecting unstable anchor items

Muhammad Naveed Khalid, Mark Elliott, Cambridge English Language Assessment

The presence of outlying anchor items is an issue faced by many testing agencies. An outlying anchor item indicates that the item is functioning differently for the two groups of examinees who have taken the assessments to be equated; however, the decision to retain or remove an item is a difficult one, especially when removing items makes the content representation of the anchor test questionable. In this study, we address the limitations of an existing set of criteria based on chi-squared fit statistics and propose an objective and reliable index to determine which anchor items are problematic and should be discarded.


Reversed thresholds in partial credit models – A reason for collapsing categories?

Eunike Wetzel and Claus H. Carstensen, University of Tübingen & University of Bamberg, Germany

When questionnaire data with an ordered polytomous response format are analyzed in the framework of item response theory using the partial credit model or the generalized partial credit model, reversed thresholds may occur. This has led to a discussion of whether reversed thresholds violate model assumptions and indicate disordering of the response categories. Adams, Wu, and Wilson (2012) showed that reversed thresholds are merely a consequence of low frequencies in the categories concerned and that they do not impact the order of the rating scale. This study applies an empirical approach to elucidate the topic of reversed thresholds using data from the NEO-PI-R as well as a simulation study. It is shown that categories differentiate between participants with different trait levels despite reversed thresholds and that category disordering can be analyzed independently of the ordering of the thresholds. Furthermore, we show that reversed thresholds often only occur in subgroups of participants. Thus, researchers should think more carefully about collapsing categories due to reversed thresholds.
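
A small numerical sketch may help to show what a reversed threshold does and does not imply under the partial credit model. The item below is hypothetical (three categories, thresholds 1.0 and -1.0) and the code is not taken from the study; it only evaluates the standard partial credit model formula on a grid of trait levels.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one partial credit item at trait level theta."""
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, thresholds):
    """Model-expected item score at trait level theta."""
    probabilities = pcm_probabilities(theta, thresholds)
    return sum(k * p for k, p in enumerate(probabilities))


def modal_category(theta, thresholds):
    """Index of the most probable category at trait level theta."""
    probabilities = pcm_probabilities(theta, thresholds)
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


reversed_thresholds = [1.0, -1.0]            # hypothetical item, thresholds reversed
grid = [x / 10 for x in range(-40, 41)]      # trait levels from -4 to +4 logits

middle_is_ever_modal = any(modal_category(t, reversed_thresholds) == 1 for t in grid)
scores = [expected_score(t, reversed_thresholds) for t in grid]
score_is_increasing = all(low < high for low, high in zip(scores, scores[1:]))
```

For this item the middle category is never the most probable response, yet the expected score still rises with the trait level. That is in line with the abstract's point that threshold order and category order are separate questions, although a single toy item cannot settle the debate.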


Advances in Product Engineering Using the Rasch Model

Fabio R Camargo and Brian Henson, University of Leeds, School of Mechanical Engineering, Institute of Design, Robotics and Optimisation

Product experience is one of the most fascinating aspects of the interaction of people with physical objects. Since the 1970s, engineers and designers have used statistical methods to transform the users’ experience into physical parameters to manufacture products. However, product experience is frequently idiosyncratic, culturally located in the consumer’s values and dependent on the influence of social groups. As a consequence, comparisons between different studies and generalisation of results have been a challenge. A different approach is the construction of metrics for relating users’ experience in the real world (observed) to the affective attribute of a product (latent). The Rasch model has been adapted for this purpose. In 2005, using techniques from affective engineering (AE), a study at the UK arm of an international confectionery company pointed to difficulties that should be tackled to achieve measurement outcomes. Successful results in measuring the ‘specialness’ of confectionery in a laboratory environment were achieved in 2010. Nevertheless, difficulties of application related to local dependence were demonstrated. Later, using a study on the compliance of packaging, the many-facet Rasch model was adapted to take into account the requirements of the AE process. In 2013, an agreement of co-operation between Toyota-Boshoku Corporation and the University of Leeds allowed an application of the approach to the fabric of car seats, closing the loop between the contribution of academic knowledge and practical outcomes. However, the complexity of constructing scales for product experience is still a barrier that prevents companies from reaping the benefits of measurement.

UPDATE (posted March 2014)

Tim Croudace, York