Prof. Barry O’Sullivan
Speech title
Artificial Intelligence and Language Assessment: An ethics and accessibility perspective
Abstract
While test developers have long embraced technology in their development, delivery and scoring processes, the recent popularity of artificial intelligence, particularly in the form of large language models (LLMs), has brought with it exciting opportunities. However, the new technologies have also highlighted a number of significant challenges, some of which pre-dated the current technology cycle while others are genuinely new.
In this talk, I present an overview of a major initiative undertaken by the British Council to create interactive language speaking tasks to a set specification using LLM technology. The first of these tasks were made available to learners for formative purposes in April 2024. While the talk will present insights into the approach taken, the primary focus will be on ethical and accessibility issues that emerged during the process.
In dealing with the ethical side of the application of AI in assessment, we realised from the outset that we did not have sufficient expertise. With this in mind, we partnered with the Knowledge Lab at University College London, who worked with us to develop a practical (if complex) tool which was used to perform an ethics audit of all aspects of our approach and our aims. This proved invaluable in helping us devise an ethics-by-design approach. Within this approach, we also focused on ensuring the accessibility of our final product, taking the British Council’s Justice, Equality, Diversity and Inclusion (JEDI) policies and practices into account. Uniquely, we have two formal written policies: one is person-centred (e.g. special accommodations) and the other is test-centred (e.g. representation in test content).
The talk concludes by stressing the importance of the approach taken to the application of AI in language assessment, while also highlighting the dangers of ignoring or relegating the importance of ethics and accessibility within such test development and delivery contexts.
Biodata
Barry O’Sullivan (Director of British Council English Language Research) has published and presented widely on topics related to language testing. He advises internationally on learning system reform and policy, while leading the British Council’s research and development programme focusing on the application of artificial intelligence to language teaching, learning and assessment. Barry is the founding president of the UK Association for Language Testing and Assessment, visiting professor at the University of Reading, and special advisory professor at Shanghai Jiao Tong University. He was awarded fellowships by the UK Academy of Social Sciences (2016) and the Asian Association for Language Assessment (2017). In 2019 he was awarded an OBE by the UK government for his contribution to international language testing.
Prof. Micheline Chalhoub-Deville
Speech title
Enhancing Professionalism in Language Testing and Assessment: A focus on guiding theories, key documents, and established practices
Abstract
The conference aims to advance professionalism in language testing and assessment, a theme intertwined with the field’s core interests. Increasingly, language testing and assessment theories and practices are driven by the pursuit of professionalism and social responsibility. This pursuit necessitates an exploration of guiding documents in the field. In my talk, I explore the efficacy of existing ethical codes, professional standards, and validation frameworks to help cultivate a professional environment conducive to enhancing professional conduct and upholding strategies for developing and administering testing systems that are fair, accessible, and equitable.
Existing ethical codes, practice guidelines, and professional standards are indispensable, offering principles to maintain integrity and quality in language testing. However, there are issues to consider regarding such frameworks. We need to examine the extent to which such documents reflect cultural expectations surrounding quality assurance, fairness, washback, and social justice. Furthermore, the effectiveness of these documents needs to be evaluated in terms of their ability to address emerging challenges and technologies, as well as their enforceability.
Advancing professional and socially responsible practices requires proactive measures at both individual and organizational levels. By fostering principles such as transparency, culturally responsive practices, accountability through peer review, and continuous learning and professional development, the field can mitigate inherent biases and disparities in assessment practices. Collaboration among stakeholders further strengthens the collective commitment to fairness and equity. Moreover, transparent technical documentation and adherence to established standards can help facilitate scholarly inquiry into our testing practices and their implications.
Finally, professional bodies like the International Language Testing Association (ILTA) and the Asian Association for Language Assessment (AALA) can play indispensable roles in advocating for inclusive policies and supporting practitioners through training and resources. Their contributions are vital in promoting professionalism and fostering an environment of fairness and equity in language testing and assessment.
Biodata
Professor Micheline Chalhoub-Deville has published, presented, and consulted worldwide on topics such as computer adaptive tests, K-12 academic English language assessment, admissions language exams, and validation. Her scholarship has been recognized through awards such as the TOEFL Outstanding Young Scholar Award, the national Center for Applied Linguistics Charles A. Ferguson Award for Outstanding Scholarship, and the 2024 Samuel J. Messick Memorial Lecture Award.
She is Past President of the International Language Testing Association (ILTA). She is Founder and first President of the Mid-West Association of Language Testers (MwALT) and is a founding member of the British Council Assessment Advisory Board and the Duolingo English Test (DET) Technical Advisory Board. She is a former Chair of the TOEFL Committee of Examiners.
Dr. Xiaoming Xi
Speech title
Construct advances in relation to technology – Why are they stagnant in large-scale testing?
Abstract
From the time when technology was seen as a source of construct-irrelevant variance (Taylor et al., 1998) to the present, when generative AI tools are being advocated for integration into the construct (Voss et al., 2023; Xi, 2023), conceptions of language tests may have shifted beyond imagination within only about 25 years due to accelerated breakthroughs in technology. While there have been serious debates about the nature of language constructs and about what our next generation of language tests should look like, our large-scale testing practices have remained almost the same as they were decades ago. As language testing professionals, how do we work together to push the boundaries of language tests?
Language tests are expected to mimic real-world communication as closely as feasible. Nowadays, the widespread use of assistive and even generative AI tools by language users in real life has motivated new inquiries into the nature of language ability and has started to push language testers to revisit the nature of language competence and how our language testing practices can be transformed.
In this talk, I will first clarify the use of technology as a medium or aid of communication in real life vs. a tool to enable test delivery, as for a long time they may have been conflated. This distinction is important as we ponder over the exact role technology plays in the construct definition of language tests.
Expanding on Xi (2024), I then delineate four approaches to addressing the role of technology in defining language test constructs: Outright Rejection, Forced Acceptance, Cautious Acceptance, and Progressive Embracing. The first approach, Outright Rejection, treats technology as a potential source of construct irrelevance. A good example is using a paper-and-pencil format to test writing skills, driven by the conception that writing conventions such as spelling and handwriting skills are a core part of writing ability and that keyboarding skills are not pertinent to the measurement of writing. The second approach, Forced Acceptance, recognizes the role of technology to a limited extent by implicitly accepting basic computer literacy skills (e.g., reading on the computer screen and keyboarding skills) as part of the constructs, mostly motivated by the need to use computer-based delivery to scale up testing. This approach is the one most commonly adopted in large-scale testing, although test designers would still be concerned about the potential impact of technology skills on test performance. The Cautious Acceptance approach, which may accept controlled use by test takers of assistive tools such as spelling and grammar checkers and online dictionaries, has not yet made its way into large-scale testing. The final approach, Progressive Embracing, takes the most progressive stance and defines language constructs as communication skills fully integrated with computer and digital information literacy skills, such as the ability to effectively use the full set of editing tools and even generative AI tools to accomplish a communication task, and to use digital technologies to search for and identify needed information and to evaluate, organize, and synthesize it to fulfill a task. This final approach, although advocated by researchers, currently seems far-fetched for real-world applications.
What are some of the barriers to the conceptualization and operationalization of the last two approaches? I believe some conceptual challenges remain to be resolved before our long-established thinking about language constructs can shift. For example, the conventions of language, such as writing conventions or even handwriting skills, remain a core component of writing curricula around the world and of what we test in writing. Many mainstream language testers hold the view that the use of any assistive tools may “contaminate” or “complicate” the assessment of language skills. Generative tools, such as ChatGPT, have often been seen as cheating tools and banned in many classroom assignments and assessments. Differential exposure to technology and technological tools, and differential opportunities to use and practice with them, would also introduce potential fairness issues, a major hurdle to innovation in large-scale testing.
Providing learners and test takers with access to these tools would be an uncomfortable leap of faith for many language educators and testers, as the speed at which AI technology advances has far outpaced the evolution of language constructs and testing practices. Language classroom teachers have been more adaptive, taking the plunge in using these AI tools in classroom assessment. Contrary to the common practice of using innovations in large-scale testing as a catalyst for language education reforms, designers of formative assessment will perhaps lead the way in this digital era, preparing the ground for possible future changes in the practice of large-scale language tests.
Biodata
Xiaoming Xi is Director at the Hong Kong Examinations and Assessment Authority, leading the Assessment Technology and Research Division, the Education Assessment Services Division and the International and Professional Examinations Division. Previously she was Executive Director of New Product Development and Senior Director in R&D at ETS. Her research leadership has impacted global large-scale tests such as ETS’s TOEFL, TOEIC and higher education tests as well as Hong Kong’s tests for college admissions, teacher certification and students’ progress monitoring.
A strong contributor to the educational assessment community, Xiaoming has been a co-opted Council Member of the International Test Commission (ITC) since September 2023. She has also served on the Executive Board of the International Language Testing Association (ILTA), chaired various ILTA award committees, and currently chairs the ILTA By-Laws Committee.
Xiaoming has been on the Editorial Board of several leading assessment journals, and has won multiple awards, including the 2015 Top 25 Women in Higher Education and Beyond, the Sage/ILTA Best Book Award, the ILTA Best Language Testing Paper Award, the ETS Scientist Award, and the ETS Presidential Award.
Xiaoming has published widely in assessment theories and practices including validity, fairness, construct definition, assessment design, human and automated scoring, and AI technology. She has guest edited two AI-related special issues, “Automated Scoring and Feedback Systems” and “Advancing Language Assessment with AI and ML”, for Language Testing and Language Assessment Quarterly respectively, and has multiple patents in applications of AI.