The Performance of AI in China’s National Civil Service Exam (NCSE)
Abstract
This study explores the potential of large language models in the context of high-stakes examinations by evaluating the performance of ChatGPT-4 on the Chinese National Civil Service Exam (NCSE). With the rapid advancement of artificial intelligence, understanding how well AI performs on exams designed to test comprehensive cognitive skills has become increasingly relevant. Inspired by previous studies that assessed ChatGPT’s performance on the U.S. Bar Exam, we extend this line of evaluation to measure the model’s “intelligence quotient” through the NCSE, a more diverse and demanding benchmark than traditional IQ tests.
The NCSE comprises the Administrative Aptitude Test (AAT) and Argumentative Essay Writing (AEW), covering cognitive domains such as logical reasoning, reading comprehension, quantitative analysis, and memory-based recall. This structure allows for a holistic assessment of ChatGPT-4’s capabilities in language processing, logical reasoning, mathematical calculation, and data analysis. Results indicate that ChatGPT-4 demonstrates considerable strength in natural language comprehension and structured writing but notable limitations in visual recognition and logical reasoning, especially in tasks requiring abstract thought and multi-step problem solving.
Using the Fenbi grading platform for evaluation, ChatGPT-4’s scores were compared against those of average human test-takers, providing a reliable benchmark of the model’s performance relative to human standards. The study shows that ChatGPT-4 exceeds the human average in certain areas yet falls short of the highest human scores, underscoring the need for continued development of AI’s logical reasoning and response-regulation capabilities.
Our findings suggest that, with ongoing advances, AI models like ChatGPT-4 could serve as valuable tools in academic and professional assessment. The NCSE offers a robust framework for evaluating AI’s practical cognitive skills, marking an innovative step toward redefining intelligence metrics for AI in complex, real-world scenarios. This research contributes to the growing body of knowledge on AI assessment and lays a foundation for future applications and improvements in AI-driven evaluation systems.