Background
Virtual standardized patients (VSPs) have emerged as a novel tool in medical education and are widely used to enhance students' clinical interview skills. With the rapid development of generative artificial intelligence, VSP systems powered by large language models (LLMs) have become a new focus of research. However, few studies have systematically compared the performance of different LLMs in simulating patient roles.

Objective
This study aims to compare the applicability of two mainstream LLMs, ChatGPT-4o and DeepSeek, in VSP-based medical interview simulations, focusing on their differences in history-taking performance, linguistic naturalness, clue guidance, and educational support.

Methods
A quasi-experimental study was conducted with 60 fourth-year undergraduate clinical medicine students from one medical school. All participants had completed a diagnostics course and possessed basic interviewing skills. Students were assigned to either the ChatGPT-4o group or the DeepSeek group according to whether their student ID number was odd or even. Each participant conducted a text-based simulated interview with a VSP presenting with acute appendicitis, then submitted a preliminary diagnosis and completed a structured satisfaction questionnaire.

Results
ChatGPT-4o performed better in structured information integration, clue-based prompting, and system stability. In contrast, DeepSeek produced more natural language and showed greater emotional responsiveness, reflecting stronger humanistic communication traits. The two models thus displayed distinct strengths within the VSP framework, suggesting that system selection and integration should be tailored to specific teaching objectives.

Conclusions
Future research should expand the scope to include diverse disease scenarios, interaction modalities, and evaluation dimensions in order to comprehensively assess the educational utility and adaptability of LLM-driven VSP systems in medical training.