By Dave DeFusco
In digital education, one of the greatest challenges has been replicating the nuanced, adaptive nature of human teaching. Traditional e-learning systems, while effective in delivering static content, often lack the flexibility to provide real-time, contextually relevant explanations tailored to individual student needs.
In response, a team of researchers led by Dr. Youshan Zhang, assistant professor of artificial intelligence and computer science in the Graduate Computer Science and Engineering Department at the Katz School, presented the study “Automatic Teaching Platform on Vision Language Retrieval Augmented Generation (VL-RAG)” at the 2025 IEEE Integrated STEM Education Conference at Princeton University in March. The study introduces an AI-powered teaching system that integrates visual learning with dynamic, interactive content to enhance student comprehension and engagement.
“Automating teaching presents unique difficulties,” said Ruslan Gokhman, lead author of the study, a 2024 graduate of the M.S. in Artificial Intelligence and currently a Ph.D. student in Mathematics at the Katz School. “Unlike human instructors, AI-driven platforms often struggle to provide personalized, real-time feedback that adjusts to each student’s learning pace.”
This gap is especially pronounced in complex subjects like artificial intelligence, machine learning and deep learning, where abstract concepts require adaptive and multimodal explanations. Traditional e-learning tools rely heavily on text-based content, leaving visual learners at a disadvantage. Furthermore, current platforms lack the ability to seamlessly integrate different forms of media into a coherent, interactive learning experience.
The VL-RAG system, proposed by the researchers, aims to overcome these challenges by leveraging a combination of deep-learning retrieval mechanisms and visual question-answering (VQA) technologies. The system dynamically retrieves relevant visual and textual explanations based on students’ questions, making learning more interactive and intuitive. Unlike traditional automated teaching tools that rely on pre-programmed responses, VL-RAG generates contextually relevant answers by analyzing a database of tailored images and explanations. This approach not only enhances comprehension but also reduces the need for constant human intervention, enabling more scalable and flexible learning solutions.
A web-based interface, called the Automatic Teaching Platform Based on VL-RAG, allows students to interact with course content through both text and visual queries. Students can input questions related to their coursework, and the system retrieves corresponding images and explanations in real time. The platform is designed to support courses in machine learning, neural networks and deep learning—subjects where visualizing complex models and theories is crucial for understanding.
“The system’s ability to integrate both textual and visual data makes it especially valuable for STEM education,” said Jialu Li, a co-author and student in the M.S. in Artificial Intelligence. “For instance, a student struggling with the concept of convolutional neural networks can receive not just a written explanation but also annotated images demonstrating how different layers process input data. This dual-modality approach fosters deeper understanding and engagement.”
At the core of the VL-RAG platform is a sophisticated deep-learning retrieval mechanism that optimizes the way educational content is accessed and presented. The system builds on research into visual question-answering (VQA) models, which combine computer vision and natural language processing to interpret visual data and generate textual responses.
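The paper does not publish the platform's code, but the retrieval step it describes—matching a student's question against a database of paired explanations and slide images—can be illustrated with a minimal sketch. Everything below is hypothetical: the toy knowledge base, the random vectors standing in for real text/image embeddings, and the function names are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    # Cosine similarity between one query vector and each row of doc_vecs
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k entries whose embeddings best match the query."""
    scores = cosine_scores(query_vec, doc_vecs)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# Toy knowledge base: each entry pairs a textual explanation
# with the slide image that illustrates it (paths are made up).
docs = [
    {"text": "A convolution layer slides filters over the input.",
     "image": "slides/cnn_conv.png"},
    {"text": "Backpropagation computes gradients layer by layer.",
     "image": "slides/backprop.png"},
    {"text": "Pooling layers downsample feature maps.",
     "image": "slides/pooling.png"},
]

# Stand-ins for real embeddings produced by a vision-language encoder.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(3, 8))
# Simulate a question close in embedding space to the first entry.
query_vec = doc_vecs[0] + 0.01 * rng.normal(size=8)

results = retrieve(query_vec, doc_vecs, docs, k=1)
print(results[0]["image"])
```

In the actual system, the retrieved text and image would then be fed to a generation model (the "G" in RAG) to compose the answer shown to the student; this sketch covers only the retrieval half.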
To evaluate the effectiveness of their system, the researchers conducted extensive testing using the SparrowVQE dataset—a carefully curated collection of lecture slides, transcripts and question-answer pairs designed to enhance AI-driven educational models. The study compared multiple AI models, including LLaMA, T5 and BART-Large-CNN, finding that BART-Large-CNN consistently delivered the most precise and reliable outputs.
Beyond higher education, this technology could revolutionize corporate training, medical education and even K-12 learning environments. Imagine a medical student studying human anatomy who can ask the system to visually break down a complicated surgical procedure, or a high school physics student using VL-RAG to interact with 3D models of electromagnetic fields. The possibilities are vast and transformative.
“The implications of VL-RAG extend far beyond a single classroom,” said Dr. Zhang, who is senior author of the study. “The platform’s adaptability allows it to be expanded across various subjects, making it a valuable tool for educational institutions looking to enhance digital learning. By providing real-time, visually enriched explanations, VL-RAG has the potential to transform how students engage with complex material.”