This study details the technical design and evaluation of a Multimodal Learning Analytics (MMLA) system intended to enhance spoken language acquisition in language café settings. Guided by the MMLA Model for Design and Analysis (MAMDA) framework, we outline the development of a prototype that integrates AI voice assistance with the collection and analysis of multimodal data, including audio and video. We detail the specific technologies and algorithms employed, such as the Arduino Nicla Vision board for participant tracking and deep learning techniques for audio analysis. Deployment of the prototype in real-world language café sessions highlights its potential to provide insight into learning patterns and interaction dynamics. We discuss the system's performance and limitations, paving the way for future refinements and broader applications in education.
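To make the tracking component concrete, the following is a minimal sketch of on-device person detection as it might run on the Nicla Vision under the OpenMV MicroPython firmware. The `tf` module, the built-in `person_detection` model, and the 0.7 confidence threshold are assumptions about a plausible toolchain, not the authors' published pipeline, and exact API names vary by firmware version.

```python
# Minimal sketch: on-device person detection on the Arduino Nicla Vision
# under OpenMV MicroPython firmware (assumed toolchain, not the paper's
# published pipeline). The 'person_detection' TFLite model ships with
# OpenMV firmware images.
import sensor
import time
import tf

sensor.reset()
sensor.set_pixformat(sensor.GRAYSCALE)   # the model expects grayscale input
sensor.set_framesize(sensor.QVGA)        # 320x240 capture
sensor.set_windowing((240, 240))         # square crop matching the model
sensor.skip_frames(time=2000)            # let auto-exposure settle

net = tf.load('person_detection')        # built-in TFLite model
labels = ['unsure', 'person', 'no_person']

clock = time.clock()
while True:
    clock.tick()
    img = sensor.snapshot()
    # classify() runs the model over the cropped frame and returns a list
    # of classification results, each with a bounding rect and class scores.
    for obj in net.classify(img):
        scores = obj.output()
        if scores[labels.index('person')] > 0.7:   # assumed threshold
            print('person detected at', obj.rect(), 'fps =', clock.fps())
```

In a deployed system, detections like these would presumably be timestamped and logged alongside the audio stream so that speaking turns can be aligned with who is present at the table.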