Nearly every real-world deployment of machine learning models suffers from some form of shift between the data distribution the model was trained on and the data it encounters in production. This problem is particularly pronounced with streaming data or in dynamic settings (e.g. changes in data sources, behaviour and the environment), and it causes model performance to degrade during deployment. To account for these contextual changes, domain adaptation techniques aim to learn a model from a source data distribution that also performs well on a different but related target data distribution. In this paper we introduce a variational autoencoder-based multi-modal approach to domain adaptation that can be trained on a large amount of labelled data from the source domain coupled with a comparably small amount of labelled data from the target domain. We demonstrate our approach in the context of human activity recognition using various IoT sensing modalities and report superior results when benchmarking against the effective mSDA method for domain adaptation.