Odometry estimation plays a key role in enabling autonomous navigation systems. While significant research effort has been devoted to monocular odometry estimation, sensor fusion techniques for Stereo Visual Odometry (SVO) have been relatively neglected because of their demanding computational requirements, which pose practical challenges. However, recent advancements in hardware, particularly the integration of CPUs with dedicated artificial intelligence units, have alleviated these concerns. This thesis explores the enhancement of autonomous robot navigation through the integration of attention mechanisms with stereo images, particularly in environments where GPS signals are unreliable or absent. The core of this study is the development of a novel sensor fusion model that uses one image to compute attention weights for the other image, and combines the result with inertial data to improve odometry estimates. A set of ablation experiments was conducted on the KITTI dataset with different architectures and sensor fusion strategies to find the best setup. The results demonstrate the effectiveness of our proposed methods, particularly the use of early fusion techniques and attention mechanisms, which significantly improve the accuracy of estimated navigation paths relative to the ground truth. Furthermore, we compared our Stereo Attention-based Visual Inertial Odometry model (SATVIO) to state-of-the-art approaches to assess its performance. Despite limitations that restricted extensive training, our findings suggest that, with further optimization and extended training, SATVIO could match or surpass current state-of-the-art approaches in visual inertial odometry.
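The cross-attention idea summarized above, where features from one stereo image serve as queries that weight the features of the other before fusion with inertial data, can be sketched as follows. This is a minimal illustration in PyTorch; the dimensions, module names, mean-pooling step, and fusion head are assumptions for the sketch, not the actual SATVIO architecture:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of stereo cross-attention fused with inertial features.

    Left-image features act as queries; right-image features supply the
    keys and values, so one image computes attention weights over the
    other. The attended visual features are then concatenated with an
    inertial (IMU) embedding, mimicking an early-fusion setup.
    """

    def __init__(self, dim: int = 256, imu_dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(dim + imu_dim, dim)

    def forward(self, left_feats, right_feats, imu_feats):
        # left_feats:  (B, N, dim) queries from the left image
        # right_feats: (B, N, dim) keys/values from the right image
        # imu_feats:   (B, imu_dim) pre-encoded inertial measurements
        attended, _ = self.attn(left_feats, right_feats, right_feats)
        pooled = attended.mean(dim=1)  # collapse tokens: (B, dim)
        fused = self.fuse(torch.cat([pooled, imu_feats], dim=-1))
        return fused  # (B, dim) joint visual-inertial feature

# Example with illustrative shapes (batch of 2, 100 feature tokens)
model = CrossAttentionFusion()
left = torch.randn(2, 100, 256)
right = torch.randn(2, 100, 256)
imu = torch.randn(2, 64)
out = model(left, right, imu)
print(tuple(out.shape))  # (2, 256)
```

In a full pipeline, the fused feature would feed a pose-regression head producing the relative translation and rotation between frames.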