Authors - Layan Sawalha, Jiamei Deng, Temitope Omotayo

Abstract - The accurate analysis of hybrid medical datasets consisting of textual reports and diagnostic images plays an important role in the early detection of breast cancer and in improving patient outcomes. This paper proposes a novel deep learning framework that combines Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer-2 (GPT-2), and Convolutional Neural Networks (CNNs) to address the challenges of multimodal breast cancer data. The framework pairs the strengths of BERT and GPT-2 in extracting rich contextual features from text with the ability of CNNs to capture complex patterns in diagnostic images. By integrating textual and visual features into unified latent representations, this fusion enables accurate classification of breast cancer, distinguishing malignant from benign cases using both text and imaging data. The proposed framework alleviates the multimodal fusion bottleneck and achieves outstanding results, with an accuracy of 1.00, markedly improving the precision of breast cancer diagnosis.
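
To make the fusion step concrete, below is a minimal sketch of one way such a BERT + GPT-2 + CNN fusion classifier could be assembled in PyTorch with Hugging Face transformers. The layer sizes, the small CNN image encoder, the late-concatenation fusion, and all names (MultimodalFusionClassifier, latent_dim) are illustrative assumptions for this abstract, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, GPT2Model

class MultimodalFusionClassifier(nn.Module):
    """Fuses BERT and GPT-2 text features with CNN image features
    into one latent vector for benign-vs-malignant classification."""

    def __init__(self, latent_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # 768-d hidden
        self.gpt2 = GPT2Model.from_pretrained("gpt2")               # 768-d hidden
        # Small stand-in CNN encoder for single-channel diagnostic images
        # (the paper does not specify the CNN architecture).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),                  # -> 64-d
        )
        # Project the concatenated features into a unified latent representation
        self.fusion = nn.Sequential(
            nn.Linear(768 + 768 + 64, latent_dim), nn.ReLU(), nn.Dropout(0.3),
        )
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, bert_ids, bert_mask, gpt2_ids, gpt2_mask, image):
        # BERT and GPT-2 use different tokenizers, so each gets its own ids.
        bert_feat = self.bert(input_ids=bert_ids,
                              attention_mask=bert_mask).pooler_output
        # GPT-2 has no pooled output; mean-pool token states over valid positions.
        states = self.gpt2(input_ids=gpt2_ids,
                           attention_mask=gpt2_mask).last_hidden_state
        mask = gpt2_mask.unsqueeze(-1).float()
        gpt2_feat = (states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        img_feat = self.cnn(image)  # e.g. a (B, 1, H, W) mammogram tensor
        latent = self.fusion(torch.cat([bert_feat, gpt2_feat, img_feat], dim=-1))
        return self.classifier(latent)  # logits: malignant vs. benign
```

Concatenation followed by a shared projection is only one way to form the unified latent representation the abstract describes; cross-modal attention between the text and image streams would be an equally plausible reading.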