Abstract: Image captioning, situated at the intersection of computer vision and natural language processing, seeks to generate captions that are linguistically fluent, accurate, and semantically rich.