Abstract: The fusion of multiple modalities, such as vision and language, has led to significant progress in grounding and tracking tasks. However, this success has not yet translated to aerial single ...