Vision Transformers match or outperform state-of-the-art CNNs on image recognition when pre-trained on large datasets, while requiring substantially less compute to train.
Transformers, best known for natural language tasks, can also perform well on image recognition without relying on convolutional networks. The Vision Transformer (ViT) applies a pure transformer directly to sequences of image patches and, when pre-trained on large amounts of data, achieves excellent results on image classification benchmarks such as ImageNet and CIFAR-100. It also requires substantially fewer computational resources to train than state-of-the-art convolutional networks.
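To make "a pure transformer on image patches" concrete, here is a minimal sketch of the patch-splitting step only: an image is cut into non-overlapping square patches, and each patch is flattened into a vector, giving the transformer a sequence to work on. The function name `image_to_patches` and the nested-list image representation are assumptions made for illustration, not code from the ViT authors; a full model would also project each patch vector linearly, add position embeddings, and pass the sequence through a standard Transformer encoder.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of N = (H // P) * (W // P) vectors, each of length P * P * C,
    in row-major patch order. Hypothetical helper, for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one P x P x C patch into a single vector.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 RGB image whose pixel values encode their (row, col) position.
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)
print(len(toy_patches), len(toy_patches[0]))  # 4 patches, each of length 2 * 2 * 3 = 12
```

With the same function, a 224 x 224 image and a patch size of 16 would yield a sequence of 14 x 14 = 196 patch vectors, which is why sequence length grows quickly as patches get smaller.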