Image classification, i.e., teaching a model to predict the category of an image, is one of the most fundamental problems in computer vision. Image classification is built into almost every AI technology you come across these days. For example, that's how Facebook can suggest you "tag" a friend in a picture.
Frameworks like PyTorch and TensorFlow make it incredibly easy to train an image classification model. But an image classifier is much more than the network you train: several steps before and after training affect how your model behaves. To give you a concrete example, say you're training a model to classify dogs vs. cats, and two representative images from your dataset are:
*[Image: an upright photo of a dog and an upright photo of a cat from the training set]*
Now, you want to use this trained model to predict whether the following image is a cat or a dog:
*[Image: a rotated test photo]*
Contrary to what you might expect, chances are high that the model will get confused. The reason is simple: it has never seen a rotated image. We run into such problems very often in slightly more complicated tasks like text recognition, i.e., classifying a word letter by letter. Your model is not invariant to rotation because it has never seen rotated images during training, so it gets confused at run time. Such problems are rampant, and online tutorials that teach you to fine-tune ImageNet models or Inception v3 never go over how to build a classifier that addresses issues specific to your purpose.
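One common way to address this, sketched here as an illustration rather than as this tutorial's own pipeline, is data augmentation: randomly rotating training images so the model actually sees rotated inputs. Below is a minimal sketch using torchvision's standard `transforms` API; the dataset path, image size, and rotation range are hypothetical placeholders.

```python
# Minimal sketch: rotation augmentation with torchvision.
# The directory layout and parameter values below are illustrative assumptions.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transforms = T.Compose([
    T.RandomRotation(degrees=30),   # expose the model to rotated inputs
    T.RandomHorizontalFlip(),       # another common augmentation
    T.Resize((224, 224)),           # match the network's expected input size
    T.ToTensor(),
])

# ImageFolder expects one subfolder per class, e.g.
# data/train/cat/... and data/train/dog/...  (hypothetical path)
train_dataset = ImageFolder("data/train", transform=train_transforms)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

With an augmented loader like this, each epoch presents the network with differently rotated versions of the same images, which is what pushes the learned classifier toward rotation invariance.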
The goal of this tutorial is to break down everything that happens when training these models and to give you complete control over the pipeline, so that you can build not just a model, but a meaningful and useful pipeline for your specific purpose.