For the ftagn of it

My first entry going over lesson one was basically a whirlwind tour of how to do image classification. Naturally, a lot of interesting content was left out. Today, I’m just poking at the classifier to see how a few changes affect the results.

First, I mentioned in the first post that the model was trained using ResNet-34. If you didn’t have a chance to look at Deep Residual Learning for Image Recognition, you might not know that ResNet comes in several depths, any of which can be used here. I’m going to give them a shot. You have an idea what the code should look like. Note that for my purposes, I’m using the dataset with nine classes.

In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.vision import *
from fastai.metrics import error_rate

bs = 64

# Leaving off most of the infrastructure for making and populating the folders for readability.

path = Path("data/fruit")
classes = ['apples', 'bananas', 'oranges', 'grapes', 'pears', 'pineapples', 'nectarines', 'kiwis', 'plantains']

data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                 ds_tfms=get_transforms(), size=224, num_workers=4).normalize(imagenet_stats)
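As an aside, the other depths from the paper are all available through fastai’s models module (which, as far as I know, re-exports the torchvision ResNet constructors), so trying a different one is just a matter of passing a different constructor to cnn_learner. A quick illustration, not from the original notebook:

from fastai.vision import models

# ResNet variants exposed by fastai (assuming fastai v1, which re-exports the
# torchvision constructors); the deeper ones simply stack more residual blocks.
resnet_variants = [models.resnet18, models.resnet34, models.resnet50,
                   models.resnet101, models.resnet152]

# e.g. learn = cnn_learner(data, models.resnet101, metrics=error_rate)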

Now for the first real difference: training with ResNet-50. If it weren’t apparent from the name, it uses 50 layers instead of 34. That extra depth theoretically gives it a chance to learn more nuance, but it also introduces the risk of overfitting. Here we go!

In [3]:
learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(4)
Total time: 01:20

epoch train_loss valid_loss error_rate time
0 1.540670 0.397189 0.143713 00:38
1 0.851569 0.252047 0.071856 00:13
2 0.590861 0.260350 0.077844 00:14
3 0.448579 0.258741 0.083832 00:13

Differences? Here’s what jumps out at me:

  • This network took about twice as long to train, which is understandable given that it has many more layers.
  • The per-epoch training time isn’t uniform: the first epoch took roughly three times as long as the rest.
    • Running the other notebook again revealed that the equivalent ResNet-34 timings weren’t uniform either; the even split I saw before was just a fluke. I’m still curious as to why it varies, though.
  • We see a substantial improvement in training loss (~0.65 vs. ~0.41), and slight improvements in validation loss (~0.37 vs. ~0.32) and error rate (~0.12 vs. ~0.1).

Okay, not bad. On closer inspection, the lesson’s ResNet-50 example uses a resolution of 299 instead of 224 for its input images. Additionally, it halves the batch size and uses the default number of workers (I’m operating under the hypothesis that both changes are for memory reasons). How do those compare?

In [4]:
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2,
                                 ds_tfms=get_transforms(), size=299, bs=bs//2).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, metrics=error_rate)
learn.fit_one_cycle(4)
Total time: 01:51

epoch train_loss valid_loss error_rate time
0 1.252562 0.303597 0.089820 00:42
1 0.704307 0.329760 0.101796 00:23
2 0.469999 0.270663 0.125749 00:23
3 0.361943 0.265046 0.119760 00:23

With the higher-resolution images, we still see improvement in the training loss, but also some fluctuation in the validation loss and error rate. Granted, my source ran this for eight cycles and saw a similar fluctuation in the initial four. Easily remedied.

In [5]:
learn.fit_one_cycle(4)
Total time: 01:35

epoch train_loss valid_loss error_rate time
0 0.120617 0.297528 0.107784 00:23
1 0.134901 0.270895 0.077844 00:23
2 0.124303 0.296516 0.101796 00:23
3 0.117052 0.297157 0.113772 00:23

I’m seeing training loss of ~0.12, validation loss of ~0.30, and an error rate of ~0.11. What does all of this mean?

Relationship of Training Loss, Validation Loss, and Error Rate

All of this is very abstract, so I decided to do some searching. Andrej Karpathy has a GitHub page that, among other things, includes tips for interpreting these values. Using it as a reference: because our training loss is much smaller than our validation loss, this model is probably overfitting.

In fact, inferring from this advice, we actually want our training loss to be higher than our validation loss. From the looks of it, this learner has been overfitting since epoch 3.
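Rather than eyeballing the printed tables, the loss histories can also be pulled off the learner directly. A minimal sketch, assuming the fastai v1 Recorder API:

# Compare the latest smoothed training loss to the latest validation loss
# (Recorder.losses is recorded per batch, Recorder.val_losses per epoch).
train_loss = float(learn.recorder.losses[-1])
valid_loss = float(learn.recorder.val_losses[-1])
print(f"train {train_loss:.3f} vs. valid {valid_loss:.3f}")

# Plot both curves over the whole run to see where the gap opens up.
learn.recorder.plot_losses()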

Why?

Simply put, loss is a measure of how wrong a prediction is; a loss of 0.0 would mean that every prediction was correct (and fully confident). Going from Karpathy’s post above, why would we want our training loss to be larger than our validation loss? In that situation, it would mean that our training was stringent enough, and our model confident enough, that it performs better than expected when being validated.
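To make that concrete, here is a tiny, hypothetical cross-entropy example (plain PyTorch, not from the lesson): a confident correct prediction yields a loss near zero, while a confident wrong one is heavily penalized.

import torch
import torch.nn.functional as F

target = torch.tensor([0])                         # the true class is class 0
confident_right = torch.tensor([[8.0, 0.0, 0.0]])  # strongly predicts class 0
confident_wrong = torch.tensor([[0.0, 8.0, 0.0]])  # strongly predicts class 1

print(F.cross_entropy(confident_right, target))    # ~0.0007 -- nearly perfect
print(F.cross_entropy(confident_wrong, target))    # ~8.0 -- badly wrong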