Blog post 1: Cassava Leaf distribution

Emmanuel Peters
5 min read · Feb 22, 2021


By — Emmanuel Peters, Pei Pei Li, Yongming Han

Link to Notebook — https://www.kaggle.com/emmanuelpeters/notebookblogpost

Cassava is the third-largest provider of carbohydrates in the tropics after rice and maize, and thus forms a key food-security crop grown by smallholder farmers in Africa.

This blog series documents our progress in the Kaggle competition https://www.kaggle.com/c/cassava-leaf-disease-classification/overview as part of the course DATA2040 at Brown University.

The aim is to classify the disease afflicting a cassava plant from a picture taken of it. The dataset is a collection of labelled images taken during a regular survey in Uganda. These were taken by farmers and later categorized by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University.

We attach a notebook containing a minimum viable product that performs basic EDA on the dataset and runs a baseline model. The code is partly self-written and partly sourced from multiple Kaggle submissions (citations for those notebooks can be found in the comments).

EDA

Figure 1: Size of Dataset

We first print the number of images and see that the dataset is fairly large, with 23,197 images. Then we look at the possible class labels and the distribution of images across them.
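The counting step can be sketched as a small helper; `name_map` mirrors the competition's `label_num_to_disease_map.json`, whose keys are stringified label ids, and the usage lines with file paths are assumptions about the data layout rather than code from our notebook:

```python
import json
from collections import Counter

def class_distribution(labels, name_map):
    """Map numeric class ids to disease names and count images per class.

    `name_map` has stringified label ids as keys, as in the
    competition's label_num_to_disease_map.json.
    """
    return Counter(name_map[str(k)] for k in labels)

# Hypothetical usage against the competition files:
# with open("label_num_to_disease_map.json") as f:
#     name_map = json.load(f)
# dist = class_distribution(train_df["label"], name_map)
```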

Figure 2: Class labels
import matplotlib.pyplot as plt

# alternate black ('k') and cyan ('c') bars across the classes
my_colors = 'kckckc'
counts = train_df['disease_name'].value_counts()
plt.bar(x=counts.index, height=counts.values, color=my_colors)
plt.xticks(rotation=90)  # long class names read better vertically
plt.show()
Figure 3: Distribution of Classes.

As we can see, the dataset is imbalanced, so we might need to account for this in our model using oversampling or class weighting.
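One simple form of class weighting is inverse-frequency weights (the same "balanced" heuristic scikit-learn uses); a minimal sketch, not the weighting scheme we have committed to yet:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rarer classes contribute more to the loss."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}
```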

Let's look at a random batch of the training data for each class:

Figure 5: Class Label 0 — CBB

The characteristics for CBB are: angular spots, brown spots with yellow borders, yellowing leaves, wilting leaves.*

*All the characteristics are taken from the competition discussion forum.

Figure 6: Class label 1 — CBSD

The characteristics for CBSD are: yellow spots.

Figure 7: Class label 2 — CGM

The characteristics for CGM are: yellow patterns, irregular patches of yellow and green, distorted leaf margins, stunted growth.

Figure 8: Class label 3 — CMD

The characteristics we leverage for CMD are: severe shape distortion, mosaic patterns.

Figure 9: Class label 4 — Healthy

Now we will look at an implementation of a CNN and discuss its results, but before that it is a good idea to establish a baseline accuracy. The most naive model, which predicts the majority class for every image, would still achieve 61.49% accuracy (see Figure 10).

Figure 10: Class distribution proportions
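The majority-class baseline can be computed directly from the label counts; a minimal sketch:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class:
    simply the largest class proportion."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)
```

Since CMD makes up roughly 61.5% of the training labels, this baseline lands at the 61.49% quoted above.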

FastAI model

Let’s turn now to a more sophisticated baseline model: a slightly fine-tuned ResNet50, courtesy of Zach Mueller. The model was implemented using the FastAI library, a high-level deep-learning library that abstracts away low-level details that aren’t important for our task. The FastAI library comes with a set of data augmentations; the ones we used for this model were:

  • flipping, rotating, zooming, warping, and lighting transforms
  • picking a random scaled crop of an image and resizing it, so that the network is presented with objects at slightly different scales
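The second augmentation can be illustrated outside FastAI; this is a simplified NumPy sketch with nearest-neighbour resizing, not FastAI's actual implementation (the `scale` range is an assumed default):

```python
import numpy as np

def random_resized_crop(img, out_size, scale=(0.35, 1.0), rng=None):
    """Crop a random sub-region covering a `scale` fraction of the image
    area, then resize it to (out_size, out_size) with nearest-neighbour
    sampling."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    frac = np.sqrt(rng.uniform(*scale))          # side-length fraction
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    rows = (np.arange(out_size) * ch // out_size).astype(int)
    cols = (np.arange(out_size) * cw // out_size).astype(int)
    return crop[rows][:, cols]
```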

We also fine-tuned the network with the Ranger optimizer (https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d). Since the labels were noisy (that is, some of the labels were wrong), we tried to get the model to lower its confidence in its predictions by using label-smoothed cross-entropy as our loss function.
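Label smoothing moves a small amount ε of the target probability mass onto the other classes, so the loss never rewards fully confident predictions. A NumPy sketch of the per-example loss, illustrating the idea behind FastAI's label-smoothed cross-entropy rather than reproducing its code:

```python
import numpy as np

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution:
    (1 - eps) on the true class, eps spread uniformly over all C classes."""
    logits = np.asarray(logits, dtype=float)
    # numerically stable log-softmax
    log_p = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    c = len(logits)
    return -((1 - eps) * log_p[target] + eps / c * log_p.sum())
```

With uniform logits over two classes, every target distribution gives the same loss, log 2 — a useful sanity check.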

And finally, we follow Mueller in using a fine-tuning method that in turn riffs off FastAI’s implementation of Leslie Smith’s 1cycle policy.
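The shape of the 1cycle policy is easy to sketch: the learning rate warms up from a low value to lr_max over the first part of training, then anneals back down. The divisor, phase split, and annealing shape below are illustrative assumptions, not FastAI's exact defaults:

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, div=25.0):
    """Learning rate at `step`: linear warm-up from lr_max/div to lr_max
    over the first pct_start of training, then cosine annealing to ~0."""
    warm = int(total_steps * pct_start)
    if step < warm:
        t = step / warm
        return lr_max / div + (lr_max - lr_max / div) * t
    t = (step - warm) / (total_steps - warm)
    return lr_max * (1 + math.cos(math.pi * t)) / 2
```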

When training, because we didn’t have time, we trained the model for one frozen epoch (i.e., one epoch in which the weights of the pretrained layers of the network cannot change), and then just one unfrozen epoch. This resulted in a validation accuracy of 77.3%. Training the model for just two epochs, it’s worth emphasizing, already beat the majority-class baseline! (Mueller got 80.6% accuracy when he ran it for a couple more unfrozen epochs.)

Figure 11: FastAI model training
