The Pandora’s box that is ML
Learn from the best with Tasia Lydia
ML-identified cashew
Geo Gecko has been on the machine learning journey for about 3 years now, morphing into Fieldy midway. It was one of those journeys that start with a single step in the right direction, followed by another and another, until you finally see where it can go. We started with machine learning while working on a project for IITA to identify bananas in western Uganda. The project involved using a desktop application, QGIS, to quantify bananas based on satellite imagery. It was a big eye-opener for us in terms of what we need from a satellite image to be able to use it to identify a given crop. It was also within this project that the first steps of our satellite imagery preprocessing methodology were developed. However, running the classification in a desktop application really limited how much we could influence the workings of the model. This was the largest shortcoming for us, and it is what led us to open up the Pandora’s box of data modelling.
Enter Python programming. To understand what was happening behind the models and influence the modelling process, we needed to build our own models. To do this, we doubled down on our Python programming skills and set off to find the machine learning packages and libraries available in Python. Our first encounter was with the famous scikit-learn, which was easy to understand and implement. The learning curve was not very steep since we already had a good understanding of Python and of how satellite imagery can be worked with. So we ventured into the available algorithms such as KNN, random forest, decision trees and the CatBoost classifier. Using this new approach, we recreated the banana classification using the CatBoost classifier and were able to get results that were slightly better than those from the initial desktop approach. Even so, we still felt that there was some way to go to get the best possible output.
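To give a feel for what this kind of pixel-based classification looks like in scikit-learn, here is a minimal sketch. It is not our actual pipeline: the band values are synthetic, the number of bands and the class means are made up for illustration, and a random forest stands in for whichever classifier you prefer. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Hypothetical training pixels: one row per pixel, one column per band.
# The values are synthetic; real ones would be read from the imagery.
n_per_class = 500
n_bands = 4
crop_pixels = rng.normal(loc=0.6, scale=0.05, size=(n_per_class, n_bands))
other_pixels = rng.normal(loc=0.3, scale=0.05, size=(n_per_class, n_bands))

X = np.vstack([crop_pixels, other_pixels])
y = np.concatenate([
    np.ones(n_per_class, dtype=int),   # 1 = crop
    np.zeros(n_per_class, dtype=int),  # 0 = everything else
])

# Hold out a quarter of the labelled pixels to check the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")

# Classify a small hypothetical image tile (rows x cols x bands) by
# flattening it to a pixel table and reshaping the predictions back.
tile = rng.normal(loc=0.45, scale=0.15, size=(8, 8, n_bands))
predicted = model.predict(tile.reshape(-1, n_bands)).reshape(8, 8)
print(predicted)
```

The key idea is the reshape at the end: the model only ever sees a table of pixels, so an image is flattened before prediction and folded back into a map afterwards.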
Being the company that we are, we strive to be at the forefront of what is happening in the technology world. As such, we had been hearing of neural networks as a newer and more robust way to carry out data modelling. So the only logical next step for us was to try them. We hunkered down and started the pursuit of neural networks for classification. This is when we were introduced to TensorFlow, a machine learning package that runs in Python, which we were already familiar with, so we gave it a go. The learning curve on this one was steeper than for scikit-learn since it entails actually telling the model how you want the modelling to be done. There were a lot of variables to consider and, as such, a lot to learn. However, this did not deter us, and a few months later we were able to start implementing this type of modelling.
By this time, we had realised how important a complete training dataset is, so we discarded the banana dataset because of some incompleteness issues and took on a new dataset for cashew provided by Radiant Earth. We believed that using the best possible training dataset we could get our hands on would help us conclude whether the model was actually good or bad, without being biased by the shortcomings of the training data.
So we started building the cashew model. It took us quite some time to understand many of the decisions that need to be made along the way in the modelling process, because being able to code for TensorFlow is one thing, but prepping data to be used in TensorFlow is a whole other beast. Initially, we were working with small areas of less than 300 sq km. We built the cashew model using data for 132 sq km and were able to get an accuracy of about 80%. This was a comfortable amount of data and, as such, we didn’t automate most of the processes involved in taking raw satellite images and making them ready to be ingested by the model.
At this point, we had started to talk with possible users of the model and realised that they would like it to be applied to larger areas of about 250,000 sq km. This posed a challenge in terms of how to optimise our processing methodology to ingest such huge amounts of data. But being who we are, it was a no-brainer that we would figure it out. So we set to work and found ways not only to ingest larger quantities of data but also to cut down on processing time and resources. This was a turning point for our cashew model: it became easier to make changes to the model and see the results in record time, as well as to try the model in new scenarios with new training data and get results in less than a month.
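To put that jump in scale into numbers, here is a rough back-of-the-envelope sketch. It assumes 10 m × 10 m pixels (the resolution at which Sentinel-2 data is presented to the current model) and counts pixels for a single band on a single date, ignoring clouds, overlaps and no-data areas. The function name is ours, purely for illustration.

```python
PIXEL_SIZE_M = 10        # assumed pixel edge length in metres
M2_PER_SQKM = 1_000_000  # square metres in one square kilometre


def pixels_for_area(area_sqkm: float, pixel_size_m: float = PIXEL_SIZE_M) -> int:
    """Approximate number of pixels needed to cover an area once."""
    pixel_area_m2 = pixel_size_m * pixel_size_m
    return round(area_sqkm * M2_PER_SQKM / pixel_area_m2)


small = pixels_for_area(132)      # area used for the first cashew model
large = pixels_for_area(250_000)  # area the possible users asked about

print(f"132 sq km     -> {small:,} pixels")
print(f"250,000 sq km -> {large:,} pixels")
print(f"Scale-up      -> about {large / small:,.0f} times more pixels")
```

Multiply those counts by the number of bands and image dates used, and it becomes clear why manual preprocessing stops being an option at this size.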
The current version of the model is based on Sentinel-2 satellite images, presented to the model at a resolution of 10 m × 10 m. The model was built in Benin and can be used within an area of 200,000 sq km around the area where it was trained. It is currently able to identify cashew with an accuracy of 84% and can easily be recreated for another area once training data for that new area has been made available.