The Problems of Gathering Images for Crop/Weed Scene Datasets

It’s obvious to a skilled Agronomist what is soil, crop, crop debris, weeds or stones… but to AI, not so much.

A skilled Agronomist (I used to be one!) can walk into a field and immediately identify the crop and the weed species. Having said that, there is published research in which expert Agronomists disagreed on the species 12% of the time, especially at early growth stages.

When we train an AI solution to look at a crop/weed scene, it starts out knowing nothing about what it is looking at. We have to train it, telling it that certain features belong to certain classes such as crop, weed or stone; this is how the AI solution learns to classify an image. Training typically involves showing the AI software many example images of everything we want it to classify (identify), together with accompanying labelled, or annotated, images that specify what is crop, weed, stone and so on.

So what’s the problem?

Well, to get the solution to work reliably in a complex crop/weed landscape, we need to show the AI software a wide range of situations, resulting in a large dataset of many hundreds of thousands of images. These images have to be collected from the Real World and are often of variable quality, and often proprietary, so not easily available to start-ups.

A start-up will therefore have to focus on a specific problem area to reduce the cost and time of image collection.

But this is only the beginning. The images then have to be annotated. Even now, many companies use ‘bounding boxes’, drawing a rectangle around a weed to identify it to the AI. This is clumsy at best and completely inaccurate at worst, because the rectangle also takes in whatever soil or crop sits next to the weed. What we really need is to allocate every pixel in an image to a separate class… a crop, a weed, a stone and so on. This is painstaking work and almost beyond what humans can achieve, certainly for datasets of any real size.
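To make the difference concrete, here is a minimal sketch in Python. The tiny 6x6 label grid, the class codes and the helper names (`bounding_box`, `box_purity`) are all made up for illustration; they are not taken from any real dataset or annotation tool. It draws the tightest rectangle around the weed pixels and measures how much of that rectangle is actually weed.

```python
# Hypothetical class codes for a toy scene (assumptions for illustration only).
SOIL, CROP, WEED = 0, 1, 2

# A 6x6 per-pixel label mask: a thin weed growing diagonally beside a crop row.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 1, 0],
    [0, 0, 2, 0, 1, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 0, 0],
]


def bounding_box(mask, cls):
    """Return (top, left, bottom, right), inclusive, of the tightest box around cls."""
    rows = [r for r, row in enumerate(mask) if cls in row]
    cols = [c for row in mask for c, value in enumerate(row) if value == cls]
    return min(rows), min(cols), max(rows), max(cols)


def box_purity(mask, cls):
    """Fraction of the pixels inside the bounding box that really belong to cls."""
    top, left, bottom, right = bounding_box(mask, cls)
    inside = [
        mask[r][c]
        for r in range(top, bottom + 1)
        for c in range(left, right + 1)
    ]
    return inside.count(cls) / len(inside)


print(bounding_box(mask, WEED))  # (1, 1, 4, 4)
print(box_purity(mask, WEED))    # 0.25
```

In this toy scene only a quarter of the pixels inside the weed's rectangle are weed; the rest are soil and three pixels of crop. A per-pixel mask carries none of that ambiguity, which is the case for annotating at pixel level.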

And what about overlapping leaves, crop debris, splashes on leaves, disease deforming the leaves, pests eating the leaves and… the list goes on.

That’s why we don’t have publicly available datasets today and why many companies are struggling even to produce their own.

A poor-quality dataset, poorly annotated, will result in an AI solution that performs poorly in the field, requiring several passes to identify everything. At worst, it will just keep failing and not be commercially acceptable.

After more than two years in development, we can change this.

With a synthetic image, we can have consistent lighting and camera angles. We can control the species that appear. We can annotate to account for EVERY pixel in an image, 100% pixel perfect. We can make AI solutions commercially acceptable from the start.
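The reason synthetic annotation can be pixel perfect is that the label is known at the moment each pixel is drawn. The following is a toy sketch of that idea only, not our actual rendering pipeline: the `make_scene` function, the class list and the colours are hypothetical, and real synthetic imagery is of course far richer than coloured discs on a grid.

```python
import random

# Hypothetical classes and flat RGB colours, chosen purely for illustration.
CLASSES = {0: "soil", 1: "crop", 2: "weed", 3: "stone"}
COLOURS = {0: (120, 85, 60), 1: (40, 160, 60), 2: (150, 200, 40), 3: (128, 128, 128)}


def make_scene(width, height, n_objects, seed):
    """Generate a toy image and its label mask together, from the same draw calls."""
    rng = random.Random(seed)  # seeded, so a scene can be reproduced exactly
    mask = [[0] * width for _ in range(height)]  # start with soil everywhere
    for _ in range(n_objects):
        cls = rng.choice([1, 2, 3])  # we decide which classes appear
        cx, cy = rng.randrange(width), rng.randrange(height)
        radius = rng.randint(1, 3)
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    mask[y][x] = cls  # later objects occlude earlier ones
    # The image is rendered from the mask, so every pixel has exactly one label.
    image = [[COLOURS[cls] for cls in row] for row in mask]
    return image, mask


image, mask = make_scene(32, 24, n_objects=10, seed=42)
labelled = sum(1 for row in mask for cls in row if cls in CLASSES)
print(labelled == 32 * 24)  # True: no pixel is left unlabelled
```

Because the image and the mask come out of the same step, overlaps are resolved by construction rather than by a human guessing which leaf is on top.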

Are synthetic images as good as Real World images? In many cases they are better, as far as the AI solution is concerned. And we can include those species that you may only see once in a blue moon, and might never be able to gather from the Real World.