Labels, Notes, Flags & Tags (Or: the importance of being annotated)

Generally speaking, the subject of Annotation is not one that is common at parties (or most places for that matter), but in certain parts of the technology world, it’s not only common but absolutely critical. Most of us will know that annotation can be defined as ‘the act of adding extra information/comments to data that are not part of the original content’, but less known is its importance across a huge range of technology fields – and yes, AI relies on it too.

Still here? Good.

Annotation serves various purposes, such as improving software quality and legibility, enhancing web content and functionality, and training AI models. Over time it has developed into a far more dynamic part of computer programming languages and is now a critical tool for coders trying to retain control of older software programs. In the world of Agritech it is an increasingly important part of the conversation around synthetic datasets (of which more later), active learning & semantics.

Annotation does come with its challenges, however. It’s not a task that can be done casually, and it requires the utmost concentration (or programming skill, if done via automation). Quality & consistency are essential, which is no mean feat when dealing with large & complex datasets; it can also be time-consuming & resource-intensive if good ol’ humans are involved. Ethics & privacy can also be a problem when dealing with sensitive information (e.g. medical or financial data).

Luckily crops don’t usually have medical records or bank accounts, and so the approach to annotating crop datasets is not fraught with potential lawsuits. It is however a crucial part of the process of identifying key differences when deploying imaging technologies. Synthetic datasets are becoming widespread and can be used to augment or replace real data, especially when the real data is scarce, expensive, or impractical to obtain (as is the case with much of the agricultural industry). Synthetic data can also be used to create more diverse and balanced datasets, as well as to test and validate how robust a model is and how well it generalizes. For example, synthetic data can be used to create realistic images of crops under different conditions, such as weather, pests, or diseases, which can then be used to train and evaluate computer vision models for crop detection and diagnosis. (At AgriSynth we take this some stages further and are able to go down to a ‘leaf-level’ diagnosis – but that’s for another blog).

Annotating all of this synthetic data needs to be done with a structured, consistent approach. The use of ‘active’ learning (a machine learning technique in which the model itself flags the examples most worth labelling) can not only reduce the amount and cost of annotation, it can also improve the accuracy and efficiency of the models. It also enables human-machine collaboration: for example, the most ambiguous or difficult images of plants can be selected for annotation and then used to refine and update the computer vision models for plant identification and classification.
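For the curious, here is a rough idea of what ‘selecting the most ambiguous images’ can look like in practice. This is only a minimal sketch of one common active learning strategy (uncertainty sampling by entropy), not a description of any particular production pipeline. The file names, the three classes and the probabilities are all made up for illustration; in reality they would come from a trained computer vision model.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution.

    Higher entropy means the model is less sure which class applies.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_ambiguous(predictions, k):
    """Return the names of the k images with the most uncertain predictions."""
    ranked = sorted(
        predictions,
        key=lambda name: prediction_entropy(predictions[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: probabilities for (healthy, pest damage, disease).
predictions = {
    "leaf_001.png": [0.98, 0.01, 0.01],  # model is confident
    "leaf_002.png": [0.40, 0.35, 0.25],  # model is unsure
    "leaf_003.png": [0.70, 0.20, 0.10],
    "leaf_004.png": [0.34, 0.33, 0.33],  # model is almost guessing
}

# These are the images a human annotator would be asked to look at first.
to_annotate = select_most_ambiguous(predictions, k=2)
print(to_annotate)
```

The point of ranking by uncertainty is that annotation effort goes to the images the model finds hardest, rather than to the ones it already handles well. Other scoring rules (for example, the gap between the top two probabilities) would slot into the same structure.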

Annotation may be the irritating stepchild of the data world but without it we wouldn’t be able to fully understand what our datasets are telling us – and the consequences of that would make for a depressing (if slightly more entertaining) story at a party.