How To Train Your Algorithm
The Algorithms are coming!
Low-cost deployable crop sensing technology is enabling unprecedented levels of data collection. Various industry players are developing both proprietary and commercially available cultivation management systems that leverage machine learning and artificial intelligence. It's worth considering ML and AI adoption as a continuum, much like autonomous driving, where the machines gradually take over management: first informing, then recommending, then ultimately taking full control. Data is required not only for model development and validation, but also for contextual training upon deployment. Getting ready for AI starts now. Innovative operators are asking themselves, ‘How good is my data?’
Good data is the bedrock upon which ML algorithms stand. Bad data will be useless at best, and catastrophic at worst. Good integrators will validate all sensors, but calibrations can drift, EM radiation can bias readings, and variability exists in any operation. Data validation is essential; it should be performed across the full range of operating conditions and maintained periodically. Don’t believe me? Go test the influence of radiant bias on your temperature and humidity readings, or check when your pH sensors were last replaced. When merely monitoring, these biases may not be particularly problematic, but once you build control loops on these data points, they certainly will be.
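One simple way to catch calibration drift is to periodically compare a field sensor against a trusted reference instrument. The sketch below is illustrative only; the function names and the 0.5 °C tolerance are assumptions, not part of any particular vendor's toolkit.

```python
# Hypothetical drift check: compare a field sensor to a reference
# instrument and flag it for recalibration if the mean bias is too large.

def mean_bias(sensor_readings, reference_readings):
    """Average difference between a field sensor and a co-located reference."""
    diffs = [s - r for s, r in zip(sensor_readings, reference_readings)]
    return sum(diffs) / len(diffs)

def needs_recalibration(sensor_readings, reference_readings, tolerance=0.5):
    """True if the mean bias exceeds the acceptable tolerance (e.g. 0.5 deg C)."""
    return abs(mean_bias(sensor_readings, reference_readings)) > tolerance
```

Run the same check at several operating points (e.g. day and night setpoints), since a sensor can read true in one range and drift in another.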
Key Considerations for Building Your Dataset
Data Quality
Signal-to-Noise Ratio: Ensuring a high signal-to-noise ratio is critical for data integrity. Employ data pre-processing techniques like smoothing and outlier detection to minimize noise. For the green thumbs reading, this is where you can play an important role. Annotate your data for every batch. Make it a practice to develop a log of significant events, and code them for easy analysis. Define your thresholds for invoking those annotations objectively and stick with that a priori standard. The goal is to capture external events that influenced the system, but were not the product of the system, as labeling or removing these records can improve training efficiency. Edge cases are not what you want to be training on, even though sometimes the greatest insights are born of accidents.
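The smoothing and outlier-detection pre-processing mentioned above can be as simple as a rolling median plus a standard-deviation flag. This is a minimal sketch using only the Python standard library; the window size and threshold are assumptions you would tune to your own sensor traces.

```python
import statistics

def rolling_median(values, window=5):
    """Smooth a sensor trace with a centered rolling median (spike-resistant)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed

def flag_outliers(values, threshold=2.0):
    """Flag points more than `threshold` population std devs from the mean."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [abs(v - mu) > threshold * sd for v in values]
```

Flagged records are candidates for annotation against your event log: a spike that coincides with a logged door-open event can be labeled or excluded rather than trained on.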
Data Granularity: The granularity of your data should be in sync with the intricacies of the phenomenon you are modeling. Choose a level that brings out the most useful patterns. For example, if you are measuring something at a room level, such as gaseous composition or a composite leachate sample, fine resolution is likely not helpful; but if you are measuring minute details at a local level, such as with a stem-and-leaf psychrometer, the value is in detecting minute variation. For biofeedback applications like early detection of disease, sensitivity can make all the difference, but for routine environmental monitoring, standard commercial sensing is sufficient.
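The room-level versus local-level distinction can be made concrete: a composite value deliberately discards the per-probe variation, while local analysis keeps it. A small sketch (function names are illustrative, not from any particular platform):

```python
def room_composite(local_readings):
    """Collapse many local probe readings into one room-level value.
    Appropriate when the phenomenon is genuinely room-scale (e.g. CO2)."""
    return sum(local_readings) / len(local_readings)

def local_deviations(local_readings):
    """Per-probe deviation from the room mean -- the minute local
    variation that matters for fine-grained sensing, and that the
    composite value throws away."""
    mean = room_composite(local_readings)
    return [v - mean for v in local_readings]
```

If the deviations are what carry the signal (a dry corner, a hot bench), store the local stream; if they are just noise around a room-scale phenomenon, store the composite and save yourself the data volume.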
Data Quantity
Volume: While more data is generally advantageous for training an ML model, it’s essential to focus on collecting data that is actually relevant and offers insights. Modern precision indoor farms generate enormous amounts of information, and the computational requirements for training increase rapidly with the scale of the dataset. Thus, it's important not just to integrate all your relevant data into one comprehensive database; it's just as important to exclude those data streams that aren’t related to the factors of production or crop response, or that contain confounders.
Sampling Frequency: Sampling frequency must be chosen carefully. Too frequent, and you may capture noise; too infrequent, and you may miss important variations. Most events in a grow occur slowly enough that 1-5 minute intervals give good resolution without becoming burdensome from a data management perspective, but it's possible greater insights are contained at higher sampling rates. For example, high-frequency sampling is crucial in PID tuning or for commissioning other mechanical systems, but is likely of less value for crop and environmental monitoring, or for measurements meant to generalize a population with a known variance.
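Downsampling a raw sensor feed to those 1-5 minute intervals can be done by bucketing timestamps and averaging each bucket. A minimal sketch (in practice a time-series database or pandas `resample` would do this job):

```python
from collections import defaultdict

def downsample(samples, interval_s=300):
    """Average (unix_timestamp, value) samples into fixed intervals.

    Default interval is 300 s (5 minutes); each bucket is keyed by
    its interval start time.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval_s].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}
```

Note that averaging is only appropriate for slowly varying signals; for PID tuning or commissioning work, keep the raw high-frequency stream, since the transients are exactly what you are studying.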
Data Diversity
Your dataset should cover all practical factors of production and biofeedback response. This list includes measurements of the rootzone (pH, EC, temperature, water saturation, delivery volume, leachate volume), the environment (light level, temperature, humidity, gaseous composition), and a range of biotic factors. While the abiotic factors are quite simple to cover, the key to achieving a comprehensive dataset is finding ways to automate the continuous collection of biofeedback. While manual crop registrations are appropriate for the cycle-to-cycle optimization growers conduct as a matter of course, autonomous cultivation management systems cannot be dependent on human input. Advanced vision systems are capable of many things, from crop nutrition analysis, to leaf area indexing, to crop labour scheduling, pest scouting, and more. However, more accessible deployments can generate useful data streams, such as ultrasonic sensing for crop stature and canopy density, leaf temperatures to derive transpiration models, and gravimetric plus capacitance sensing in root zones to derive the water status of the crop. Creative thinkers will no doubt recognize there are more than a handful of ways to generate proxies for key crop performance metrics. The data need not directly measure the mechanism in question, and various proxies may be suitable. However, it's best not to rely on a single proxy, and the more distantly a proxy is related to the phenomenon it's meant to represent, the more appropriate it is to measure that phenomenon from different perspectives.
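As one concrete example of a derived proxy: leaf temperature combined with air temperature and humidity yields the leaf-to-air vapour pressure deficit, a widely used proxy for transpiration demand. The sketch below uses the standard Tetens approximation for saturation vapour pressure; the function names are illustrative.

```python
import math

def saturation_vp(temp_c):
    """Saturation vapour pressure in kPa (Tetens approximation)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))

def leaf_vpd(leaf_temp_c, air_temp_c, rh_percent):
    """Leaf-to-air vapour pressure deficit (kPa), a common proxy
    for transpiration demand on the crop."""
    air_vp = saturation_vp(air_temp_c) * rh_percent / 100.0
    return saturation_vp(leaf_temp_c) - air_vp
```

Pairing this derived stream with a second, independent proxy (such as gravimetric substrate weight loss) gives the multiple-perspective coverage described above.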
Target Variable: The end goal, or target variable, should be clearly defined. Whether you aim to optimize yield, resource efficiency, or another metric, make sure it's accurately measured and labeled. The perspective on what to optimize for, and with what weight, doesn’t come from the operational team; it comes from the Finance department. Although it's common for Operations teams to have KPIs, these are incredibly rudimentary endpoints for the optimization contemplated by AI cultivation management. If the optimization parameters provided to the algorithm include the detailed costing of all inputs, and a matrix of economic outcomes for various levels of yield, quality, and performance, a more profitable optimization can be expected.
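A profit-based target variable might look like the sketch below: graded revenue minus summed input costs, rather than a raw yield KPI. The grade names, prices, and cost categories are illustrative assumptions, not real figures.

```python
# Hypothetical profit target: revenue by quality grade minus input costs.
# Prices and cost categories below are placeholder assumptions.

PRICE_PER_KG = {"A": 10.0, "B": 6.0, "C": 2.0}  # revenue by quality grade

def batch_profit(yield_kg_by_grade, input_costs):
    """Profit for one batch: graded revenue minus summed input costs."""
    revenue = sum(PRICE_PER_KG[grade] * kg
                  for grade, kg in yield_kg_by_grade.items())
    return revenue - sum(input_costs.values())
```

An algorithm optimizing this target can rationally trade a little yield for a better grade mix or lower energy spend, which a yield-only KPI can never express.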
Conclusion
Creating a robust dataset is both an art and a science, especially in a complex field like precision indoor farming. By attending to aspects like data quality, quantity, diversity, relevance, and integrity, you're setting the stage for the development of ML models that can deliver tangible benefits in optimizing your farming operations.
If you have any questions, insights, or would like to share your own experiences in developing data-driven solutions for indoor agriculture, feel free to reach out. Here's to harnessing the power of data for a more efficient and sustainable future.