AHG-YOLO: multi-category detection for occluded pear fruits in complex orchard scenes
1 Introduction
The pear is a nutrient-rich fruit with high economic and nutritional value, widely cultivated around the world (Seo et al., ). China has been the largest producer and consumer of fruits globally, with orchard area and production continuously increasing (Zhang et al., ). Fruit harvesting has become one of the most time-consuming and labor-intensive processes in fruit production (Vrochidou et al., ). In the complex orchard environment, accurate fruit detection is essential for achieving orchard automation and intelligent management (Bharad and Khanpara, ; Chen et al., ). Currently, pear harvesting mainly relies on manual labor, which is inefficient. Additionally, with the aging population and labor shortages, the cost of manual harvesting is rising, making the automation of pear fruit harvesting an urgent problem to address. In recent years, researchers have been focusing on mechanized and intelligent fruit harvesting technologies (Parsa et al., ). However, in the unstructured environment of orchards, fruits are often occluded by branches and leaves, and their growth orientations vary, which affects the accuracy of detection and localization, posing significant challenges to automated fruit harvesting (Tang et al., ).
Traditional image processing methods for detecting fruit targets require manually designed features, such as color features, shape features, and texture features (Liu and Liu, ). These methods then combine machine learning algorithms with the manually designed features to detect fruits, but detection accuracy can be easily affected by subjective human factors, and detection efficiency is low (Dhanya et al., ).
In recent years, with the development of graphics processing units (GPUs) and deep learning technologies, significant progress has been made in the field of object detection. Algorithms such as Faster R-CNN (Ren et al., ) and SSD (Liu et al., ) have demonstrated excellent performance in general tasks. However, these methods still face challenges in real-time processing or small object detection. YOLO-series models such as YOLOv5 (Horvat et al., ), YOLOv6 (Li et al., ), YOLOv7 (Wang et al., ), YOLOv8 (Sohan et al., ), YOLOv9 (Wang et al., ), YOLOv10 (Alif and Hussain, ), and YOLOv11 (Khanam and Hussain, ) have shown improvements in both speed and accuracy, leading many researchers to utilize YOLO algorithms for fruit detection research. Liu et al. () proposed a new lightweight apple detection algorithm called Faster-YOLO-AP based on YOLOv8. The results showed that Faster-YOLO-AP reduced its parameters and FLOPs to 0.66 M and 2.29 G, respectively, with an mAP@0.5:0.95 of 84.12%. Zhu et al. () introduced an improved lightweight YOLO model (YOLO-LM) based on YOLOv7-tiny for detecting the maturity of tea oil fruits. The precision, recall, mAP@0.5, parameters, FLOPs, and model size were 93.96%, 93.32%, 93.18%, 10.17 million, 19.46 G, and 19.82 MB, respectively. Wei et al. () proposed a lightweight tomato maturity detection model named GFS-YOLOv11, which improved precision, recall, mAP@0.5, and mAP@0.5:0.95 by 5.8%, 4.9%, 6.2%, and 5.5%, respectively. Tang et al. () addressed the issue of low detection accuracy and limited generalization capabilities for large non-green mature citrus fruits under different ripeness levels and varieties, proposing a lightweight real-time detection model for unstructured environments, YOLOC-tiny. Sun et al. () focused on efficient pear fruit detection in complex orchard environments and proposed an effective YOLOv5-based model, YOLO-P, for fast and accurate pear fruit detection. However, in complex, unstructured orchard environments, factors such as varying lighting conditions, occlusions, and fruit overlaps still affect recognition accuracy and generalization capabilities. Additionally, existing models often suffer from high computational complexity and excessive parameters, making them difficult to deploy on resource-constrained mobile or embedded devices. To address these challenges, researchers have been committed to designing high-precision, fast detection models that meet the requirements for real-time harvesting.
Current research on pear fruit detection has made some progress. Ren et al. () proposed the YOLO-GEW network based on YOLOv8 for detecting “Yulu Xiang” pear fruits in unstructured environments, achieving a 5.38% improvement in AP. Zhao et al. () developed a high-order deformation-aware multi-object search network (HDMNet) based on YOLOv8 for pear fruit detection, with a detection accuracy of 93.6% in mAP@0.5 and 70.2% in mAP@0.75. Lu et al. () introduced the ODL Net algorithm for detecting small green pear fruits, achieving detection accuracies of 56.2% and 65.1% before and after fruit thinning, respectively. Shi et al. () proposed an improved model, YOLOv9s-Pear, based on the lightweight YOLOv9s model, enhancing the accuracy and efficiency of red-skinned young pear recognition. The model achieved precision, recall, and AP rates of 97.1%, 97%, and 99.1%, respectively. The aforementioned studies primarily focus on single pear fruit detection during maturity or young fruit stages. However, in practical harvesting scenarios, considerations such as robotic arm picking strategies and path planning are also crucial (Wang et al., ). The picking strategy and path planning of robotic arms are closely related to the fruit’s growth position. Detailed classification of fruit location information enables harvesting robots to adapt flexibly to varying environmental conditions, dynamically adjusting path planning and grasping strategies to ensure efficient and precise harvesting operations. This enhances the system’s flexibility and robustness in complex scenarios (Nan et al., ).
Based on the aforementioned background, this paper proposes a lightweight intelligent pear orchard fruit detection method, AHG-YOLO, using YOLOv11n as the base model. First, the traditional sampling method in the YOLOv11n backbone and neck networks is replaced with ADown to reduce computational complexity while improving detection accuracy. Next, a new detection head structure is developed using the “shared” concept and group convolution to further lighten the model without compromising detection performance. Finally, the CIoU loss function in YOLOv11n is replaced with GIoU to enhance the model’s accuracy and fitting capability. The improved model not only maintains high recognition accuracy but also reduces the model size and computational cost, making it easier to deploy on mobile devices. This provides technical support for optimizing robotic picking paths and meets the demands of intelligent harvesting in pear orchards.
2 Material and methodology
2.1 Image collection
The Hongxiangsu pear, known as the “king of all fruits,” is a hybrid descendant of the Korla fragrant pear and the goose pear, and is a late-maturing, storage-resistant red-skinned pear variety. The fruit is spindle-shaped, weighing an average of 220 grams, with a maximum weight of 500 grams. The fruit surface is smooth and clean, with a bright red color. The flesh is white, fine-grained, sweet, and aromatic, with medicinal properties such as clearing heat, moisturizing the lungs, relieving cough, quenching thirst, and aiding in alcohol detoxification. It also has health benefits for conditions such as hypertension, high cholesterol, and arteriosclerosis. This study focuses on the Hongxiangsu pear, and data was collected from the Modern Agricultural Industry Technology System Demonstration Base of the Fruit Tree Research Institute at Shanxi Agricultural University, located in Taigu District, Jinzhong City, Shanxi Province (112°32’E, 37°23’N). Considering that the harvesting robotic arm needs to adapt to the complex environment of the orchard during the harvesting process, pear images were captured from various angles, distances, and time periods using a Vivo Z3i smartphone. A total of 1,000 pear images were collected, and unusable images were filtered out, leaving 734 usable images. The complex orchard environment includes scenarios such as single fruit, multiple fruits, cloudy weather, overlapping fruits, and branches and leaves obstructing the view. Some sample images are shown in Figure 1.
Figure 1
2.2 Data augmentation
To improve the robustness and generalization ability of the pear object detection model, image sample data needs to be augmented. In this study, various augmentation techniques, including adding salt-and-pepper noise, image sharpening, affine transformation, and brightness adjustment, are randomly combined to enhance the images. After data augmentation, the total number of pear samples is . The dataset is split into training set ( images), validation set (293 images), and test set (588 images) with a ratio of 7:1:2. Some of the augmented data samples are shown in Figure 2.
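The four augmentation operations can be composed with a random-combination scheme along the lines of the sketch below; the probabilities and parameter ranges are illustrative assumptions, not the values used in the study.

```python
import random
import numpy as np
import cv2

def salt_pepper(img, amount=0.02):
    """Flip a small fraction of pixels to black or white (salt-and-pepper noise)."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0
    out[mask > 1 - amount / 2] = 255
    return out

def sharpen(img):
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(img, -1, kernel)

def affine(img, max_shift=0.05):
    # Small random translation; in practice the YOLO box labels must be shifted as well.
    h, w = img.shape[:2]
    dx, dy = (random.uniform(-max_shift, max_shift) * s for s in (w, h))
    M = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(img, M, (w, h))

def brightness(img, low=0.7, high=1.3):
    return cv2.convertScaleAbs(img, alpha=random.uniform(low, high), beta=0)

def augment(img):
    """Apply a random subset of the four augmentations, in random order."""
    ops = [salt_pepper, sharpen, affine, brightness]
    for op in random.sample(ops, k=random.randint(1, len(ops))):
        img = op(img)
    return img
```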
Figure 2
2.3 Dataset construction
In natural environments, pear fruits are often obstructed by leaves or branches, and fruits can occlude each other, posing significant challenges for robotic harvesting. To improve harvesting efficiency, the harvesting robot can adopt different strategies when encountering pears in various scenarios during the harvesting process. For example, for an unobstructed target, path planning is relatively simple, and conventional path planning and grabbing tasks can be directly applied. When the target is partially occluded, path planning needs to consider how to navigate around the obstruction or adjust the grabbing angle. In environments with dense fruits, where occlusion and overlap of multiple fruits are concerns, multi-object path planning algorithms can be used to devise the optimal path (Gao, ; Yang et al., ). Therefore, based on the growth loci characteristics, the fruits are systematically categorized into three distinct classes in this study. The schematic of the three categories of pears is shown in Figure 3. The first class represents fruits that are not obstructed (referred to as NO). The second class represents fruits that are occluded by branches or leaves (referred to as OBL). The third class represents fruits that are in close contact with other fruits but are not occluded by branches or leaves (referred to as FCC). This classification standard is based on the classification criteria proposed by Nan et al. () for pitaya fruits.
Figure 3
The pear fruits in the images were annotated using rectangular bounding boxes in LabelImg (Tzutalin, ) software, categorized into three classes (NO, OBL, and FCC) according to the predefined classification criteria. The annotations were formatted in YOLO style and ultimately saved as .txt files. Upon completion of the annotation process, the distribution of different categories across the final training set, validation set, and test set is shown in Table 1.
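Each image then has a .txt label file in YOLO format, one object per line, with the box centre and size normalized by the image dimensions. The class-index mapping shown below (0 = NO, 1 = OBL, 2 = FCC) is an assumption for illustration only.

```
# class_id  x_center  y_center  width  height   (all coordinates normalized to [0, 1])
0 0.512 0.634 0.118 0.142
1 0.274 0.401 0.095 0.121
2 0.733 0.528 0.101 0.130
```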
Table 1
2.4 AHG-YOLO
The YOLOv11 network introduces two innovative modules, C3k2 and C2PSA, as shown in Figure 4, which further enhance the network's accuracy and speed. However, in unstructured environments such as orchards, when fruits are severely occluded, overlapping, or when the fruit targets are small, the YOLOv11 network is prone to missing or misdetecting targets. To enhance the accuracy and robustness of pear detection algorithms in unstructured environments, this paper improves the YOLOv11n model. The architecture of the improved model is shown in Figure 5. First, in both the backbone and head networks, the downsampling method is replaced with ADown (Wang et al., ), enabling the model to capture image features at higher levels, enhancing the feature extraction capability of the network and reducing computational complexity. Then, a lightweight detection head, Detect_Efficient, is designed, which further reduces the computational load by sharing the detection head and incorporating group convolution, while improving the network's feature extraction capacity. Finally, the CIoU loss function of YOLOv11 is replaced with GIoU (Jiang et al., ), which reduces the impact of low-quality samples and accelerates the convergence of the network model. The proposed improvements are named AHG-YOLO, derived from the first letters of the three improvement methods: ADown, Head, and GIoU. The AHG-YOLO model effectively improves pear detection performance and better adapts to the detection needs of small targets, occlusion, and fruit overlap in the complex natural environment of pear orchards.
Figure 4
Figure 5
2.4.1 ADown
The ADown module in YOLOv9 is a convolutional block for downsampling in object detection tasks. As an innovative feature in YOLOv9, it provides an effective downsampling solution for real-time object detection models, combining lightweight design and flexibility. In deep learning models, downsampling is a common technique used to reduce the spatial dimensions of feature maps, enabling the model to capture image features at higher levels while reducing computational load. The ADown module is specifically designed to perform this operation efficiently with minimal impact on performance.
The main features of the ADown module are as follows: (1) Lightweight design: The ADown module reduces the number of parameters, which lowers the model’s complexity and enhances operational efficiency, especially in resource-constrained environments. (2) Information preservation: Although ADown reduces the spatial resolution of feature maps, its design ensures that as much image information as possible is retained, allowing the model to perform more accurate object detection. (3) Learnable capabilities: The ADown module is designed to be learnable, meaning it can be adjusted according to different data scenarios to optimize performance. (4) Improved accuracy: Some studies suggest that using the ADown module not only reduces the model size but also improves object detection accuracy. (5) Flexibility: The ADown module can be integrated into both the backbone and head of YOLOv9, offering various configuration options to suit different enhancement strategies. (6) Combination with other techniques: The ADown module can be combined with other enhancement techniques, such as the HWD (Wavelet Downsampling) module, to further boost performance. The ADown network structure is shown in Figure 6.
Figure 6
By introducing the ADown module into YOLOv9, a significant reduction in the number of parameters can be achieved, while maintaining or even improving object detection accuracy. Consequently, this study explores the integration of the ADown module into the YOLOv11 network structure to further enhance detection performance.
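For reference, a PyTorch sketch of the ADown block following the YOLOv9 reference design is given below: the channels are split in half, one half is downsampled by a stride-2 3×3 convolution after average pooling, and the other half by max pooling followed by a 1×1 convolution. This is a minimal re-implementation for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv(nn.Module):
    """Standard convolution block: Conv2d + BatchNorm + SiLU."""
    def __init__(self, c1, c2, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ADown(nn.Module):
    """Downsamples a feature map by 2x with two lightweight parallel paths."""
    def __init__(self, c1, c2):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = Conv(c1 // 2, self.c, k=3, s=2, p=1)
        self.cv2 = Conv(c1 // 2, self.c, k=1, s=1, p=0)

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)
        x1, x2 = x.chunk(2, dim=1)                         # split channels into two halves
        x1 = self.cv1(x1)                                  # half 1: stride-2 3x3 conv
        x2 = F.max_pool2d(x2, kernel_size=3, stride=2, padding=1)
        x2 = self.cv2(x2)                                  # half 2: max pool + 1x1 conv
        return torch.cat((x1, x2), dim=1)                  # concatenate the two halves
```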
2.4.2 Detection head re-design
The detection decoupled head structure of YOLOv11n is shown in Figure 7. The extracted feature map is passed through two branches. One branch undergoes two 3×3 convolutions followed by a 1×1 convolution, while the other branch undergoes two depthwise separable convolutions (DWConv), two 3×3 convolutions, and a 1×1 convolution. These branches independently predict the bounding box regression and the classification outputs.
Figure 7
In YOLOv11, there are three of the aforementioned decoupled head structures, which perform detection on large, medium, and small feature maps. However, 3×3 convolutions, while increasing the channel depth, lead to a significant increase in the number of parameters and floating-point operations (Shafiq and Gu, ). Therefore, this study aims to implement a lightweight design for YOLOv11's detection head while maintaining detection accuracy:
(1) Introducing Group Convolutions to Replace 3×3 Convolutions.
Group Convolution is a convolution technique used in deep learning primarily to reduce computation and parameter quantities while enhancing the model’s representational power. Group convolution works by dividing the input feature map and convolution kernels into several groups. Each group performs its convolution operation independently, and the results are then merged. This process reduces the computation and parameter quantities while maintaining the same output size.
In traditional convolution operations, the convolution is applied across every channel of the input feature map. Assuming the input feature map has dimensions C_in × H × W (C_in is the number of input channels, H is the feature map height, and W is the feature map width), and the convolution kernel has dimensions C_out × C_in × k × k (C_out is the number of output channels and k × k is the spatial dimension of the kernel), the computation for a single convolution operation is C_out × C_in × k × k × H × W.
In group convolution, the input channels are divided into g groups, and independent convolution operations are performed within each group. In this case, the number of input channels per group becomes C_in/g, and the computation becomes C_out × (C_in/g) × k × k × H × W.
Group convolution can greatly reduce the number of parameters, enhance the model’s representational power, and avoid overfitting. Therefore, the 3×3 convolutions in the detection head are replaced with group convolutions.
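The parameter saving can be checked directly with PyTorch's groups argument; the channel counts and group number below are illustrative only.

```python
import torch.nn as nn

c_in, c_out, k, g = 64, 64, 3, 4  # illustrative channel counts and group number

standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)            # dense 3x3 conv
grouped = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, groups=g)   # 3x3 group conv

n_std = sum(p.numel() for p in standard.parameters())
n_grp = sum(p.numel() for p in grouped.parameters())
print(n_std, n_grp)  # weight count shrinks roughly by a factor of g (biases unchanged)
```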
(2) Shared Convolution Parameters.
To further reduce the parameters and computation of the detection head, the two branch inputs of the detection head share two group convolutions, named Detect_Efficient, with the structure shown in Figure 8. By sharing the same convolution kernel weights during loss calculation, redundant computation of similar feature maps is avoided, which further reduces the computation and effectively improves computational efficiency, accelerating the entire model inference process.
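One plausible reading of Detect_Efficient is sketched below: the scale-specific feature maps are first projected to a common width, pass through the same two group convolutions, and the box and classification branches then share that trunk. The layer widths and exact arrangement are assumptions; the authoritative layout is Figure 8.

```python
import torch
import torch.nn as nn

class SharedHead(nn.Module):
    """Illustrative shared detection head: two group convolutions reused by both
    branches (and across scales), then separate 1x1 convs for box and class outputs."""
    def __init__(self, ch=(64, 128, 256), nc=3, reg_out=64, groups=4, hidden=64):
        super().__init__()
        # 1x1 convs bring each scale to a common channel width so the shared convs can be reused
        self.align = nn.ModuleList(nn.Conv2d(c, hidden, 1) for c in ch)
        self.shared = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=groups), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=groups), nn.BatchNorm2d(hidden), nn.SiLU(),
        )
        self.box = nn.Conv2d(hidden, reg_out, 1)  # bounding-box regression branch
        self.cls = nn.Conv2d(hidden, nc, 1)       # classification branch (NO / OBL / FCC)

    def forward(self, feats):
        outs = []
        for f, align in zip(feats, self.align):
            h = self.shared(align(f))             # the same group-conv weights are reused
            outs.append((self.box(h), self.cls(h)))
        return outs
```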
Figure 8
2.4.3 GIoU loss function
The boundary box loss function is an important component of the object detection loss function. A well-defined boundary box loss function can significantly improve the performance of object detection models. In YOLOv11, CIoU is used as the regression box loss function. Although CIoU improves upon GIoU by introducing center distance and aspect ratio constraints, the additional constraints introduced by CIoU might lead to overfitting or convergence difficulties in orchard data collection, where there is a large variation in target size (due to close and distant objects) and where the aspect ratio differences of pear fruit bounding boxes are not significant. Moreover, compared to GIoU, the calculation of the aspect ratio parameter v in CIoU is relatively more complex (Zheng et al., ), resulting in higher computational costs during training and slower model convergence. Therefore, this study replaces CIoU with the GIoU loss function. The GIoU loss function is used in object detection to measure the difference between the predicted and ground truth boxes, addressing the issue where traditional IoU fails to provide effective gradient feedback when the predicted box and the ground truth box do not overlap. This improves the model’s convergence and accuracy. GIoU loss not only considers the overlapping region between boxes but also takes into account the spatial relationship between the boxes by introducing the concept of the minimal enclosing box. This allows the model to learn the shape and position of the boxes more accurately, ultimately enhancing the performance of object detection.
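For clarity, a minimal GIoU loss for axis-aligned boxes in corner format can be written as follows; this is the standard formulation rather than the exact implementation used in training.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for boxes in (x1, y1, x2, y2) format; a minimal sketch."""
    # intersection
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    # union and IoU
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    iou = inter / union

    # smallest enclosing box
    cx1 = torch.min(pred[..., 0], target[..., 0])
    cy1 = torch.min(pred[..., 1], target[..., 1])
    cx2 = torch.max(pred[..., 2], target[..., 2])
    cy2 = torch.max(pred[..., 3], target[..., 3])
    c_area = (cx2 - cx1) * (cy2 - cy1) + eps

    giou = iou - (c_area - union) / c_area
    return 1.0 - giou  # loss is 1 - GIoU
```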
2.5 Experimental environment and parameter settings
The experimental environment for this study runs on the Windows 10 operating system, equipped with 32 GB of memory and an NVIDIA GeForce RTX GPU, with an Intel(R) Core(TM) i7-F @2.10GHz processor. The deep learning framework used is PyTorch 2.0.1, with CUDA 11.8 and CUDNN 8.8.0.
The network training parameters are set as follows: The image input size is 640 × 640, and the batch size is set to 32; the maximum number of iterations is 200. The optimizer is SGD, with the learning rate dynamically adjusted using a cosine annealing strategy. The initial learning rate is set to 0.01, the momentum factor is 0.937, and the weight decay coefficient is 0..
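With the Ultralytics training interface, these settings correspond roughly to the call sketched below; the dataset YAML name is a placeholder, and the weight-decay value (not given above) is omitted.

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.train(
    data="pear.yaml",   # hypothetical dataset YAML listing the splits and NO/OBL/FCC classes
    epochs=200,
    imgsz=640,
    batch=32,
    optimizer="SGD",
    lr0=0.01,           # initial learning rate
    momentum=0.937,
    cos_lr=True,        # cosine-annealing learning-rate schedule
)
```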
2.6 Evaluation metrics
Object detection models should be evaluated using multiple metrics to provide a comprehensive assessment of their performance. To evaluate the performance of AHG-YOLO, seven metrics are used: precision, recall, average precision (AP), mean average precision (mAP), number of parameters, model size, and GFLOPs. These metrics offer a well-rounded evaluation of AHG-YOLO's performance in the multi-category pear fruit detection task within the complex environment of a pear orchard. They reflect the model's performance across various dimensions, including accuracy, recall, speed, and efficiency. The formulas for calculating the relevant performance metrics are provided, as shown in Equations 1-4.
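In their standard forms (a reconstruction assuming the conventional definitions), Equations 1-4 are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1) \qquad
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)

AP = \int_{0}^{1} P(R)\,\mathrm{d}R \quad (3) \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (4)
```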
Where TP represents the number of true positive samples that the model correctly predicted as positive, FP represents the number of false positive samples that the model incorrectly predicted as positive, and FN represents the number of false negative samples that the model incorrectly predicted as negative. AP refers to the area under the Precision-Recall (P-R) curve, while mAP refers to the mean value of the AP for each class.
3 Results
3.1 Ablation experiment
To evaluate the effectiveness and feasibility of the proposed AHG-YOLO model in detecting pear fruits with no occlusion, partial occlusion, and fruit overlap, an ablation experiment was conducted based on the YOLOv11n model. Each improvement method and the combination of two improvement methods were added to the YOLOv11n model and compared with the AHG-YOLO model. In the experiment, the hardware environment and parameter settings used for training all models remained consistent. Table 2 shows the ablation experiment results of the improved YOLOv11n model and the AHG-YOLO model on the test set. After introducing the ADown downsampling module to enhance the feature extraction capability of the YOLOv11 network, the model’s precision, recall, AP, and mAP@0.5:0.95 increased by 2.2%, 2.6%, 1.8%, and 2.1%, respectively. The model’s parameter count decreased by 18.6%, GFLOPs decreased by 15.9%, and model size decreased by 17.3%. This indicates that the ADown module can effectively improve the pear object detection accuracy. After introducing the EfficientHead detection head, although the model’s precision, recall, and AP decreased slightly, mAP@0.5:0.95 increased by 0.1%, the model’s parameter count reduced by 10.4%, GFLOPs reduced by 19.0%, and model size decreased by 9.62%. This suggests that EfficientHead plays a significant role in model lightweighting. As shown in Table 2, after introducing the ADown module and GIoU, although the model’s parameter count increased, precision, recall, mAP@0.5, and mAP@0.5:0.95 increased by 1.5%, 1.7%, 0.4%, and 1.6%, respectively. After introducing the ADown module and EfficientHead, precision, recall, mAP@0.5, and mAP@0.5:0.95 increased by 2.3%, 2.2%, 1.7%, and 2.5%, and the model’s parameter count, GFLOPs, and model size all decreased. Additionally, after introducing EfficientHead and GIoU, recall, mAP@0.5, and mAP@0.5:0.95 all increased compared to their individual introduction, without increasing the parameter count. Finally, the proposed AHG-YOLO network model outperforms the original YOLOv11 model, with precision, recall, mAP@0.5, and mAP@0.5:0.95 improving by 2.5%, 3.6%, 2.3%, and 2.6%, respectively. Meanwhile, GFLOPs are reduced to just 4.7, marking a 25.4% decrease compared to the original YOLOv11n, the parameter count decreased by 16.9%, and the model size is only 5.1MB.
Table 2
According to the data in Table 2, the mAP@0.5 of YOLOv11-A reached 93.6%, an improvement over the baseline model YOLOv11n. However, when H or G was added individually, the mAP@0.5 dropped to 90.7% and 90.8%, respectively. When combined with the A module, the mAP values increased again. The reasons for this can be analyzed as follows: The ADown module significantly improves baseline performance by preserving discriminative multi-scale features through adaptive downsampling. The EfficientHead method reduces model parameters and computational load compared to the baseline model, but the simplified model structure leads to information loss and a decrease in detection accuracy. GIoU performs poorly on bounding box localization in raw feature maps, resulting in a drop in detection accuracy. When combined with ADown, the ADown module optimizes the features, providing better input for the subsequent EfficientHead and GIoU, thus leveraging the complementary advantages between the modules. The optimized features from ADown reduce the spatial degradation caused by EfficientHead, maintaining an mAP@0.5 of 93.5%, while reducing GFLOPs by 11.3%. ADown's noise suppression allows GIoU to focus on key geometric deviations, improving localization robustness. The synergy of all three modules achieves the best accuracy-efficiency balance (94.1% mAP@0.5, 4.7 GFLOPs), where ADown filters low-level redundancies, EfficientHead enhances discriminative feature aggregation, and GIoU refines boundary precision. This analysis shows that H and G are not standalone solutions; they require the preprocessing from ADown to maximize their effectiveness.
Figures 9 and 10 show the performance of AHG-YOLO compared to YOLOv11n during the training process. From Figures 9, 10, it can be seen that during 200 training iterations, the proposed AHG-YOLO achieves higher detection accuracy and obtains lower loss values compared to YOLOv11n. This indicates that the AHG-YOLO network model can effectively improve the detection accuracy of pears in unstructured environments and reduce the false detection rate.
Figure 9
Figure 10
The Grad-CAM (Selvaraju et al., ) method is used to generate heatmaps to compare the feature extraction capabilities of the YOLOv11n model and the AHG-YOLO model in complex scenarios such as overlapping fruits, small target fruits, and fruit occlusion, as shown in Figure 11. Figure 11 shows that the AHG-YOLO model exhibits better performance in complex scenarios. The specific quantitative results comparison can be found in Section 3.3.
Figure 11
To further validate the detection performance, experiments were conducted on the test dataset for both the YOLOv11n model and the AHG-YOLO model. The detection results are shown in Figure 12, where the red circles represent duplicate detections and the yellow circles represent missed detections. By comparing Figures 12A, B, it can be observed that YOLOv11n has one missed detection. By comparing Figures 12C, D, it can be seen that YOLOv11n has one duplicate detection. By comparing Figures 12E, F, it can be seen that YOLOv11n has two duplicate detections and one missed detection. This demonstrates that AHG-YOLO can accurately perform multi-class small object detection and classification in complex environments, exhibiting high accuracy and robustness, and effectively solving the pear detection problem in various scenarios within complex environments.
Figure 12
3.2 Detection results of pear targets in different classes
Figure 13 shows the AP results for multi-category detection for occluded pear fruits in complex orchard scenes by different networks on the test set. Table 3 presents the specific detection results of AHG-YOLO and YOLOv11n for different categories of pear targets on the test set. From Figure 13 and Table 3, it can be observed that the base YOLOv11 network performs best in detecting NO fruit, with an AP value of 93.4%, but performs relatively poorly when detecting FCC and OBL fruits. The proposed AHG-YOLO model improves the AP for detecting FCC fruits by 2.6%, reaching 93.4%, improves the AP for detecting OBL fruits by 2.4%, reaching 93.5%, and improves the AP for detecting NO fruits by 1.9%, reaching 95.3%. This indicates that the proposed method is highly effective for fruit target detection in complex environments, demonstrating both excellent accuracy and robustness.
Figure 13
Table 3
3.3 Comparison with mainstream object detection models
AHG-YOLO was compared with other mainstream object detection networks, and the detection results on the test set are shown in Table 4. The experimental results of all models indicate that YOLOv9c achieves the highest precision, mAP@0.5, and mAP@0.5:0.95 among all models. However, the YOLOv9c model has excessively large parameters, GFLOPs, and model size, making it unsuitable for real-time detection in harvesting robots. AHG-YOLO’s mAP@0.5 surpasses that of Faster R-CNN, RTDETR, YOLOv3, YOLOv5n, YOLOv7, YOLOv8n, YOLOv10n, and YOLOv11n by 15.1%, 0.9%, 2.4%, 3.9%, 12.6%, 3.4%, 5.2%, and 2.3%, respectively. In terms of precision, recall, mAP@0.5:0.95, and GFLOPs, AHG-YOLO also shows advantages. Therefore, based on a comprehensive comparison of all metrics, AHG-YOLO is better suited for pear target detection tasks in complex environments.
Table 4
4 Discussion
YOLO series detection algorithms are widely used in fruit detection due to their high detection accuracy and fast detection speed. These algorithms have been applied to various fruits, such as tomatoes (Wu M, et al., ), kiwifruits (Yang et al., ), and apples (Wu H, et al., ), achieving notable results. Researchers have always been focused on designing lightweight algorithms, and this is also true for pear fruit target detection. Tang et al. () proposed a pear target detection method based on an improved YOLOv8n for fragrant pears. Using their self-built fragrant pear dataset, they improved the F0.5-score and mAP by 0.4 and 0.5 percentage points compared to the original model, reaching 94.7% and 88.3%, respectively. Li et al. () introduced the advanced multi-scale collaborative perception network YOLOv5sFP for pear detection, achieving an AP of 96.12% and a model size of 50.01 MB.
While these studies have achieved remarkable results, they did not address the practical needs of robotic harvesting, as they focused solely on detecting a single class of pear fruits. This study takes into account the detection requirements for robotic harvesting, categorizing pear fruits in orchards into three groups (NO, OBL, FCC) to enable the harvesting robot to develop different harvesting strategies based on conditions of no occlusion, branch and leaf occlusion, and fruit overlap, thus improving harvesting efficiency. Compared to commonly used detection models, the AHG-YOLO proposed in this study achieves the highest detection accuracy in complex orchard environments, with an mAP@0.5 of 94.1%.
Figure 14 shows three examples of detection errors when using AHG-YOLO for multi-category detection of occluded pear fruits. The potential causes of these errors are as follows: (1) In cloudy, dim lighting conditions, when fruits are tightly clustered and located at a distance, the fruit targets appear small, making feature extraction challenging. This leads to repeated detection of FCC fruits, as seen in the lower right red circle of Figure 14A. Additionally, the dim lighting causes the occluded pear’s features to resemble those of the leaves, resulting in the model mistakenly detecting leaves as OBL fruits, as shown in the upper left red circle of Figure 14A. (2) When the target is severely occluded, the model struggles with feature extraction, which may lead to either missed detections or repeated detection, as shown in Figure 14B. The yellow bounding box indicates a missed detection, and the red circle indicates a repeated detection. (3) Detecting FCC fruits is particularly challenging because the fruits are often clustered together, making it difficult to distinguish between them. Furthermore, the fruit bags sometimes interfere with the detection process, causing errors, as seen in Figure 14C, where the bag is incorrectly detected as an FCC fruit.
Figure 14
To enhance the accuracy of AHG-YOLO in multi-category detection of occluded pear fruits, the following measures can be taken: (1) Increase the number of samples that are prone to detection errors, such as FCC and OBL class samples, to diversify the dataset and improve the model's detection capability in complex environments. (2) Further refine the model's feature extraction capability, particularly for detecting small targets.
Although the AHG-YOLO model has some limitations in multi-category detection of occluded pear fruits, it achieves an overall detection mAP of 94.1%, which meets the fruit detection accuracy requirements for orchard automation in harvesting. This provides crucial technical support for robotic pear harvesting in orchards. The AHG-YOLO model will be applied to the visual detection system of pear fruit-picking robots to validate its reliability.
5 Conclusion
This paper proposes the AHG-YOLO network model for multi-category detection of occluded pear fruits in complex orchard scenes. Using YOLOv11n as the base model, the ADown downsampling method, lightweight detection head, and GIoU loss function are integrated to enhance the network’s feature extraction capability and reduce the model’s complexity, making it suitable for real-time harvesting applications. The conclusions are as follows:
(1) Experimental results in complex pear orchard environments demonstrate that the mAP of AHG-YOLO for multi-category detection of occluded pear fruits is 94.1%, with the AP for FCC, OBL, and NO fruits being 93.4%, 93.5%, and 95.3%, respectively. Compared to the base YOLOv11n network, precision, recall, mAP@0.5, and mAP@0.5:0.95 improved by 2.5%, 3.6%, 2.3%, and 2.6%, respectively. Additionally, GFLOPs are reduced to 4.7, representing a 25.4% decrease compared to the original YOLOv11n, while the number of parameters is reduced by 16.9%, and the model size is just 5.1 MB.
(2) Compared with eight other commonly used object detection methods, AHG-YOLO achieves the highest detection accuracy while maintaining a lightweight design. The mAP@0.5 is 15.1%, 0.9%, 2.4%, 3.9%, 12.6%, 3.4%, 5.2%, and 2.3% higher than Faster R-CNN, RTDETR, YOLOv3, YOLOv5n, YOLOv7, YOLOv8n, YOLOv10n, and YOLOv11n, respectively, thereby meeting the real-time harvesting requirements of orchards.
In summary, the AHG-YOLO model proposed in this paper provides a solid methodological foundation for real-time pear target detection in orchard environments and supports the development of pear-picking robots. Future work will focus on further validating the effectiveness of the method in pear orchard harvesting robots, with ongoing optimization efforts.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.
Author contributions
NM: Writing – original draft, Writing – review & editing, Conceptualization, Investigation, Software, Supervision, Visualization. YS: Data curation, Investigation, Software, Validation, Visualization, Writing – original draft. CL: Data curation, Visualization, Writing – review & editing. ZL: Data curation, Software, Writing – review & editing. HS: Conceptualization, Funding acquisition, Supervision, Visualization, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This research was supported by the key R&D Program of Shanxi Province (CYJSTX07-23); the Fundamental Research Program of Shanxi Province (No. ).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Alif, M. and Hussain, M. (). YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain. arxiv preprint arxiv:.. 13. doi: 10./arXiv..
Bharad, N. and Khanpara, B. (). Agricultural fruit harvesting robot: An overview of digital agriculture. Plant Arch. 24, 154–160. doi: 10./PLANTARCHIVES..v24.SP-GABELS.023
Chen, Z., Lei, X., Yuan, Q., Qi, Y., Ma, Z., Qian, S., et al. (). Key technologies for autonomous fruit-and vegetable-picking robots: A review. Agronomy 14, 1-2. doi: 10./agronomy
Dhanya, V., Subeesh, A., Kushwaha, N., Vishwakarma, D., Kumar, T., Ritika, G., et al. (). Deep learning based computer vision approaches for smart agricultural applications. Artif. Intell. Agric. 6, 211–229. doi: 10./j.aiia..09.007
Gao, X. (). Research on Path Planning of Apple Picking Robotic Arm Based on Algorithm Fusion and Dynamic Switching [D]. Hebei Agricultural University. doi: 10./d.cnki.ghbnu..
Horvat, M., Jelečević, L., and Gledec, G. (). “A comparative study of YOLOv5 models performance for image localization and classification,” in Central European Conference on Information and Intelligent Systems. Varazdin, Croatia: Faculty of Organization and Informatics Varazdin. 349–356.
Jiang, K., Itoh, H., Oda, M., Okumura, T., Mori, Y., Misawa, M., et al. (). Gaussian affinity and GIoU-based loss for perforation detection and localization from colonoscopy videos. Int. J. Comput. Assisted Radiol. Surg. 18, 795–805. doi: 10./s-022--x
Khanam, R. and Hussain, M. (). Yolov11: An overview of the key architectural enhancements. arxiv preprint arxiv:.. 3–7. doi: 10./arXiv..
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., et al. (). YOLOv6: A single-stage object detection framework for industrial applications. arxiv preprint arxiv:.. doi: 10./arXiv..
Liu, Z., Abeyrathna, R., Sampurno, R., Nakaguchi, V., and Ahamed, T. (). Faster-YOLO-AP: A lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput. Electron. Agric. 223, . doi: 10./j.compag..
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., et al. (). “Ssd: Single shot multibox detector,” in Computer Vision–ECCV : 14th European Conference, Amsterdam, The Netherlands, October 11–14, , Proceedings, Part I 14. Cham, Switzerland: Springer International Publishing. 21–37.
Liu, J. and Liu, Z. (). The vision-based target recognition, localization, and control for harvesting robots: A review. Int. J. Precis. Eng. Manufacturing 25, 409–428. doi: 10./s-023--7
Lu, Y., Du, S., and Ji, Z. (). ODL Net: Object detection and location network for small pears around the thinning period. Comput. Electron. Agric. 212, . doi: 10./j.compag..
Nan, Y., Zhang, H., Zeng, Y., Zheng, J., and Ge, Y. (). Intelligent detection of Multi-Class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 208, . doi: 10./j.compag..
Parsa, S., Debnath, B., and Khan, M. (). Modular autonomous strawberry picking robotic system. J. Field Robotics 41, –. doi: 10./rob.
Ren, S., He, K., Girshick, R., and Sun, J. (). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, –. doi: 10./TPAMI..
Ren, R., Sun, H., Zhang, S., Wang, N., Lu, X., Jing, J., et al. (). Intelligent Detection of lightweight “Yuluxiang” pear in non-structural environment based on YOLO-GEW. Agronomy 13, . doi: 10./agronomy
Selvaraju, R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., and Batra, D. (). Grad-CAM: Why did you say that? arxiv preprint arxiv:.. 2-4. doi: 10./arXiv..
Seo, H., Sawant, S., and Song, J. (). Fruit cracking in pears: its cause and management—a review. Agronomy 12, . doi: 10./agronomy
Shafiq, M. and Gu, Z. (). Deep residual learning for image recognition: A survey. Appl. Sci. 12, . doi: 10./app
Shi, Y., Duan, Z., and Qing, S. (). YOLOV9S-Pear: A lightweight YOLOV9S-based improved model for young Red Pear small-target recognition. Agronomy 14, . doi: 10./agronomy
Sohan, M., Sai Ram, T., and Rami Reddy, C. (). A review on yolov8 and its advancements. Int. Conf. Data Intell. Cogn. Inf., 529–545. doi: 10./978-981-99--2_39
Sun, H., Wang, B., and Xue, J. (). YOLO-P: An efficient method for pear fast detection in complex orchard picking environment. Front. Plant Sci. 13. doi: 10./fpls..
Tang, Y., Qiu, J., Zhang, Y., Wu, D., Cao, Y., Zhao, K., et al. (). Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 24, –. doi: 10./s-023--9
Tang, Z., Xu, L., Li, H., Chen, M., Shi, X., Zhou, L., et al. (). YOLOC-tiny: a generalized lightweight real-time detection model for multiripeness fruits of large non-green-ripe citrus in unstructured environments. Front. Plant Sci. 15. doi: 10./fpls..
Tzutalin, D. (). LabelImg. GitHub repository 6, 4. Available online at: https://github.com/tzutalin/labelImg.
Vrochidou, E., Tsakalidou, V., Kalathas, I., Gkrimpizis, T., Pachidis, T., and Kaburlasos, V. (). An overview of end efectors in agricultural robotic harvesting systems. Agriculture 12, . doi: 10./agriculture
Wang, C., Bochkovskiy, A., and Liao, H. (). “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, New Jersey, USA. –.
Wang, J., Gao, K., Jiang, H., and Zhou, H. (). Method for detecting dragon fruit based on improved lightweight convolutional neural network. Nongye Gongcheng Xuebao/ Trans. Chin. Soc Agric. Eng. 36, 218–225. doi: 10./j.issn.-..20.026
Wang, C., Yeh, I., and Mark, L. (). “Yolov9: Learning what you want to learn using programmable gradient information,” in European conference on computer vision. 1–21. doi: 10./arXiv..
Wei, J., Ni, L., Luo, L., Chen, M., You, M., Sun, Y., et al. (). GFS-YOLO11: A maturity detection model for multi-variety tomato. Agronomy 14, . doi: 10./agronomy
Wu, M., Lin, H., Shi, X., Zhu, S., and Zheng, B. (). MTS-YOLO: A multi-task lightweight and efficient model for tomato fruit bunch maturity and stem detection. Horticulturae 10, . doi: 10./horticulturae
Wu, H., Mo, X., Wen, S., Wu, K., Ye, Y., Wang, Y., et al. (). DNE-YOLO: A method for apple fruit detection in Diverse Natural Environments. J. King Saud University-Computer Inf. Sci. 36, . doi: 10./j.jksuci..
Yang, J., Ni, J., Li, Y., Wen, J., and Chen, D. (). The intelligent path planning system of agricultural robot via reinforcement learning. Sensors 22, . doi: 10./s
Yang, Y., Su, L., Zong, A., Tao, W., Xu, X., Chai, Y., et al (). A New Kiwi Fruit Detection Algorithm Based on an Improved Lightweight Network. Agriculture 14 (10), . doi: 10./agriculture
Zhang, J., Kang, N., Qu, Q., Zhou, L., and Zhang, H. (). Automatic fruit picking technology: a comprehensive review of research advances. Artificial Intelligence Review 57 (3), 54. doi: 10./s-023--2
Zhao, P., Zhou, W., and Na, L. (). High-precision object detection network for automated pear picking. Sci. Rep. 14, . doi: 10./s-024--6
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (). “Distance-IoU loss: Faster and better learning for bounding box regression,” in Proceedings of the AAAI conference on artificial intelligence, Vol. 34. –. Menlo Park, California, USA. doi: 10./aaai.v34i07.
WCS-YOLOv8s: an improved YOLOv8s model for target ...
1 Introduction
Strawberry is a common fruit with a sweet taste that is widely loved, and it is known as the ‘Queen of Fruits’. Strawberries are rich in free sugars, organic acids, and other important ingredients that have health benefits such as protecting eyesight and promoting digestion, and they have anti-inflammatory properties (Ikegaya, ; Newerli-Guz et al., ; Gasparrini et al., ; Afrin et al., ). At present, there is an increasing demand for strawberry fruits; however, due to the complexity of fruit identification and localization during strawberry growth, the level of intelligent and mechanized strawberry fruit picking is still very low, and relying on manual picking increasingly fails to satisfy the market's demand for strawberries (Zhao et al., ; Ibba et al., ).
Flower and fruit thinning is an important part of orchard management, directly affecting fruit yield and quality and preventing early plant failure. During the strawberry growth process, the best times for flower thinning and fruit thinning are the bud and flower stages. Research has shown that rational flower and fruit thinning can remove deformed, diseased, and excessive fruits, helping to regulate the plant’s nutrient supply to the fruits, improving fruit quality, and increasing the yield by 20%–30% (Domingos et al., ; Yu et al., ). The key to automating flower thinning, fruit thinning, and picking is to achieve target identification and localization of strawberries (Castle et al., ). Most of the current research focused on the detection of ripe strawberries, while less research had been conducted on the strawberry bud and flower stages. The research in this paper included the bud stage and blossom stage during strawberry growth, which could provide technical support for the realization of automated flower thinning, fruit thinning, and picking of strawberries.
Computer vision has been widely applied in agriculture, food, transportation, and other fields (Zhao et al., ; Zou et al., ; Yang et al., ; Babu et al., ; Singh et al., ). The use of computer vision technology to identify strawberries has broad application potential and provides theoretical support for robot picking and automated orchard management in strawberry production. Currently, strawberry picking mainly relies on manual labor, where workers rely on their own experience to identify and pick ripe strawberries. However, due to inconsistent evaluation standards and the diversity of strawberry varieties, the optimal picking period is often missed (Ge et al., ). Traditional methods for detecting flowers and fruits mainly involve machine vision techniques that autonomously extract features such as shape, texture, and size based on human experience (Rizon et al., ). For example, Lin et al. proposed a support vector machine (SVM) model for identifying citrus and tomatoes based on color and contour features (Lin et al., ). Guru et al. achieved flower classification by using threshold segmentation methods and feature extraction on flower images (Guru et al., ). Xu et al. used hue, saturation, value (HSV) color information to detect strawberry regions and combined this information with an SVM classifier (Xu et al., ). Although these methods offer some solutions, the manual extraction of features based on personal experience makes it difficult to extract deep feature information from images, resulting in lower robustness and recognition accuracy of models built using traditional machine vision techniques (Ma et al., ). In contrast, deep learning technology, by extracting deeper features from image data, has improved the accuracy and speed of object detection in complex environments (Wang et al., ). Deep learning technology has been widely applied in the detection research on strawberry, apple, and other fruit flowers for maturity and yield (Guo et al., ; Ismail and Malik, ; Wang et al., ; Wang and He, ). Font et al. developed a computer vision system based on color and specular reflection patterns for the rapid and accurate estimation of apple orchard yields. However, this system had the drawback of relying on artificial lighting at night to reduce the influence of natural light (Font et al., ). Lin et al. established a strawberry flower detection algorithm based on Faster R-CNN, achieving the detection of strawberry flowers in outdoor environments with overlapping flowers and complex backgrounds (Lin et al., ). Zhang et al. reduced the number of convolutional layers and CBL modules in the CSPNet backbone and established a real-time strawberry monitoring algorithm based on YOLOv4 Tiny, achieving rapid and real-time detection of strawberries (Zhang et al., ). Binocular cameras have gradually been applied in the research of target recognition and positioning. Qi et al. established a TCYOLO algorithm with CSPDenseNet and CSPResNeXt as the dominant networks, achieving accurate detection of chrysanthemum flowers (Qi et al., ). Hu et al. used a ZED stereo camera to perform three-dimensional positioning of strawberries. The strawberry detection and positioning method proposed in the study can effectively provide the precise location of mature strawberries for picking robots (Hu et al., ). Fu et al. improved the YOLOv3-tiny model and developed an algorithm for the automatic, rapid, and accurate detection of kiwifruit in orchards.
The experimental results showed that the improved model is small and efficient, with high detection accuracy (Fu et al., ). Bai et al. built a YOLO real-time recognition algorithm to achieve accurate flower and fruit recognition of strawberry seedlings in a greenhouse (Bai et al., ). However, there has been no research on the use of binocular positioning cameras for target recognition and positioning of the entire growth process of strawberries (bud, flower, unripe, and ripe stages) nor has there been any research addressing the practical needs of orchards for automated thinning of flowers and fruits and the detection and positioning of mature strawberries. Orchards urgently need to achieve automated management of the entire growth process of strawberries.
In this paper, a new model of strawberry target identification and localization based on the YOLOv8s model, named the WCS-YOLOv8s model, is innovatively proposed for the four stages of the strawberry growth process (bud, flower, fruit under-ripening, and fruit ripening stages) that provides supervision of the whole growth process of strawberries. The model provided a reliable new method for target identification and localization for the automated supervision of the whole strawberry growth process, leading to fruit picking and quality improvements. The improvement and innovation points of this paper include:
1. A data enhancement strategy based on the Warmup learning rate is proposed in this paper, which could provide a stable convergence direction for the model and avoid oscillations at the early stage of training.
2. The model introduced the Context Guide Fusion Module (CGFM), which used the multi-head self-attention mechanism to fuse different information and improve the recognition accuracy in complex scenes.
3. The model proposed the Squeeze-and-Excitation-Enhanced Multi-Scale Depthwise Attention (SE-MSDWA) module, which combined multi-scale convolution and SEAttention to enhance the feature extraction efficiency of the samples and significantly improved the detection effect of the model in complex scenes.
2 Materials and methods
2.1 Sample collection and dataset construction
The samples were collected from March to May . The collection site was Hongshiyi strawberry planting orchard in Shandong Province (121.49°E, 36.77°N). A total of 1,957 sample images (image size of 640 × 640 pixels) were collected. The sample varieties included ‘Sweet Treasure’, ‘Red Face’, ‘Fengxiang’, ‘Miaoxiang’, and ‘Zhangji’.
The sample collection tools were a laptop (CPU: Intel(R) Xeon(R) E5– v4; GPU: NVIDIA ) and an Intel® RealSense D435i binocular depth camera (Intel®, United States of America; depth resolution and frame rate are ×720 and 90 FPS (maximum), respectively; binocular detection range of 0.105–10 m). The image dataset was acquired using the above laptop, camera, and setting parameters. The sample collection method involved collecting sample images using the laptop connected to the Intel® D435i camera (shooting distance of 0.3–0.8 m) from 07:00 to 19:00 every day in 10 sessions. Sample data were collected from different growing sheds to eliminate data bias due to geographical location and variety. Images contained bright and shady light and complex environments and backgrounds. The dataset was randomly divided into three subsets: training set, validation set, and testing set, with a ratio of 7:2:1.
This paper classified the samples into four stages according to the fruit growers’ planting experience, namely, the bud, flower, fruit under-ripening, and fruit ripening stages, and collected image data for these four stages. The identification criteria for immature strawberries were that the color of the fruit was light red or green covering a large area, the fruit was not full, and the size was slightly small. The identification criteria for mature strawberries were that the color of the fruit was bright red and the fruit was large and full. Some samples are shown in Figure 1. In Figures 1a–d are strawberry samples in the bud, flower, fruit under-ripening, and fruit ripening stages, respectively.
Figure 1
2.2 WCS-YOLOv8s model construction
2.2.1 Overall structure of the WCS-YOLOv8s model
YOLOv8 is a powerful real-time object detection algorithm that uses an end-to-end architecture to achieve regression and prediction of a target’s category and location using feature extraction and fusion of input images through convolutional neural networks. The YOLOv8 framework is divided into four main components: input layer, backbone network, neck network, and prediction layer (Simanjuntak et al., ).
In this paper, improvements were made to YOLOv8s. First, Warmup data augmentation was used, i.e., the original data augmentation strategy was changed to gradually increase the probability of data augmentation occurring as the epoch changes. Second, the self-developed SE-MSDWA module was applied at the end of the backbone network to achieve efficient feature extraction, ensuring the model focused on the region of interest. Finally, the neck network was improved by using the CGFM module to enhance the feature fusion performance of the network. Based on the above, this paper constructed the WCS-YOLOv8s model for target identification and localization during the whole strawberry growth process, and the network framework of the constructed model is shown in Figure 2.
Figure 2
2.2.2 Data enhancement with Warmup
Warmup was first mentioned in ResNet as a way to optimize for learning rate (Nakamura et al., ). The method of using Warmup to warm up the learning rate causes the model to gradually stabilize at a smaller learning rate during the first few epochs of training, and when the model is stabilized, it can then be trained using the pre-set learning rate, which speeds up the convergence of the model and improves the model effect. The initial use of lower probability data transformation helps the model to learn the relationship between samples, improving the adaptability to different data distributions, causing the model to enter the training process smoothly, and avoiding falling into the local optimal solution. Gradually increasing the probability of sample transformation with the training process further enhances the generalization ability of the model and improves the robustness. Through Warmup data enhancement, the model learns and generalizes effectively.
In this paper, data augmentation was performed at the beginning of training using smaller probabilities. When the training proceeded to 1/5 of the total number of rounds, all data enhancements were performed as default in YOLOv8. Equation 1 for the variation of data enhancement probability is shown below:
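A plausible form, assuming the augmentation probability ramps linearly from a small value to the YOLOv8 default over the first fifth of training (the exact expression is an assumption, with p_default denoting the default augmentation probability), is:

```latex
p_{aug} = p_{default} \cdot \min\!\left(1,\ \frac{current\_epoch}{total\_epochs / 5}\right) \quad (1)
```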
where current_epoch is the current epoch in the training process and total_epochs is the total number of training epochs.
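Because Equation 1 is not reproduced above, the following Python sketch illustrates one plausible form of the schedule it describes, assuming the augmentation probability ramps linearly from a small initial value to the YOLOv8 default once training passes 1/5 of the total epochs; p_init, p_default, and warmup_fraction are illustrative names and values, not taken from the paper.

def warmup_aug_probability(current_epoch, total_epochs,
                           p_init=0.1, p_default=1.0, warmup_fraction=0.2):
    """Linearly ramp the data-augmentation probability during the warmup phase.

    After warmup_fraction * total_epochs, augmentations run with the default
    YOLOv8 probability; before that, the probability grows linearly from p_init.
    This is an illustrative reading of Equation 1, not the authors' exact formula.
    """
    warmup_epochs = warmup_fraction * total_epochs
    if current_epoch >= warmup_epochs:
        return p_default
    return p_init + (p_default - p_init) * current_epoch / max(warmup_epochs, 1)

# Example: with 100 training epochs, the probability reaches the default at epoch 20.
for epoch in (0, 10, 20, 50):
    print(epoch, round(warmup_aug_probability(epoch, 100), 3))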
2.2.3 SE-MSDWA module in the model
The SE-MSDWA module aims to enhance the feature extraction capability and overall performance of the convolutional neural network by combining depthwise separable convolution, multi-scale convolution, and an SE block. The module first performs a depthwise convolution, convolving each channel independently to extract important spatial information, and then exchanges information between channels through a pointwise convolution. Next, three groups of convolution kernels at different scales perform multi-scale convolution: the kernel pairs [(1, 5), (5, 1)], [(1, 9), (9, 1)], and [(1, 17), (17, 1)] capture small-, medium-, and large-scale features, respectively. The resulting feature maps are fused with multi-scale information through an additional convolutional layer and finally passed to the SE module. The SE module applies adaptive average pooling to reduce each channel's feature map to 1 × 1, computes per-channel weights with two fully connected layers and activation functions, and reapplies these weights to the original feature map, as shown in Figure 3.
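A minimal PyTorch sketch of a module following this description is given below; the exact channel counts, kernel grouping, activation choices, and SE reduction ratio are assumptions, since the paper describes the structure only qualitatively.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool to 1 x 1, two FC layers, channel reweighting."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class SEMSDWA(nn.Module):
    """Sketch of SE-MSDWA: depthwise-separable conv, multi-scale strip convs, SE block."""

    def __init__(self, channels):
        super().__init__()
        # Depthwise + pointwise (depthwise-separable) convolution.
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)
        # Three multi-scale branches built from paired strip convolutions.
        self.branches = nn.ModuleList()
        for k in (5, 9, 17):
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        # Fusion convolution and SE channel reweighting.
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.se = SEBlock(channels)

    def forward(self, x):
        y = self.pw(self.dw(x))
        y = y + sum(branch(y) for branch in self.branches)
        return self.se(self.fuse(y))

# Example: the module preserves the feature map shape.
out = SEMSDWA(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])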
The SE-MSDWA module effectively solved the deficiencies of traditional convolutional layers in capturing multi-scale features and handling feature redundancy. The module significantly enhanced the feature representation capability of the network by dynamically adjusting the channel weights, thus improving the performance of the model in various computer vision tasks. The module enhanced the network's adaptability in different scenes and tasks through multi-scale feature extraction and attention mechanisms.
2.2.4 CGFM module in the model
Simple concatenation (Concat) has limitations in deep learning and cannot fully exploit the complementary effects of different features. On this basis, this paper proposed the CGFM, a feature fusion structure based on a self-attention mechanism, to improve the performance and efficiency of the model. The CGFM is an innovative feature fusion module designed to improve the Feature Pyramid Network (FPN) in YOLOv8s. The module first concatenates two different feature maps, input1 and input2, along the channel dimension, processes the result with a convolution-based multi-head self-attention mechanism, and then adjusts the number of channels by convolution before splitting the result in two. Next, each of the split feature maps is multiplied element-wise with one of the two inputs and added to the other input to obtain the blended features. Finally, the two blended features are concatenated to achieve feature fusion and cross-interaction, which improves the feature fusion capability of the neck network. The multi-head self-attention mechanism in the CGFM strengthens important features and suppresses unimportant ones, and the detail enhancement improves the discriminative power and visual quality of the fused features. The detailed structure of the CGFM is shown in Figure 4.
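Since the CGFM is described only at a structural level, the PyTorch sketch below shows one possible reading of that description; the attention implementation, channel handling, and the exact cross-multiplication pattern are assumptions rather than the authors' definitive design.

import torch
import torch.nn as nn

class CGFMSketch(nn.Module):
    """One possible reading of the CGFM: concatenate two inputs, apply multi-head
    self-attention, split, cross-multiply with the inputs, and concatenate."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(2 * channels, num_heads, batch_first=True)
        self.adjust = nn.Conv2d(2 * channels, 2 * channels, 1)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x1, x2):
        b, c, h, w = x1.shape
        fused = torch.cat([x1, x2], dim=1)                # (b, 2c, h, w)
        tokens = fused.flatten(2).transpose(1, 2)         # (b, h*w, 2c)
        attn_out, _ = self.attn(tokens, tokens, tokens)   # multi-head self-attention
        fused = attn_out.transpose(1, 2).reshape(b, 2 * c, h, w)
        fused = self.adjust(fused)
        s1, s2 = fused.chunk(2, dim=1)                    # split into two halves
        # Cross interaction: each half gates one input and is added to the other.
        blended1 = s1 * x1 + x2
        blended2 = s2 * x2 + x1
        return self.out(torch.cat([blended1, blended2], dim=1))

# Example with two same-sized feature maps, as in a neck fusion stage.
m = CGFMSketch(32)
y = m(torch.randn(1, 32, 20, 20), torch.randn(1, 32, 20, 20))
print(y.shape)  # torch.Size([1, 32, 20, 20])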
3 Results and discussion
3.1 Experimental platform
Model training and evaluation were performed on the following computer and operating system. The experimental platform configuration was as follows: CPU: Intel(R) Xeon(R) E5 v4; GPU: NVIDIA; OS: Ubuntu. The programming language was Python 3.8.19, and CUDA 11.6 acceleration was used to improve training efficiency. In the experiments, the input image resolution was 640 × 640 pixels and each training batch contained 32 samples. The Adam optimizer was employed with an initial learning rate of 0.01, and the learning rate was adjusted automatically by a cosine annealing decay schedule. After 100 training epochs, the best model weight file was saved and used for model evaluation (Yang et al., ).
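For illustration, a training run with these hyperparameters might be launched through the Ultralytics YOLO API roughly as follows; the dataset YAML and the custom model configuration file name are hypothetical placeholders, not files released by the authors.

from ultralytics import YOLO

# Hypothetical configuration file describing the modified network; the model
# and dataset YAML names below are placeholders, not the authors' files.
model = YOLO("wcs-yolov8s.yaml")

model.train(
    data="strawberry.yaml",  # dataset description (train/val/test paths, 4 classes)
    epochs=100,              # 100 training cycles
    imgsz=640,               # 640 x 640 input resolution
    batch=32,                # 32 samples per batch
    optimizer="Adam",        # Adam optimizer
    lr0=0.01,                # initial learning rate
    cos_lr=True,             # cosine annealing learning-rate decay
)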
3.2 Evaluation indicators
To comprehensively evaluate the performance of the constructed model, this paper introduced multiple evaluation metrics to quantify both the model's effectiveness and its resource consumption in practical applications. The metrics employed were precision (P), recall (R), average precision (AP), mean average precision (mAP), the number of model parameters, floating-point operations (FLOPs), and detection frame rate (FPS). In addition, Intersection over Union (IoU), the ratio of the overlap area to the union area of the predicted and ground-truth boxes, quantifies the degree of match between a detection result and the real target.
P measures the proportion of correctly detected targets among all targets detected by the model and reflects the accuracy of the model in identifying positive-class objects. It is calculated as shown in Equation 2:
P = TP / (TP + FP)     (2)
where true positives (TP) denote the number of positive samples correctly recognized, false positives (FP) denote the number of negative samples misreported as positive, and false negatives (FN) denote the number of positive samples missed. N denotes the number of sample categories.
R represents the proportion of targets correctly detected by the model among all actual positive-class targets, revealing the model's ability to cover positive samples. It is calculated as shown in Equation 3:
R = TP / (TP + FN)     (3)
AP is the average of the precision values over different levels of recall and assesses the performance for a single category. It is calculated as shown in Equation 4:
AP = ∫₀¹ P(R) dR     (4)
mAP assesses the performance of multi-category object detection by averaging the AP values across all categories, and it effectively evaluates the accuracy of the model in detection tasks. mAP@0.5 and mAP@0.5:0.95 are two commonly used mAP metrics and were therefore selected for the evaluation in this paper. mAP@0.5 is the mAP calculated at an IoU threshold of 0.5, whereas mAP@0.5:0.95 is the average mAP calculated over multiple IoU thresholds (from 0.5 to 0.95 in steps of 0.05); the latter accounts for the model's performance under several different IoU thresholds and therefore provides a more comprehensive evaluation. In this paper, the whole growth process of the strawberry was divided into four stages, corresponding to four categories, and target detection and localization were performed for these four categories. mAP was employed as a crucial evaluation metric, with mAP@0.5:0.95 serving as the primary assessment criterion to comprehensively evaluate the performance of the enhanced detection model. It is calculated as shown in Equation 5:
mAP = (1/N) ∑ᵢ₌₁ᴺ APᵢ     (5)
In Equation 5, the AP of each category is calculated over all images at the set IoU value, and the mAP is obtained by averaging these AP values over all N categories.
The number of parameters, measured in millions (M), quantifies the size of the model and its memory consumption and is an important metric for evaluating the complexity of the model.
Giga floating-point operations (GFLOPs) measures the computational cost of the model, i.e., the number of floating-point operations (in billions) required to process an input.
FPS refers to the number of image frames the model can process per second and directly reflects the detection speed of the model (frames/s). The larger this value, the better.
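As a rough illustration of how the parameter count and FPS metrics can be measured (not the authors' evaluation script), the following PyTorch sketch counts trainable parameters in millions and times the forward pass on a 640 × 640 input; GFLOPs would additionally require a profiling tool and is omitted here.

import time
import torch

def count_parameters_m(model):
    """Total number of trainable parameters, in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, imgsz=640, runs=100):
    """Average number of frames processed per second on a single input."""
    model.eval()
    x = torch.randn(1, 3, imgsz, imgsz)
    for _ in range(10):      # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return runs / (time.perf_counter() - start)

# Example with a small stand-in model.
dummy = torch.nn.Conv2d(3, 16, 3, padding=1)
print(count_parameters_m(dummy), round(measure_fps(dummy), 1))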
3.3 Comparative experiments
To validate the enhanced performance of the proposed model, a comparative analysis was conducted between the constructed model and prevailing mainstream models. The results of this comparison experiment are presented in Table 1. The mAP@0.5:0.95 of the improved YOLOv8s model in this paper was 58.87, the highest value, and was significantly higher than those of the CenterNet and Rtmdet-tiny models. The results showed that the improved WCS-YOLOv8s model was the most effective for target identification and localization during the whole strawberry growth process. The FPS of the WCS-YOLOv8s, YOLOv8s, and YOLOv6s models were all higher than 100, significantly higher than those of the other models, indicating faster prediction speeds that meet the needs of automated online detection. The parameter counts of the YOLOv8n and YOLOv8s models were 3.15 M and 11.16 M, respectively; the difference between them was small, and their FPS values were close to each other. The mAP@0.5 and mAP@0.5:0.95 values of YOLOv8s were 85.59 and 58.87, respectively, the latter being the highest mAP@0.5:0.95 among the compared baseline models. Considering all these factors, YOLOv8s was selected as the best baseline model for further improvement.
3.4 Ablation experiments
In this paper, YOLOv8s was used as the baseline model, and the Warmup data enhancement method, the SE-MSDWA module, and the CGFM were added and fused to enhance model accuracy; four sets of experiments were set up to verify the feasibility of the optimization scheme. The findings are presented in Table 2.
The Warmup data enhancement strategy was first incorporated into the baseline model. While keeping the model structure unchanged, this strategy increased mAP@0.5, mAP@0.5:0.95, and recall by 0.96%, 1.02%, and 2.3%, respectively. Notably, the required computational effort remained unchanged compared with the baseline model. These findings suggest that integrating the Warmup data enhancement method effectively enhances the accuracy of the model.
After incorporating the SE-MSDWA module into the backbone network of the benchmark model, there was a notable enhancement in the performance metrics. Specifically, mAP@0.5 and mAP@0.5:0.95 improved by 0.86% and 0.59%, respectively, and precision increased by 1.1%, indicating a significant overall improvement in the precision metrics.
After incorporating the CGFM module into the neck structure of the benchmark model, recall improved by 0.1%, and mAP@0.5 and mAP@0.5:0.95 increased by 0.88% and 0.9%, respectively; however, precision decreased by 0.4% compared with the benchmark model. Since this study focused on multi-target detection and placed greater emphasis on the mAP@0.5:0.95 metric, the integration of the CGFM module further enhanced the effectiveness of the proposed model.
As shown in Table 1, the YOLOv8s benchmark model comprised 11.16 M parameters and achieved a frame rate of 102.3 FPS. In contrast, the WCS-YOLOv8s model proposed in this paper had 18.69 M parameters, an increase of 7.53 M. Its detection speed was 45.9 FPS (21.8 ms per image), which is sufficient for automated real-time detection applications. Moreover, the precision, recall, mAP@0.5, and mAP@0.5:0.95 of the WCS-YOLOv8s model were 83.4%, 86.7%, 87.53%, and 60.48%, respectively. Thus, the WCS-YOLOv8s model improved mAP@0.5, mAP@0.5:0.95, and recall by 1.94%, 1.61%, and 2.4%, respectively, and the detection accuracy was significantly improved. Achieving the best results on each index demonstrates the effectiveness of the improvements. The WCS-YOLOv8s model effectively reduced the omissions and misdetections of the baseline model in complex situations and improved the detection accuracy for target identification and localization throughout the whole strawberry growth process.
3.5 Detection effect of the WCS-YOLOv8s model
To enhance the evaluation of the effectiveness of the WCS-YOLOv8s model developed in this study, a comparative analysis was conducted with several current mainstream object detection models, including the YOLOv8s, CenterNet, Rtmdet-tiny, and YOLOv5s models. The results were visualized for clarity. Three images depicting strawberries in various scenes from the dataset constructed in this paper (Figures 5a–c) were analyzed to assess their recognition outcomes.
Figure 5 shows the recognition results of the original benchmark YOLOv8s, CenterNet, Rtmdet-tiny, YOLOv5s, and WCS-YOLOv8s models for the three scenarios. Each row of the figure shows the detection results for the same strawberry image in the YOLOv8s, CenterNet, Rtmdet-tiny, YOLOv5s, and WCS-YOLOv8s models, respectively. Comparing image a-5 with a-1, a-2, a-3, and a-4 shows that the YOLOv8s and YOLOv5s models misidentified a leaf in the upper right corner of the image as an immature strawberry, possibly because these models are weaker at detecting smaller strawberry targets. As shown in a-2 and b-2 of Figure 5, the CenterNet model failed to effectively recognize the strawberry bud and flower targets, and its detection effect was poor. Comparing a-5 and b-5 of Figure 5 with the other images shows that the WCS-YOLOv8s model detected the samples of all four growth stages better, with the best overall effect. As can be seen in image c-1, the YOLOv8s model detected a single ripe strawberry multiple times, incorrectly identifying one strawberry as several. As shown in c-1, c-2, c-3, and c-4, the YOLOv8s, CenterNet, Rtmdet-tiny, and YOLOv5s models failed to detect the ripe strawberries with upward-facing stalks, whereas the WCS-YOLOv8s model identified them accurately. Overall, the original YOLOv8s, CenterNet, Rtmdet-tiny, and YOLOv5s models had low accuracy when detecting small strawberries and occluded targets and were prone to omissions and false detections; among them, the CenterNet model performed worst, with the most errors and missed detections. WCS-YOLOv8s performed superiorly in small-target detection, edge detection, dense detection, and branch and leaf occlusion, with significantly fewer missed and false detections, while also improving detection confidence.
Grad-CAM (Gradient-weighted Class Activation Mapping) is a deep learning visualization technique for explaining the decision-making process of convolutional neural networks (CNNs). It makes the model's decision-making process more transparent by highlighting the image regions that the model considers most important in the image classification task, enhancing the model's interpretability. This visualization not only helps researchers identify erroneous or irrelevant features that the model may rely on but also provides guidance for model improvement. To explain the regions of interest used by the WCS-YOLOv8s model for target identification and localization during the whole strawberry growth process, this paper performed Grad-CAM heat map visualization for the baseline YOLOv8s model and the improved WCS-YOLOv8s model, in which the regions of interest identified by the model are visualized on the target by blue and red zones. The Grad-CAM heat map visualization is shown in Figure 6.
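For readers unfamiliar with the technique, the sketch below implements a generic, simplified Grad-CAM for a convolutional classifier using forward and backward hooks; it is illustrative only and is not the visualization code used for the YOLO models in this paper.

import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, target_layer, image, class_idx):
    """Weight the target layer's activations by the spatially averaged gradients
    of the chosen class score, then ReLU and normalize to obtain the heat map."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_input, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    score = model(image)[0, class_idx]
    score.backward()
    h1.remove()
    h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
    return cam.squeeze().detach()

# Example with an ImageNet-style classifier and a random input.
net = models.resnet18().eval()
img = torch.randn(1, 3, 224, 224)
heatmap = grad_cam(net, net.layer4[-1], img, class_idx=0)
print(heatmap.shape)  # torch.Size([224, 224])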
As shown in Figure 6, three original images (Figures 6a-1, b-1, c-1) were selected for heat map visualization. The first row (a-1, b-1, and c-1) contains the three original images; the second row (a-2, b-2, and c-2) shows the heat maps output by YOLOv8s; and the third row (a-3, b-3, and c-3) shows the heat maps output by WCS-YOLOv8s. Comparing a-3 and c-3 with a-2 and c-2, respectively, shows that the red areas cover more of the relevant image regions and more accurately cover small targets such as strawberry flower buds, indicating that the regions of interest are located more accurately when the improved WCS-YOLOv8s is used for target recognition. Comparing b-3 and c-3 with b-2 and c-2, respectively, shows that the red areas cover a larger area and accurately cover the strawberry targets to be detected, indicating higher detection accuracy and higher category confidence with WCS-YOLOv8s. The Grad-CAM heat map visualization thus confirms that recognition with the WCS-YOLOv8s model is better.
In this paper, the YOLO model detected the strawberry target in the 2D image and obtained its 2D coordinates (x, y); the depth of the strawberry was then calculated from the depth map, yielding the x, y, and z coordinates of the target relative to the binocular depth camera and thus achieving accurate recognition and localization of the strawberry's position. The target recognition and localization results of the improved model are shown in Figure 7. In Figure 7, each detected strawberry target is represented by a red rectangular box labeled with the target category, the confidence level, and the detection distance from the camera. Four strawberries are identified in Figure 7d: two unripe and two ripe. Taking the unripe strawberry identified at the top of the image as an example, the improved WCS-YOLOv8s model identified this target as an unripe strawberry with a probability of 0.82, at a distance of 21.93 cm from the camera. Figures 7a, c, f demonstrate the detection performance of the improved algorithm in complex scenes containing multiple targets and small targets, while Figures 7b, d, e show its performance in simple scenes, demonstrating that the model accurately detected the targets. In summary, WCS-YOLOv8s performed well in various scenarios, proving the effectiveness of the model. The model can provide comprehensive intelligent target recognition and can form the basis for robotic automated picking, which requires high recognition accuracy and fast recognition speed. The model constructed in this paper can provide reliable technical support for orchard strawberry yield prediction and automated intelligent picking.
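A minimal sketch of this 2D-to-3D localization step with the pyrealsense2 SDK is shown below; the stream settings are illustrative, and the pixel coordinates (u, v) are assumed to be the center of a bounding box predicted by the detector rather than values from the paper.

import pyrealsense2 as rs

# Start aligned color and depth streams on the RealSense camera (illustrative settings).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align the depth map to the color image

try:
    frames = align.process(pipeline.wait_for_frames())
    depth_frame = frames.get_depth_frame()

    # (u, v) would be the center of the bounding box predicted by the detector.
    u, v = 320, 240
    depth_m = depth_frame.get_distance(u, v)  # depth in meters at that pixel

    # Deproject the pixel to a 3D point (x, y, z) in the camera coordinate frame.
    intr = depth_frame.profile.as_video_stream_profile().get_intrinsics()
    x, y, z = rs.rs2_deproject_pixel_to_point(intr, [u, v], depth_m)
    print("target at ({:.3f}, {:.3f}, {:.3f}) m, {:.1f} cm from the camera".format(
        x, y, z, depth_m * 100))
finally:
    pipeline.stop()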
4 Conclusion
To address the low efficiency, high labor intensity, time consumption, and elevated costs associated with the manual identification, localization, and supervision of strawberries, this paper proposed an enhanced model based on the YOLOv8s framework, the WCS-YOLOv8s model. This model was employed to carry out strawberry target identification and localization effectively while facilitating comprehensive supervision throughout the entire growth process of strawberries. The Warmup data enhancement strategy was adopted to provide a stable convergence direction in the early stage of training, which effectively avoided model oscillation and improved the robustness of the model in complex scenes. The CGFM module was introduced to fuse different information through the multi-head self-attention mechanism, which significantly improved the recognition accuracy of the model in complex scenarios involving multiple targets, small targets, and occlusion, and can provide a reliable method for fruit target recognition and detection in such scenarios. The developed SE-MSDWA module effectively integrates depthwise separable convolution, multi-scale convolution, and the SE module, enhancing sample feature extraction and thereby improving both the feature extraction efficiency and the overall performance of the convolutional neural network. The precision, recall, mAP@0.5, and mAP@0.5:0.95 of the WCS-YOLOv8s model were 83.4%, 86.7%, 87.53%, and 60.48%, respectively, with a detection speed of 45.9 FPS. Compared with the baseline YOLOv8s model, mAP@0.5 and mAP@0.5:0.95 improved by 1.94% and 1.61%, respectively, indicating a significant enhancement in detection accuracy. The WCS-YOLOv8s model established in this paper provides a reliable new method of target identification and localization for automated management, picking, and quality enhancement throughout the strawberry growth process.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
SG: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Validation, Visualization, Writing – review & editing. GC: Writing – review & editing, Funding acquisition, Investigation, Methodology, Visualization. QW: Funding acquisition, Writing – review & editing, Project administration, Supervision.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. The authors gratefully acknowledge the National Natural Science Foundation of China (No: and No: ), the Natural Science Foundation of Shandong Province (No. ZRQC114), and the Henan Provincial Scientific and Technological Research and Development Programme (No. ) for their support of this study.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References