Leveraging Image Generators to Address Training Data Scarcity:
The Gen4Regen Dataset for Forest Regeneration Mapping

Gabriel Jeanson1,2, David-Alexandre Duclos1,2, William Larrivée-Hardy1,2, Noé Cochet1,2, Matěj Boxan1,2, Anthony Deschênes2, François Pomerleau1,2, Philippe Giguère1,2
1 Northern Robotics Laboratory, Université Laval, Québec, QC, G1V 0A6, Canada
2 Département d’informatique et de génie logiciel, Université Laval, Québec, QC, G1V 0A6, Canada
gabriel.jeanson@norlab.ulaval.ca | philippe.giguere@ift.ulaval.ca
Gen4Regen teaser image
Overview of our model training pipeline, combining 199 manually labelled real images, 25 106 pseudo-labelled real images, and 2101 labelled synthetic images. The pseudo-labelled real images are equivalent to over 550 000 image crops of 1 MP. The top left (purple outline) represents our primary contribution, the Gen4Regen dataset, which demonstrates that image generators can now produce photorealistic images and corresponding annotations as viable training data for semantic segmentation. For this, we leverage the Nano Banana Pro model to produce 2101 synthetic drone-view images and masks of boreal forest regeneration. This component is key to addressing the scarcity of labelled data and class imbalance.
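
The exact pipeline configuration is described in the paper; as a rough illustration only, the PyTorch sketch below shows one plausible way to merge the three data sources (manually labelled, pseudo-labelled, and synthetic image/mask pairs) into a single weighted training loader. All folder names, file layouts, and sampling weights are hypothetical placeholders, not the settings used in our experiments.

# Minimal sketch (not the authors' code): combine expert-labelled, pseudo-labelled,
# and synthetic image/mask pairs into one segmentation training loader.
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler
from torchvision.transforms.functional import pil_to_tensor


class ImageMaskFolder(Dataset):
    """Loads (image, mask) pairs stored as <name>.jpg / <name>_mask.png in one folder."""

    def __init__(self, root: str):
        self.images = sorted(Path(root).glob("*.jpg"))

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int):
        img_path = self.images[idx]
        mask_path = img_path.with_name(img_path.stem + "_mask.png")
        image = pil_to_tensor(Image.open(img_path).convert("RGB")).float() / 255.0
        mask = pil_to_tensor(Image.open(mask_path)).long().squeeze(0)
        return image, mask


manual = ImageMaskFolder("data/manual")        # expert-labelled real images
pseudo = ImageMaskFolder("data/pseudo")        # pseudo-labelled real images
synthetic = ImageMaskFolder("data/synthetic")  # generated image/mask pairs

dataset = ConcatDataset([manual, pseudo, synthetic])

# Upweight the scarce expert labels and the synthetic minority-class examples
# so the abundant pseudo-labels do not dominate each batch (ratios are illustrative).
weights = torch.cat([
    torch.full((len(manual),), 10.0),
    torch.full((len(pseudo),), 1.0),
    torch.full((len(synthetic),), 5.0),
])
sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=4)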

Abstract

Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation of high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world and AI-generated data, showing that the two are highly complementary: unified training yields an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, with per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable.
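
For reference, the per-species F1 scores quoted above can be computed per class from integer-labelled segmentation masks. The short sketch below shows one standard way to do so; the class count and random inputs are illustrative and this is not our evaluation code.

# Minimal sketch of a per-class (per-species) F1 metric from segmentation masks.
import numpy as np


def per_class_f1(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """Return one F1 score per class, given integer-labelled prediction and target masks."""
    f1 = np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom > 0 else 0.0
    return f1


# Example with random masks over 5 hypothetical classes; a 30 %pt gain would mean,
# e.g., a species' F1 moving from 0.25 to 0.55.
pred = np.random.randint(0, 5, size=(512, 512))
target = np.random.randint(0, 5, size=(512, 512))
print(per_class_f1(pred, target, num_classes=5))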

BibTeX

@misc{jeanson2026_gen4regen,
    title = {Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping},
    author = {Jeanson, Gabriel and Duclos, David-Alexandre and Larriv\'{e}e-Hardy, William and Cochet, No\'{e} and Boxan, Mat\v{e}j and Desch\^{e}nes, Anthony and Pomerleau, Fran\c{c}ois and Gigu\`{e}re, Philippe},
    year = {2026},
    eprint = {2605.05627},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV},
    url = {https://arxiv.org/abs/2605.05627},
}