Bioprocess Simulation Dataset Generator
An intelligent synthetic dataset generator that produces realistic bioprocess testing data for developing calculation scripts within bioprocess workflows. Built for scientific simulation, regulatory testing, and model robustness under edge-case scenarios.
Technologies
Overview
Designed for Qubicon’s simulation engine, the AI Dataset Generator produces high-fidelity datasets based on real-world bioprocess parameters, setpoints, and expected sensor fluctuations. The tool uses Meta's Llama 3 to generate domain-specific data with embedded logic, enabling simulation of rare events, dynamic shifts, and environmental variations at scale within a bioprocess environment.
Challenge
Producing targeted bioprocess data from real runs is slow and expensive, and often infeasible. Teams at Qubicon used to build datasets by hand for the simulation engine in order to test the calculation scripts they write for bioprocess workflows. Those scripts often contain complex logic, edge cases, and dynamic behaviors, all of which must be exercised against bioprocess simulation data. Building the datasets manually was tedious and costly for the company.
Solution
I proposed and engineered an AI-driven generation pipeline that combines prompt-engineered LLMs with custom simulation logic to create structured, realistic simulation datasets reflecting the specific edge cases a test requires. It works with CSV templates compatible with Qubicon’s platform and models dynamic behaviors including phase transitions, setpoint adjustments, and failure simulations.
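To make the simulation-logic side of this concrete, here is a minimal sketch in Python. It is not the actual pipeline: it leaves out the LLM step entirely, and the names (Phase, simulate, to_csv), the column names, the noise levels, and the "frozen pH sensor" fault are all assumptions chosen for illustration, not Qubicon's CSV template. It shows readings fluctuating around per-phase setpoints, a phase transition that changes those setpoints, and an injected sensor failure over a chosen time window.

```python
import csv
import io
import random
from dataclasses import dataclass


@dataclass
class Phase:
    """Hypothetical process phase with its own setpoints."""
    name: str
    duration_min: int
    ph_setpoint: float
    temp_setpoint: float


def simulate(phases, fault_window=None, seed=0):
    """Return one row per minute across all phases.

    fault_window=(start, end) freezes the pH reading for start <= t < end,
    imitating a stuck sensor (an assumed failure mode for illustration).
    """
    rng = random.Random(seed)  # seeded so a test dataset is reproducible
    rows = []
    t = 0
    stuck_ph = None
    for phase in phases:
        for _ in range(phase.duration_min):
            # Gaussian noise around the setpoint stands in for sensor fluctuation.
            ph = round(phase.ph_setpoint + rng.gauss(0, 0.02), 3)
            temp = round(phase.temp_setpoint + rng.gauss(0, 0.1), 2)
            in_fault = bool(fault_window) and fault_window[0] <= t < fault_window[1]
            if in_fault:
                if stuck_ph is None:
                    stuck_ph = ph  # remember the first value in the window
                ph = stuck_ph      # and keep repeating it
            else:
                stuck_ph = None
            rows.append({
                "time_min": t,
                "phase": phase.name,
                "ph": ph,
                "temp_c": temp,
                "fault": int(in_fault),
            })
            t += 1
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; the column layout is an assumption."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["time_min", "phase", "ph", "temp_c", "fault"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Example: a batch phase followed by a fed-batch phase with lower setpoints,
# and a stuck pH sensor between minutes 50 and 70.
example_phases = [
    Phase("batch", 60, ph_setpoint=7.0, temp_setpoint=37.0),
    Phase("fed_batch", 120, ph_setpoint=6.8, temp_setpoint=36.5),
]
example_csv = to_csv(simulate(example_phases, fault_window=(50, 70)))
```

Seeding the random generator keeps each generated dataset reproducible, which matters when a script under test fails and the same edge case has to be replayed. In a pipeline like the one described, an LLM could plausibly supply the phase definitions and fault windows from a natural-language description of the test, but that wiring is not shown here.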
Results
Reduced dataset development time by over 90%, making script testing faster and cheaper for the company. Enabled teams to test feeding algorithms and control models against highly realistic edge-case scenarios. Supported rapid prototyping and model validation across multiple process types.
Project Gallery

The generator interface showing parameter controls and output visualization

Example of inputting a variable in the generator (here, a script we want to test with the simulation and therefore need data for)

Example of inputting an error in the generator that should be reflected in the dataset for testing