
Jeff Bryner .:. blog

Jeff's blog on infosec and other topics

13 March 2026

Gemini Live Agent Challenge: StepPrep

by Jeff

Google recently posted a hackathon challenge based on Gemini’s ability to use live audio, video and text. My sister is an audio engineer and voice artist, and I thought it would be a good opportunity to collaborate with her on a project; I’ve always wanted to build something using Gemini’s multimedia capabilities.

We brainstormed a bit and settled on the creative storyteller category. The interactive storybook example made her think of the storyboards she makes for her son, who has autism. Storyboards are useful tools to help children anticipate new events that you or I may take for granted, like going to the movies, getting a haircut, shopping at the store, or going to the doctor. They give child and parent a chance to map out what will happen, anticipate trouble spots and highlights, and ease any fears ahead of a new experience.

Gemini’s live, multimedia capabilities seemed like they could help with the primary challenge of storyboards: they are time- and labor-intensive to produce. Instead of spending hours gathering materials and images and arranging them, could Gemini make this quick, fun and interactive?

You can see the finished product we christened “StepPrep” in this short video my sister made for the project.

Screenshots

Iterating via live conversation

An example StepPrep Storyboard

Here’s a look behind the scenes at the experience of creating it and the tech behind it. You can find the code in this github repo.

Multimodal User Experience: The “Beyond Text” Factor

StepPrep sidesteps the “text box” AI chatbot paradigm by offering a genuinely multimodal, voice-first experience that feels natural and intuitive. Using the Gemini Live API over raw WebSockets, the application establishes a real-time, bi-directional audio connection between the user and the AI. Instead of typing out complex descriptions of an event and a child’s needs, parents can simply talk naturally as they would to a therapist or planner. The AI listens, prompts for key callouts, from sensory sensitivities to specific words to avoid, and interactively collaborates to build a storyboard. By managing audio chunking and playback via custom Web Audio API worklets directly in a React progressive web application, the system provides an immersive, zero-friction conversational flow that effectively “sees, hears, and speaks” to the parent, bypassing the cognitive load of a standard chat UI entirely.
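For a sense of what talking to the Live API over a raw WebSocket involves, here's a rough sketch of the initial "setup" frame a client sends to select a model and ask for audio responses. The model id and exact field shapes here are assumptions for illustration; the current Live API docs are the authority.

```python
import json

# Hypothetical sketch of the first frame sent over the Live API WebSocket.
# The model id and field names are illustrative, not the project's actual
# configuration; check the current Gemini Live API docs before relying on them.
setup_message = {
    "setup": {
        "model": "models/gemini-2.0-flash-live-001",  # illustrative model id
        "generation_config": {
            "response_modalities": ["AUDIO"],  # ask for spoken replies
        },
        "system_instruction": {
            "parts": [{"text": "You help parents build visual storyboards."}]
        },
    }
}

# The frame is serialized to JSON and sent as the first WebSocket message.
payload = json.dumps(setup_message)
```

After this handshake, the client streams PCM audio chunks up and receives audio (and tool calls) back over the same connection.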

I was surprised at how useful Gemini’s ability to handle interruptions is. Rather than waiting for the model to finish a long sentence, you can jump in as you would in a natural conversation, which is especially handy when the model has misheard a word and needs a course correction.

Workflow: A Coherent Multimodal Journey

The architecture of StepPrep aims to weave text, audio, and visual generation into a unified workflow. As the parent speaks to the system via native web browser PCM audio streams, the Gemini Live API dynamically translates those intents into structured JSON representing storyboard steps (`step_title`, `description`, and `image_prompt`). The backend intercepts these function calls and immediately synchronizes the state to the frontend UI via WebSockets, instantly materializing spoken ideas into readable text.
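One way to get the model to emit that structured JSON is to describe the storyboard as a function the agent can call. The three field names below come from the post; everything else (the declaration shape, descriptions) is an illustrative sketch of a typical Gemini function declaration, not necessarily the project's exact schema.

```python
# Hypothetical tool declaration for the storyboard function call.
# `step_title`, `description`, and `image_prompt` are from the post;
# the surrounding schema is illustrative.
generate_storyboard_decl = {
    "name": "generate_storyboard",
    "description": "Create a step-by-step visual storyboard for an upcoming event.",
    "parameters": {
        "type": "object",
        "properties": {
            "steps": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "step_title": {"type": "string"},
                        "description": {"type": "string"},
                        "image_prompt": {"type": "string"},
                    },
                    "required": ["step_title", "description", "image_prompt"],
                },
            }
        },
        "required": ["steps"],
    },
}
```

Declaring the schema up front means the model returns machine-checkable JSON instead of free text, which is what lets the backend validate and relay each step to the UI.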

Having this real time live feedback over audio gives the parent and the AI the ability to iterate on the storyboard to add, remove or change steps as needed. It feels very natural to have a discussion about the initial draft and revise as needed with instant visual feedback.

Once the parent approves the drafted steps, the system triggers the famous ‘nano banana’ Gemini model to render tailored visual panels covering the steps in a variety of styles, from pencil sketch to superhero. The final product is a 3x2 interactive grid that can be used to discuss the upcoming event and mark progress as the steps are completed. Each StepPrep board has a permanent URL that can be bookmarked and revisited for future re-use.

Technical Implementation: Google Cloud Native

StepPrep leverages Google Cloud’s serverless architecture and the Google GenAI SDK. The system separates concerns by deploying a Python FastAPI backend and a React frontend as distinct services on Google Cloud Run for autoscaling and observability. The backend relies on the GenAI SDK to interface directly with Vertex AI for both the low-latency Gemini Live API streams and asynchronous image generation tasks. Application state and generated steps are synchronized in real-time across devices using Firebase Firestore, while the generated image panels are securely persisted to Google Cloud Storage. An automated CI/CD pipeline built on Google Cloud Build and Terraform handles everything from initial GCP project bootstrapping to zero-downtime Cloud Run deployments.

Robustness: Grounding and Preventing Hallucinations

To ensure StepPrep remains a reliable tool, the system implements strict boundary controls. The underlying system prompt explicitly constrains the assistant’s persona, mandating a focus solely on creating step-by-step visual guides and explicitly barring unrelated topics or general advice. The prompt rigidly instructs the agent on how to manage the interaction—forcing it to ask clarifying questions about triggers or words to avoid before making assumptions. At the audio layer, the Web Audio pipeline is configured to distinguish human speech from background noise, preventing hallucinated responses to ambient sounds or the agent’s own voice. By chaining these rigid system instructions with backend schema validation, the application successfully prevents off-topic digressions and guarantees structurally sound, safe storyboards. Interesting future work would be to try Google’s Model Armor or other guardrail systems against voice input.
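A boundary-setting system prompt along these lines gives the flavor; this is an invented fragment for illustration, not the project's actual prompt.

```python
# Illustrative system-prompt fragment (NOT the project's real prompt):
# it pins the persona, bars off-topic chat, and forces clarifying questions.
SYSTEM_PROMPT = """You are StepPrep, a voice assistant that helps parents build
step-by-step visual storyboards to prepare a child for an upcoming event.
- Only discuss creating and revising storyboards; politely decline other topics.
- Before drafting, ask about sensory sensitivities and any words to avoid.
- Never offer medical or therapeutic advice."""
```

Keeping the constraints in the prompt itself means every turn of the live conversation is grounded by the same guardrails, regardless of where the user tries to steer it.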

Development Challenges and Solutions

Building a real-time multimodal application introduced several nuanced engineering hurdles. An unexpected one was managing audio sample rates: browsers typically default to 44.1kHz or 48kHz, whereas the Gemini Live API strictly expects 16kHz input and outputs 24kHz audio. Syncing these required a bit of Web Audio math, but it prevented glitching during the conversation and helps conserve tokens used by the LLM.
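The conversion math is simple enough to sketch. The real project does this in a Web Audio worklet in JavaScript; the Python version below just illustrates the idea of resampling 48kHz browser capture down to the 16kHz the Live API expects, using linear interpolation (a production pipeline would low-pass filter first to avoid aliasing).

```python
def downsample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample PCM samples from src_rate to dst_rate by linear interpolation.

    Illustrative sketch only: a real pipeline low-pass filters before
    downsampling to prevent aliasing artifacts.
    """
    ratio = src_rate / dst_rate          # e.g. 48000 / 16000 = 3.0
    out_len = int(len(samples) / ratio)  # output is 1/3 the length
    out = []
    for i in range(out_len):
        pos = i * ratio                  # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Sending a third of the samples is also where the token savings come from: less audio data per second of speech means fewer tokens consumed by the model.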

A significant hurdle involved coordinating Gemini’s backend asynchronous function calling with the frontend UI. Agents are happy to call functions, but in this case we had to intercept the function calls on the backend, validate them, and push the payload over a secondary WebSocket channel so the user could review the steps, all while keeping the Gemini Live API session alive for subsequent verbal revisions.

When the Gemini Live API attempts to invoke the `generate_storyboard` function, the FastAPI backend acts as an authoritative middleware layer. It intercepts the call and strictly validates the schema—ensuring exactly six steps are provided and that every step contains the mandatory `step_title`, `description`, and `image_prompt`. If Gemini hallucinates a malformed structure or returns an incorrect number of steps, the backend rejects the function call and feeds the validation error directly back to the agent for self-correction. Gemini is great at recognizing errors and self-correcting when given feedback, and this validation step greatly helped maintain the flow of the conversation.
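The validation step described above can be sketched as a small function that returns either success or an error message to hand back to the agent. The six-step count and the three required fields come from the post; the function itself is a simplified stand-in for the project's actual backend check.

```python
REQUIRED_KEYS = ("step_title", "description", "image_prompt")
EXPECTED_STEPS = 6

def validate_storyboard(steps: list[dict]) -> tuple[bool, str]:
    """Return (ok, error). On failure, the error string is fed back to the
    agent so it can self-correct and retry the function call."""
    if len(steps) != EXPECTED_STEPS:
        return False, f"Expected {EXPECTED_STEPS} steps, got {len(steps)}."
    for i, step in enumerate(steps):
        missing = [k for k in REQUIRED_KEYS if not step.get(k)]
        if missing:
            return False, f"Step {i + 1} is missing: {', '.join(missing)}."
    return True, ""
```

Returning the error as plain text rather than raising keeps the conversation loop intact: the message goes back into the session as the function result, and the agent revises its output.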

Lastly, the dreaded 429 resource-exhaustion error is real. Quotas, per-minute request limits, and differences between the Vertex AI and API-key-issued genai clients can cause confusion. For example, instantiating a genai client with the API key option does not allow you to specify certain parameters supported via Vertex AI:
`Error: person_generation parameter is not supported in Gemini API.`
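The usual answer to 429s is retrying with exponential backoff. Here's a generic sketch, not tied to any particular SDK; it assumes the wrapped call raises an exception whose message mentions "429", which is an assumption you'd adapt to the exception types your client actually raises.

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry fn on 429-style resource exhaustion with exponential backoff
    plus jitter. Illustrative sketch; match the exception check to your
    SDK's real error types rather than string matching in production."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise non-quota errors, or give up after the last attempt.
            if "429" not in str(exc) or attempt == max_attempts - 1:
                raise
            # Double the wait each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrapping the image-generation calls this way smooths over brief quota spikes without surfacing errors to the user mid-conversation.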

More on this is in the project notes if you’d like the gory details.

Conclusion

This was a fun opportunity to work with my talented sister and take new tech for a spin. Bonus points if the tech is actually useful to her and others in daily life!
I’ll certainly consider the Gemini Live API for future AI projects as it offers a unique, very natural interaction you don’t get from just command lines or text boxes.

tags: gemini - google cloud - AI - audio