Context
At Mecene.ai, the goal was to turn long-form podcast episodes into short clips that were actually worth publishing. The hard part was not only cutting video. It was choosing segments with enough value, detecting who was speaking, and producing something that felt edited instead of obviously automated.
What I built
I worked on the end-to-end system that takes a podcast video and turns it into multiple short clips.
- A dataset pipeline built from YouTube content to support better clip selection and model behavior.
- Fine-tuning of ChatGPT and Gemini models to improve how the system identifies interesting moments.
- Automatic speaker detection and framing logic so the final video keeps the active speaker centered.
- A distributed processing pipeline on Runpod so GPU-heavy jobs could be split into stages and scaled more predictably.
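To make the moment-selection step concrete, here is a minimal sketch of how transcript segments might be scored and filtered into clip candidates. The `Segment` shape, the duration bounds, and the scoring function are all illustrative assumptions; in the real system the score would come from the fine-tuned model rather than the stub keyword heuristic used here so the sketch stays runnable.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds into the episode
    end: float
    text: str

def score_segment(seg: Segment) -> float:
    """Stand-in for the model call that rates how interesting a moment is.

    Purely a placeholder heuristic so this sketch runs without a model:
    fraction of words that hit a tiny "hook" vocabulary.
    """
    keywords = {"surprising", "mistake", "secret", "never"}
    words = {w.strip(".,!?").lower() for w in seg.text.split()}
    return len(words & keywords) / max(len(words), 1)

def select_clips(segments, min_len=15.0, max_len=90.0, threshold=0.05):
    """Keep segments that are clip-sized and score above threshold,
    highest-scoring first."""
    picked = [
        seg for seg in segments
        if min_len <= seg.end - seg.start <= max_len
        and score_segment(seg) >= threshold
    ]
    return sorted(picked, key=score_segment, reverse=True)
```

The duration bounds matter as much as the score: a great moment that is too short to stand alone, or too long for short-form platforms, is rejected before scoring even applies.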
Why it was interesting
This project sits at the intersection of product and research. A technically correct output was not enough: the clips had to be engaging, correctly framed, and useful for distribution on social platforms.
That meant balancing model quality, infrastructure cost, and product speed at the same time.
Technical focus
- Python services for orchestration and processing
- PyTorch for model work
- OpenCV for video manipulation and framing
- Whisper for transcription and speech-related processing
- Docker and Runpod for deployment and GPU execution
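The framing logic mentioned above comes down to geometry: once face detection (OpenCV in this stack) gives the active speaker's position, the renderer needs a crop window that centers them while producing a vertical output. The function below is a sketch of that math under assumed conventions (a 9:16 target, face position given as a center point); the name and signature are hypothetical, not the production API.

```python
def crop_window(frame_w, frame_h, face_cx, face_cy, aspect=9 / 16):
    """Compute a crop rectangle (x, y, w, h) that keeps the active
    speaker centered while fitting a vertical output inside the frame."""
    # Tallest possible crop is full frame height; width follows the aspect.
    crop_h = frame_h
    crop_w = int(crop_h * aspect)
    if crop_w > frame_w:  # very narrow source: pin to full width instead
        crop_w = frame_w
        crop_h = int(crop_w / aspect)
    # Center on the face, then clamp so the window stays inside the frame.
    x = min(max(face_cx - crop_w // 2, 0), frame_w - crop_w)
    y = min(max(face_cy - crop_h // 2, 0), frame_h - crop_h)
    return x, y, crop_w, crop_h
```

The clamping step is what makes the result feel edited rather than automated: when the speaker sits near the edge of the frame, the window stops at the boundary instead of padding with black bars.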
Engineering decisions
One of the main design decisions was to treat the system as a pipeline instead of one monolithic job. That made it easier to isolate GPU-heavy work, improve reliability, and evolve individual stages without rebuilding the whole flow.
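The pipeline-over-monolith decision can be sketched as stages that each read and write explicit artifacts, so a failed or GPU-heavy stage can be retried or moved to a separate worker without rerunning everything before it. The stage names and payloads below are illustrative, not the production schema.

```python
# Each stage is a function from the accumulated artifacts to one new
# artifact. Because completed stages are recorded, a crashed job can
# resume from the failed stage instead of restarting the whole flow.

PIPELINE = ["transcribe", "select", "frame", "render"]

def run_pipeline(stages, job):
    """Run stages in order, skipping any already marked done."""
    for name in stages:
        if name in job["done"]:
            continue  # completed on a previous attempt
        job["artifacts"][name] = STAGES[name](job["artifacts"])
        job["done"].add(name)
    return job

# Stub stage bodies so the sketch is runnable; the real stages would
# call Whisper, the selection model, OpenCV framing, and the renderer.
STAGES = {
    "transcribe": lambda a: "transcript.json",
    "select": lambda a: "clips.json",
    "frame": lambda a: "crops.json",
    "render": lambda a: "final.mp4",
}
```

Keeping the stage boundary at an artifact (a file or object-store key) rather than in-memory state is also what lets the GPU stages scale independently on Runpod.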
The other important constraint was quality. A clip that was technically valid but boring or badly framed was still a failure, so model behavior and presentation quality had to be optimized together.