Dopamine Audiobook

Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Immersive Audiobook Generation

[Paper] [Code]

Yan Rong, Shan Yang, Chenxing Li, Dong Yu, Li Liu

Abstract: Audiobook generation aims to create rich, immersive listening experiences from multimodal inputs, but current approaches face three critical challenges: (1) the lack of synergistic generation of diverse audio types (e.g., speech, sound effects, and music) with precise temporal and semantic alignment; (2) the difficulty in conveying expressive, fine-grained emotions, which often results in machine-like vocal outputs; and (3) the absence of automated evaluation frameworks that align with human preferences for complex and diverse audio. To address these issues, we propose Dopamine Audiobook, a novel unified training-free multi-agent system, where a multimodal large language model (MLLM) serves two specialized roles (i.e., speech designer and audio designer) for emotional, human-like, and immersive audiobook generation and evaluation. Specifically, we firstly propose a flow-based, context-aware framework for diverse audio generation with word-level semantic and temporal alignment. To enhance expressiveness, we then design word-level paralinguistic augmentation, utterance-level prosody retrieval, and adaptive TTS model selection. Finally, for evaluation, we introduce a novel MLLM-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Moreover, we build MIA-Bench, the first benchmark for multimodality immersive audiobook generation, comprising 300 data pairs with rich annotations. Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance on multiple metrics, while our evaluation framework demonstrates superior alignment with human preferences.

Generation Framework

Evaluation Framework

MagicEmo Prompts

Demo

Image-Input Audiobook Generation
Text-Input Audiobook Generation
Multiple-Images-Input Audiobook Generation

Image-Input Audiobook Generation

Results of Dopamine Audiobook

Input Image	Ours (w/o audio designer)	Ours

Part of the Spotlight

Input	Content	Ours	Ours (w/o AD)	F5-TTS	FireRedTTS	MaskGCT
	Evacuation in progress. All emergency routes are open. Remain calm and cooperate with authorities.
	Faster, Otis, faster! That rope swing was fun, but my stomach is rumbling for a morning donut!
	What is it, Milo? Is the shop closed? Did they run out of the sprinkled kind?
	Swinging high, Milo paused, his wide eyes fixed on a new sight down the street. A small, tearful girl sat alone on a bench, a broken toy car in her hands.
	艾拉从抽屉里拿出了一支蜡烛和一盒火柴。在漆黑中，她小心翼翼地划亮了一根火柴。
	希望？就像这个小小的光吗？
	对，莉莉。只要有它在，我们就不怕黑。今晚，我们就让这个小小的光，陪我们一起讲故事，好不好？
	突如其来的黑暗吞噬了整个房间，窗外狂风呼啸，暴雨倾盆。

Text-Input Audiobook Generation

Results of Dopamine Audiobook

Input Text	Ours (w/o audio designer)	Ours
李白有感而发，作诗《将进酒》 Li Bai, a Chinese poet of the Tang era, was inspired to write the poem "Qiang Jin Jiu" (Bring in the Wine).
一位母亲在孩子睡前读诗《将进酒》 Before putting her child to sleep, a mother reads them the classic Chinese poem "Qiang Jin Jiu."
The scene is summer community BBQ. Community members organize a summer barbecue, bringing people together for grilled food, music, and outdoor fun, fostering a sense of unity and camaraderie.
The scene is Dorm Room on a Friday Evening. The dimly lit dorm room is buzzing with excitement as the college students gather to plan their weekend activities.
社交媒体对人际关系的影响的节目秀 A Show on Social Media's Impact on Relationships
Generate an introduction to quantum mechanics by a professor.

Part of the Spotlight

Input	Content	Ours	Ours (w/o AD)	F5-TTS	FireRedTTS	MaskGCT
李白有感而发，作诗《将进酒》	天生我材必有用，千金散尽还复来。
李白有感而发，作诗《将进酒》	烹羊宰牛且为乐，会须一饮三百杯。
李白有感而发，作诗《将进酒》	五花马，千金裘，呼儿将出换美酒，与尔同销万古愁！
一位母亲在孩子睡前读诗《将进酒》	人生得意须尽欢，莫使金樽空对月。
一位母亲在孩子睡前读诗《将进酒》	天生我材必有用，千金散尽还复来。
一位母亲在孩子睡前读诗《将进酒》	岑夫子，丹丘生，将进酒，杯莫停。

Multiple-Images-Input Audiobook Generation

Results of Dopamine Audiobook

Input Image Sequences

Ours (w/o audio designer)	Ours

Part of the Spotlight

Input	Content	Ours	Ours (w/o AD)	F5-TTS	FireRedTTS	MaskGCT
	The sun bathed the park in a warm, golden light as birds chirped a cheerful tune from the treetops. A gentle breeze rustled through the leaves, creating a perfect picture of tranquility.
	Bwaah... must... unclog... the sun...
	It's obvious! He's lost his favorite plunger! Quick, to the rescue!
	That fly has bwaah'd its last bwaah. This is no longer about a nap. This is about vengeance.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. If you have any concerns, please contact us.