Irodori-TTS-500M-v2-VoiceDesign


Irodori-TTS-500M-v2-VoiceDesign is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. Derived from the base v2 model, this variant replaces the reference latent encoder with a caption encoder.

Instead of requiring reference audio for voice cloning, this model features Voice Design: the speaker's voice, emotion, and speaking style are fully controlled and generated solely through a descriptive text prompt (caption).

Additionally, the model retains the emoji-based style and sound effect control: by inserting specific emojis into the input text, you can further fine-tune speaking styles, emotions, and sound effects during generation.

🌟 Key Features

  • Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
  • Voice Design (Caption Conditioning): Generate diverse voices by describing the speaker's tone, age, gender, and emotion in a text caption. No reference audio is required.
  • Emoji-based Style Control: Embed emojis directly in the input text for granular control over the delivery. See EMOJI_ANNOTATIONS.md for the full list of supported emojis.

๐Ÿ—๏ธ Architecture

The model (approximately 500M parameters) consists of three main components:

  1. Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
  2. Caption Encoder: Encodes the style-control text (captions) to define the speaker and acoustic environment, bypassing the need for a reference audio branch.
  3. Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs.

Audio is represented as continuous latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec (32-dim), enabling high-quality 48kHz waveform reconstruction.
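To make the generation process concrete, the sketch below shows what rectified-flow sampling over these latents typically looks like. It is a minimal, schematic illustration assuming a noise-to-data integration from t=0 to t=1 with a plain Euler solver; the function signature and conditioning keywords are hypothetical, not this repository's actual API.

```python
import torch

@torch.no_grad()
def sample_latents(dit, text_emb, caption_emb, num_frames, steps=32, latent_dim=32):
    """Euler integration of the learned velocity field from noise (t=0) to data (t=1).

    `dit` stands in for the joint-attention DiT; its call signature here is
    an illustrative assumption, not the repository's real interface.
    """
    x = torch.randn(1, num_frames, latent_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)             # current flow time in [0, 1)
        # The DiT predicts the velocity dx/dt, conditioned on the text
        # sequence, the style caption, and the timestep (via Low-Rank AdaLN).
        v = dit(x, t, text=text_emb, caption=caption_emb)
        x = x + dt * v                           # Euler step toward the data
    return x                                     # (1, num_frames, latent_dim) DACVAE latents
```

The returned latents would then be passed to the Semantic-DACVAE decoder to reconstruct the 48kHz waveform.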

🎧 Audio Samples

Voice Design via Captions

Examples of controlling the speaker's voice, emotion, and style purely through descriptive text captions. Notice how the same input text can be delivered in entirely different ways.

Each example below pairs an input text with a voice-design caption (English translations in parentheses).

Example 1
Text (Input): 明日の午後に予定していた会議だけど、急遽来週に延期になったみたい。悪いんだけど、資料の準備は一旦ストップしておいてくれるかな。
("The meeting planned for tomorrow afternoon has apparently been postponed to next week on short notice. Sorry, but could you put the preparation of the materials on hold for now?")
Caption (Voice Design): 低い声の女性が、苛立ちを隠せない様子で焦って話している。クリアな音質、少し感情的なトーンで、呆れたような様子。
("A woman with a low voice, speaking hurriedly and unable to hide her irritation. Clear audio quality, a slightly emotional tone, sounding exasperated.")

Example 2 (same input text as Example 1)
Caption (Voice Design): やや高めの男性の声で、気遣いを見せて申し訳なさそうなトーンでやさしく話してほしい。
("A slightly high male voice; speak gently, in a considerate, apologetic tone.")

Example 3
Text (Input): おかしいな、さっきまで確かにここにあったはずなんだけど。誰かが気を利かせて、別の場所に片付けちゃったのかな。
("Strange, it was definitely here until just a moment ago. Did someone thoughtfully put it away somewhere else?")
Caption (Voice Design): 若めの女性が、困惑している様子で独り言を言うかのように呟くような声で話している。
("A youngish woman, sounding puzzled, murmuring as if talking to herself.")

Example 4 (same input text as Example 3)
Caption (Voice Design): 強い疑いや不満を覚えているような様子の大人の女性。かなり怒っている様子で、わざと周りに聞こえるように大きな声で話している。
("An adult woman who sounds strongly suspicious and dissatisfied; quite angry, deliberately speaking loudly so those around her can hear.")

Combining Voice Design with Emojis

You can also combine Voice Design captions with emoji annotations embedded in the text. This allows for even finer control over the delivery, adding specific emotional nuances, pauses, or sound effects on top of the base voice design.

Example 1
Text (with Emoji): これ、昨日からずっと机の上に置きっぱなしになってますよ🤭早く片付けておいてくださいね🫶
("Hey, this has been left sitting on the desk since yesterday 🤭 Please put it away soon, okay? 🫶")
Caption (Voice Design): 大人の男性が、途中笑いながらやさしく諭すように話している。余裕を感じる様子で、少し呆れも混ざっているようなトーン。
("An adult man, laughing partway through, speaking gently and admonishingly. He sounds relaxed, with a touch of exasperation mixed in.")

Example 2
Text (with Emoji): これ😠、昨日からずっと机の上に置きっぱなしになってますよ😒早く片付けておいてくださいね😠
(The same sentence as Example 1, annotated with angry emojis instead.)
Caption (Voice Design): 低めの女性の声で、嫌悪感を示しながら怒っているように話してほしいです。途中で舌打ちを挟み、強い憎しみを持って見下している感じでお願いします。
("A lowish female voice, speaking angrily and with evident disgust; a click of the tongue partway through, contemptuous, as if looking down on the listener.")

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub: Aratako/Irodori-TTS
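Independently of those scripts, the checkpoint can be fetched locally with the standard huggingface_hub client:

```python
# Download the checkpoint (config + safetensors weights) to the local cache;
# the inference scripts from the GitHub repository can then point at this path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Aratako/Irodori-TTS-500M-v2-VoiceDesign")
print(local_dir)
```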

📊 Training Data & Annotation

The model was trained on a high-quality Japanese speech dataset. To enable the Voice Design functionality, the training data was enriched with comprehensive text captions describing the audio characteristics.

The emoji annotations and initial text captions were generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct. Subsequently, the text captions were rephrased and refined using Qwen/Qwen3.5-35B-A3B.

⚠️ Limitations

  • Japanese Only: This model currently supports Japanese text input only.
  • Prompt Adherence: While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
  • Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
  • Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand; a small preprocessing sketch follows this list.
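As a workaround for the Kanji limitation, hard-to-read spans can be converted to hiragana before synthesis. The sketch below uses the third-party pykakasi library; this is one possible preprocessing approach, not something the model itself requires.

```python
# Pre-convert Kanji to hiragana with pykakasi (pip install pykakasi) so the
# model does not have to resolve difficult readings itself.
import pykakasi

kks = pykakasi.kakasi()
text = "急遽来週に延期になったみたい。"   # 急遽 is a reading the model may miss
segments = kks.convert(text)             # list of dicts with 'orig', 'hira', 'kana', ...
hiragana = "".join(seg["hira"] for seg in segments)
print(hiragana)                          # e.g. きゅうきょらいしゅうにえんきになったみたい。
```

In practice you would replace only the problematic spans and keep the rest of the text as written.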

📜 License & Ethical Restrictions

License

This model is released under the MIT License.

Ethical Restrictions

In addition to the license terms, the following ethical restrictions apply:

  1. No Impersonation: Do not use this model's voice-design capabilities to intentionally generate voices that impersonate specific real-world individuals (e.g., voice actors, celebrities, public figures) without their explicit consent.
  2. No Misinformation: Do not use this model to generate synthetic speech intended to mislead others or spread misinformation.
  3. Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.

🙏 Acknowledgments

This project builds upon the following works:

  • llm-jp/llm-jp-3-150m (text encoder token embeddings)
  • Aratako/Semantic-DACVAE-Japanese-32dim (audio latent codec)
  • Qwen/Qwen3-Omni-30B-A3B-Instruct and Qwen/Qwen3.5-35B-A3B (training data annotation)

We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature.

🖊️ Citation

If you use Irodori-TTS in your research or project, please cite it as follows:

@misc{irodori-tts,
  author = {Chihiro Arata},
  title = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2-VoiceDesign}}
}