# Irodori-TTS-500M-v2-VoiceDesign
Irodori-TTS-500M-v2-VoiceDesign is a Japanese Text-to-Speech model based on a Rectified Flow Diffusion Transformer (RF-DiT) architecture. Derived from the base v2 model, this variant replaces the reference latent encoder with a caption encoder.
Instead of requiring reference audio for voice cloning, this model features Voice Design, allowing you to fully control and generate the speaker's voice, emotion, and speaking style solely through a descriptive text prompt (caption).
Additionally, the model retains emoji-based style and sound-effect control: by inserting specific emojis into the input text, you can further fine-tune speaking styles, emotions, and sound effects during generation.
## Key Features
- Flow Matching TTS: Rectified Flow Diffusion Transformer over continuous DACVAE latents for high-quality Japanese speech synthesis.
- Voice Design (Caption Conditioning): Generate diverse voices by describing the speaker's tone, age, gender, and emotion in a text caption. No reference audio is required.
- Emoji-based Style Control: Embed emojis directly in the input text for granular control over the delivery. See `EMOJI_ANNOTATIONS.md` for the full list of supported emojis.
## Architecture
The model (approximately 500M parameters) consists of three main components:
- Text Encoder: Token embeddings initialized from llm-jp/llm-jp-3-150m, followed by self-attention + SwiGLU transformer layers with RoPE.
- Caption Encoder: Encodes the style-control text (captions) to define the speaker and acoustic environment, bypassing the need for a reference audio branch.
- Diffusion Transformer: Joint-attention DiT blocks with Low-Rank AdaLN (timestep-conditioned adaptive layer normalization), half-RoPE, and SwiGLU MLPs.
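To make the Low-Rank AdaLN idea concrete, here is a rough pure-Python sketch (not the repository's code; the dimensions, rank, and names are illustrative assumptions): the timestep embedding is projected through a narrow rank-`r` bottleneck to produce the per-block scale and shift, replacing the full `d → 2d` modulation projection of standard AdaLN.

```python
import math

def vec_mat(v, m):
    """Multiply a row vector v (length n) by a matrix m (n rows)."""
    return [sum(vi * row[j] for vi, row in zip(v, m)) for j in range(len(m[0]))]

def low_rank_adaln(x, t_emb, down, up, eps=1e-6):
    """Adaptive LayerNorm whose scale/shift come from a low-rank projection
    of the timestep embedding: t_emb (len n) -> rank r -> 2*d parameters.

    x: list of seq rows, each of width d (hidden states of one DiT block).
    """
    d = len(x[0])
    mod = vec_mat(vec_mat(t_emb, down), up)   # low-rank head: n -> r -> 2*d
    scale, shift = mod[:d], mod[d:]
    out = []
    for row in x:
        mu = sum(row) / d
        var = sum((v - mu) ** 2 for v in row) / d
        norm = [(v - mu) / math.sqrt(var + eps) for v in row]  # LN without learned affine
        out.append([n * (1.0 + sc) + sh for n, sc, sh in zip(norm, scale, shift)])
    return out

# Toy sizes: seq=3, d=4, timestep embedding n=3, rank r=2 (r << d in practice).
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5], [2.0, 2.0, 0.0, 0.0]]
t_emb = [0.1, -0.2, 0.3]
down = [[0.5, -0.1], [0.0, 0.2], [1.0, 0.3]]   # 3 x 2
up = [[0.1] * 8, [-0.2] * 8]                   # 2 x 8 -> [scale | shift]
out = low_rank_adaln(x, t_emb, down, up)
print(len(out), len(out[0]))
```

The point of the low-rank factorisation is parameter count: each block needs only `n*r + r*2d` modulation weights instead of `n*2d`.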
Audio is represented as continuous latent sequences via the Aratako/Semantic-DACVAE-Japanese-32dim codec (32-dim), enabling high-quality 48kHz waveform reconstruction.
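At inference time, a rectified-flow model generates latents by integrating its learned velocity field from noise (t=0) to data (t=1); the DACVAE decoder then turns the latents into a waveform. The sketch below shows the plain Euler integration loop with a toy analytic velocity field standing in for the DiT (the function names, step count, and toy field are illustrative assumptions, not the repository's API):

```python
def euler_rf_sample(velocity_fn, x, steps=64):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with plain Euler steps.
    In the real model, x would be noise-initialised DACVAE latent frames and
    velocity_fn a DiT forward pass conditioned on the text and caption."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy velocity field: for a rectified (straight-line) flow toward target x1 = 0,
# the velocity at x_t is (x1 - x_t) / (1 - t).
toy_v = lambda x, t: [(0.0 - xi) / (1.0 - t) for xi in x]
out = euler_rf_sample(toy_v, [3.0, -1.5, 0.25])
print(out)  # the straight-line flow lands exactly on the target: [0.0, 0.0, 0.0]
```

More steps trade speed for accuracy; rectified flows are trained to follow near-straight paths, which is what makes them cheap to sample with few Euler steps.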
## 🎧 Audio Samples

### Voice Design via Captions
Examples of controlling the speaker's voice, emotion, and style purely through descriptive text captions. Notice how the same input text can be delivered in entirely different ways.
| Text (Input) | Caption (Voice Design) | Generated Audio |
|---|---|---|
| 明日の午後に予定していた会議だけど、急遽来週に延期になったみたい。悪いんだけど、資料の準備は一旦ストップしておいてもらえるかな。 | 低い声の女性が、苛立ちを隠せない様子で焦って話している。クリアな音質、少し感情的なトーンで、呆れたような様子。 | |
| 明日の午後に予定していた会議だけど、急遽来週に延期になったみたい。悪いんだけど、資料の準備は一旦ストップしておいてもらえるかな。 | やや高めの男性の声で、気遣いを見せて申し訳なさそうなトーンで、おずおず話してほしい。 | |
| おかしいな。さっきまで確かにここにあったはずなんだけど、誰かが気を利かせて、別の場所に片付けちゃったのかな。 | 若めの女性が、困惑している様子で独り言を言うかのように呟くような声で話している。 | |
| おかしいな。さっきまで確かにここにあったはずなんだけど、誰かが気を利かせて、別の場所に片付けちゃったのかな。 | 強い疑いや不満を見せているような様子の大人の女性が、かなり怒っている様子で、わざと周りに聞こえるように大きな声で話している。 | |
### Combining Voice Design with Emojis
You can also combine Voice Design captions with emoji annotations embedded in the text. This allows for even finer control over the delivery, adding specific emotional nuances, pauses, or sound effects on top of the base voice design.
| Text (with Emoji) | Caption (Voice Design) | Generated Audio |
|---|---|---|
| これ、昨日からずっと机の上に置きっぱなしになってますよ🤭早く片付けておいてくださいね🫶 | 大人の男性が、途中笑いながらやさしく諭すように話している。余裕を感じる様子で、少し呆れが混ざっているようなトーン。 | |
| これ😠、昨日からずっと机の上に置きっぱなしになってますよ😒早く片付けておいてくださいね🙄 | 低めの女性の声で、嫌悪感を示しながら怒っているように話してほしいです。途中で舌打ちを挟み、強い憎しみを持って見下している感じでお願いします。 | |
## Usage
For inference code, installation instructions, and training scripts, please refer to the GitHub repository:
GitHub: Aratako/Irodori-TTS
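The repository defines the actual inference API. As a purely hypothetical sketch of what a Voice Design request combines (the field names and values here are illustrative assumptions, not the repository's API), a single generation pairs emoji-annotated text with a free-form voice caption:

```python
# Hypothetical request structure -- consult the GitHub repository for the real API.
request = {
    # Text to synthesize; embedded emojis act as style/effect annotations.
    "text": "こんにちは！🤭 今日はとてもいい天気ですね。",
    # Free-form Japanese caption describing speaker, emotion, and delivery.
    "caption": "落ち着いた低めの声の女性が、ゆっくりと優しいトーンで話している。",
    # Typical knobs for a rectified-flow sampler (names are assumptions).
    "num_steps": 32,
    "cfg_scale": 2.0,
}
print(sorted(request))
```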
## Training Data & Annotation
The model was trained on a high-quality Japanese speech dataset. To enable the Voice Design functionality, the training data was enriched with comprehensive text captions describing the audio characteristics.
The emoji annotations and initial text captions were generated and labeled using a fine-tuned model based on Qwen/Qwen3-Omni-30B-A3B-Instruct. Subsequently, the text captions were rephrased and refined using Qwen/Qwen3.5-35B-A3B.
## ⚠️ Limitations
- Japanese Only: This model currently supports Japanese text input only.
- Prompt Adherence: While the model generally follows the caption's instructions, highly complex or contradictory descriptions might result in inconsistent voice generation.
- Emoji Control: While emoji-based style control adds expressiveness, the effect may vary depending on context and is not always perfectly consistent.
- Kanji Reading Accuracy: The model's ability to accurately read Kanji is relatively weak compared to other TTS models of a similar size. You may need to convert complex Kanji into Hiragana or Katakana beforehand.
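Whether an input needs such pre-conversion can be checked with a few lines of standard-library Python. The helper below is a sketch (not part of the model's tooling) that flags characters in the CJK Unified Ideographs block; third-party tools such as pykakasi can then perform the actual kanji-to-kana conversion.

```python
def contains_kanji(text: str) -> bool:
    """True if text contains characters from the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

print(contains_kanji("きょうは いい てんきですね"))  # False: kana only
print(contains_kanji("今日はいい天気ですね"))          # True: contains kanji
```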
## License & Ethical Restrictions
### License
This model is released under the MIT License.
### Ethical Restrictions
In addition to the license terms, the following ethical restrictions apply:
- No Impersonation: Do not use this model's captioning capabilities to intentionally generate voices that impersonate specific real-world individuals (e.g., voice actors, celebrities, public figures) without their explicit consent.
- No Misinformation: Do not use this model to generate synthetic speech intended to mislead others or spread misinformation.
- Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.
## Acknowledgments
This project builds upon the following works:
- Echo-TTS: Architecture and training design reference
- DACVAE: Audio VAE
- llm-jp/llm-jp-3-150m: Tokenizer and embedding weight initialization
We would also like to extend our special thanks to Respair for the inspiration behind the emoji annotation feature.
## Citation
If you use Irodori-TTS in your research or project, please cite it as follows:
```bibtex
@misc{irodori-tts,
  author       = {Chihiro Arata},
  title        = {Irodori-TTS: A Flow Matching-based Text-to-Speech Model with Emoji-driven Style Control},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/Irodori-TTS-500M-v2-VoiceDesign}}
}
```
## Model Tree

Base model: Aratako/Irodori-TTS-500M-v2