Edge TTS Subtitle Dubbing (Numpy/Librosa)
🇬🇧 A robust Python tool for converting SRT subtitles into synchronized audio using Microsoft Edge TTS, featuring sample-accurate timing with Numpy and Librosa.
This tool converts SRT subtitles into a single, synchronized audio file using Microsoft Edge TTS with sample-accurate audio processing. It uses a strict Time-Slot Filling algorithm powered by Numpy and Librosa to ensure the generated audio perfectly matches the duration of the original video, preventing desynchronization over time.
Key Features
- Sample-Accurate Synchronization: Uses Numpy/Librosa for precise, sample-level audio concatenation ensuring perfect timing.
- Memory Optimized: List-based accumulation buffer prevents O(N²) memory copying, making it efficient even for very long videos.
- High-Quality Time-Stretching: Uses the `audiostretchy` library for superior audio quality when adjusting speech speed.
- Async Batch Processing: Generates TTS audio in parallel for 2-3x faster processing.
- Smart Text Caching: Automatically reuses audio for identical text segments, saving up to 50% on repetitive content.
- Time-Slot Filling Sync: Ensures every subtitle block takes up exactly the amount of time defined in the SRT, inserting silence if the spoken audio is too short.
- Perfect Video Match: Can pad the final audio to match your video's exact length using `--ref_video`.
- Smart Silence: Inserts precisely calculated silence between lines with sample-level accuracy.
- Multi-Language: Supports all languages and voices provided by Microsoft Edge TTS.
- Neural Voices: Uses high-quality neural voices like `en-US-JennyNeural` and `tr-TR-AhmetNeural`.
- Resume Capability: Can resume from where it left off if interrupted.
- Automatic Late-Start Handling: Intelligently handles overlapping subtitles by forcing maximum speed compression.
- Progress Statistics: Detailed real-time statistics showing generation, caching, and error counts.
Prerequisites
- Python 3.8+
- FFmpeg installed and in PATH (required for `ffprobe` media-duration detection)
Clone and enter directory
```bash
git clone https://github.com/fr0stb1rd/Edge-TTS-Subtitle-Dubbing.git
cd Edge-TTS-Subtitle-Dubbing
```
Virtual Env (Recommended)
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```
Dependencies
```bash
pip install -r requirements.txt
```
Usage
Basic command:
```bash
python src/main.py <input.srt> <output.wav> --voice <voice_name>
```
Example:
```bash
python src/main.py tr.srt output.wav --voice tr-TR-AhmetNeural
```
Advanced Options
| Flag | Description | Default |
|---|---|---|
| `--voice <name>` | Edge TTS voice name (run `edge-tts --list-voices`). | `en-US-JennyNeural` |
| `--ref_video <path>` | Path to the original video. Adds silence at the end to match its duration exactly. | None |
| `--expected_duration <val>` | Manual total duration (seconds or `HH:MM:SS`) if the video is not available. | None |
| `--max_speed <val>` | Maximum speed-up factor (e.g. `2.0`). Increase it if you see many 'Overlap' warnings. | `1.5` |
| `--temp <path>` | Custom temporary directory. | `temp/` in current dir |
| `--keep-temp` | Don't delete temporary files after finishing. | False (auto-delete) |
| `--resume` | Resume processing from existing temp files. | False |
| `--no-concat` | Generate segments only; skip the final merge. | False |
| `--batch_size <num>` | Number of concurrent TTS requests for parallel processing. | `10` |
| `--log_file <path>` | Path to the log file. Auto-creates `<output_name>.log` next to the output file if not specified. | Auto-generated |
| `--log_level <level>` | Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. | `INFO` |
| `--retries <num>` | Number of retry attempts for network failures during TTS generation. | `10` |
| `--format <ext>` | Force output format (`wav`, `m4a`, `opus`). Appends the extension if needed. | None (WAV) |
Example: Full Synchronization Workflow
To guarantee the output audio is exactly the same length as your video:
```bash
# Scenario A: You have the video locally
python src/main.py subtitles.srt dub.wav --ref_video original_movie.mp4

# Scenario B: You know the video duration (no video file needed)
# You can provide the duration in "HH:MM:SS" format or total seconds.

# Option 1: HH:MM:SS.mmm (e.g., 1 hour, 30 mins, 5.123 seconds)
python src/main.py subtitles.srt dub.wav --expected_duration "01:30:05.123"

# Option 2: Seconds (e.g., 90 minutes)
python src/main.py subtitles.srt dub.wav --expected_duration 5400.5
```
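For reference, a flexible `--expected_duration` value like the ones above can be parsed into seconds with a few lines of Python. This is an illustrative sketch, not code from this repository; the helper name is hypothetical:

```python
# Hypothetical helper: convert "HH:MM:SS.mmm" or plain seconds to a float.
def parse_duration(value: str) -> float:
    if ":" in value:
        hours, minutes, seconds = value.split(":")
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return float(value)

assert abs(parse_duration("01:30:05.123") - 5405.123) < 1e-6
assert parse_duration("5400.5") == 5400.5
```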
Example: Logging Options
Control logging output and verbosity:

```bash
# Default: Creates output.log with INFO level
python src/main.py subtitles.srt output.wav --voice en-US-JennyNeural

# Custom log file location
python src/main.py subtitles.srt output.wav --log_file ~/logs/dubbing.log

# Debug level for troubleshooting
python src/main.py subtitles.srt output.wav --log_level DEBUG

# Minimal logging (errors only)
python src/main.py subtitles.srt output.wav --log_level ERROR
```
Utility Tips
Getting Video Duration with ffprobe
If you need to find the exact duration of a video file for the --expected_duration parameter:
```bash
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video.mp4
```
This will output the duration in seconds (e.g., 5400.5).
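If you prefer to query the duration from Python, a thin wrapper around the same command might look like this (assumes `ffprobe` is on PATH; the helper is illustrative and not part of this project's CLI):

```python
# Illustrative wrapper: read a media file's duration in seconds via ffprobe.
import subprocess

def media_duration(path: str) -> float:
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())

print(media_duration("video.mp4"))  # e.g. 5400.5
```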
Finding Voices
List all available voices:
```bash
edge-tts --list-voices
```
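The same list is available from Python via the `edge-tts` package's async `list_voices()` helper, which is convenient for filtering by locale. The filtering below is an illustrative sketch:

```python
# Sketch: list Edge TTS voices from Python and filter to one locale.
import asyncio
import edge_tts

async def voices_for(locale: str) -> list[str]:
    voices = await edge_tts.list_voices()
    return [v["ShortName"] for v in voices if v["Locale"] == locale]

print(asyncio.run(voices_for("tr-TR")))  # e.g. ['tr-TR-AhmetNeural', 'tr-TR-EmelNeural']
```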
Performance Optimizations
The tool includes several optimizations for fast processing:
Async Batch Processing
2-3x faster generation through parallel TTS requests:
```bash
# Default: 10 concurrent requests
python src/main.py subtitles.srt output.wav

# Faster on good networks (20 concurrent)
python src/main.py subtitles.srt output.wav --batch_size 20

# Safer on slow networks (5 concurrent)
python src/main.py subtitles.srt output.wav --batch_size 5
```
How it works:
- Segments are generated in parallel batches instead of sequentially
- Configurable batch size to balance speed vs network load
- Progress shown per batch
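Conceptually, the batching pattern looks like the sketch below, which uses the `edge-tts` package's `Communicate(text, voice).save(path)` API with a semaphore as the concurrency cap. The surrounding logic is illustrative, not this project's exact implementation:

```python
# Sketch of batched async TTS generation with a concurrency cap.
import asyncio
import edge_tts

async def synth_all(texts, voice="en-US-JennyNeural", batch_size=10):
    sem = asyncio.Semaphore(batch_size)  # at most batch_size requests in flight

    async def synth_one(index, text):
        async with sem:
            await edge_tts.Communicate(text, voice).save(f"raw_{index}.mp3")

    await asyncio.gather(*(synth_one(i, t) for i, t in enumerate(texts)))

asyncio.run(synth_all(["Hello.", "Thank you.", "Goodbye."]))
```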
Smart Text Caching
20-75% faster on files with repeated text:
- Automatically detects identical subtitle text
- Generates TTS once, reuses for all duplicates
- Cache stored in temp directory (auto-cleaned unless `--keep-temp` is set)
Example:
```text
100 segments with common phrases:
- "Yes" appears 15 times → Generated once, cached 14 times
- "Thank you" appears 10 times → Generated once, cached 9 times
Result: 23% fewer TTS requests!
```
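A minimal sketch of the caching idea, keyed on an MD5 hash of the text as in the pipeline diagram under Technical Details (the file layout and helper names here are illustrative):

```python
# Sketch: generate TTS once per unique text, copy cached audio for duplicates.
import hashlib
import shutil
from pathlib import Path

CACHE_DIR = Path("temp/cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_tts(text: str, index: int, generate) -> Path:
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    cached = CACHE_DIR / f"{key}.mp3"
    if not cached.exists():
        generate(text, cached)           # TTS runs only for unseen text
    target = Path(f"temp/raw_{index}.mp3")
    shutil.copyfile(cached, target)      # duplicates reuse the cached audio
    return target
```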
Combined Performance
Expected speed improvements:
- Small files (< 50 segments): 2-3x faster
- Large files (500+ segments): 3-5x faster
- Repetitive content: Up to 5x faster
Progress Statistics
The tool provides detailed progress information:
```text
============================================================
Processing Summary:
Total segments: 100
Generated: 65            # New TTS audio created
Cached (text reuse): 20  # Duplicates reused from cache
Resumed: 15              # Existing files from previous run
Empty subtitles: 2       # Blank subtitle entries
Failed (using silence): 0
Overlaps detected: 1
Late starts (speed-up): 1
Output audio duration: 3645.23s
Target match accuracy: 99.97%
============================================================
```
Statistics explained:
- Generated: Unique TTS audio files created this run
- Cached: Segments reused from smart caching (same text)
- Resumed: Files from a previous interrupted run (with `--resume`)
- Target match accuracy: How closely the output matches the expected duration
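One plausible reading of the accuracy figure is the relative deviation of the output duration from the target; the formula and target value below are assumptions for illustration, not taken from the source:

```python
# Hypothetical reading of "Target match accuracy" (formula is an assumption).
target, output = 3646.32, 3645.23  # illustrative numbers
accuracy = (1 - abs(output - target) / target) * 100
print(f"{accuracy:.2f}%")  # 99.97%
```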
Technical Details
Audio Processing Pipeline
The tool uses a sophisticated audio processing pipeline for maximum quality and synchronization:
```mermaid
graph TD
classDef decision stroke:#2196F3,stroke-width:2px,fill:none
classDef action stroke:#4CAF50,stroke-width:2px,fill:none
Start([Start]) --> Init["Setup Logging & Directories"]
subgraph Initialization["1: Analysis & Setup"]
Init --> ParseSRT["Parse SRT Subtitles"]
ParseSRT --> TargetDur["Determine Target Duration (Ref Video/Arg)"]
TargetDur --> HashCheck["Generate MD5 Hash per Unique Text"]
end
subgraph Resilient_Generation["2: Deduplicated Async TTS"]
HashCheck --> Dedup["Skip Existing Cache or Duplicate Text"]
Dedup --> AsyncBatch["Async Batch TTS (Edge-TTS)"]
AsyncBatch --> RetryLogic{Network Fail?}
RetryLogic -- Yes --> Backoff["Exponential Backoff Retry"]
Backoff --> AsyncBatch
RetryLogic -- No --> SaveCache["Store in MD5 Cache"]
SaveCache --> CopyToRaw["Copy Cache to Segment Index (raw_i.mp3)"]
end
subgraph Processing["3: Sample-Accurate Sync"]
CopyToRaw --> Loop{Subtitle Loop}
Loop --> SyncCheck{Is Current Head Late?}
SyncCheck -- Yes --> CatchUp["Force Max Speed (max_speed factor)"]
SyncCheck -- No --> Gap{Silence Needed?}
Gap -- Yes --> Silence["Insert Zero-Buffer (Gap-Fill)"]
Gap -- No --> Fit
CatchUp --> Fit["audiostretchy: Stretch to Target Duration"]
Silence --> Fit
Fit --> Precision["Trim/Pad to Exact Sample Count"]
Precision --> Accumulate["Concatenate to Numpy Array"]
Accumulate --> Loop
end
subgraph Export["4: Finalization"]
Loop -- Finished --> Padding["Add Final Padding to Match Ref Video"]
Padding --> FormatCheck{Format Override?}
FormatCheck -- Yes --> FFmpeg["Convert via FFmpeg (AAC/Opus)"]
FormatCheck -- No --> ExportWav["Final SF Write (WAV)"]
FFmpeg --> Cleanup["Remove Temp Directory"]
ExportWav --> Cleanup
Cleanup --> End([End])
end
class RetryLogic,Loop,SyncCheck,Gap,FormatCheck decision
class AsyncBatch,ExportWav,FFmpeg,End action
```
- TTS Generation: Uses Microsoft Edge TTS to generate MP3 audio for each subtitle segment
- Time-Stretching: Uses the `audiostretchy` library to adjust audio duration while maintaining quality
- Sample-Accurate Concatenation: Numpy arrays ensure precise timing at the sample level (24 kHz)
- List-Based Accumulation: Segments are stored in a list and concatenated once, avoiding O(N²) memory complexity
- Exact Trimming/Padding: Final audio is trimmed or padded to exact sample count to prevent drift
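Put together, the fit-to-slot step might look roughly like the sketch below, combining `audiostretchy`'s `stretch_audio()` with sample-exact trimming and padding. The glue code is illustrative, not this project's exact implementation:

```python
# Sketch: speed a segment up toward its subtitle slot (never beyond
# max_speed), then trim or zero-pad to the exact sample count.
import numpy as np
import librosa
from audiostretchy.stretch import stretch_audio

SR = 24000  # processing sample rate

def fit_to_slot(in_path, out_path, slot_seconds, max_speed=1.5):
    audio, _ = librosa.load(in_path, sr=SR)
    actual = len(audio) / SR
    # ratio < 1 shortens audio; clamp so we never exceed max_speed,
    # and never slow speech down (silence fills short segments instead)
    ratio = min(1.0, max(slot_seconds / actual, 1.0 / max_speed))
    stretch_audio(in_path, out_path, ratio=ratio)
    stretched, _ = librosa.load(out_path, sr=SR)
    target_samples = round(slot_seconds * SR)
    if len(stretched) >= target_samples:
        return stretched[:target_samples]               # trim overshoot
    pad = np.zeros(target_samples - len(stretched), dtype=stretched.dtype)
    return np.concatenate([stretched, pad])             # pad with silence
```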
Memory Optimization
For long videos with many subtitles, the tool uses a list-based buffer instead of repeated numpy.concatenate() calls. This prevents performance degradation and memory issues that would occur with the naive approach.
Memory Usage:
- Minimal footprint: List-based buffer prevents memory bloat
- Scales linearly with file length
- Tested with 1000+ segment files without issues
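A minimal illustration of the pattern, using silence as stand-in segment data:

```python
# np.concatenate copies the whole buffer on every call (O(N^2) overall);
# appending to a list and concatenating once copies each sample only once.
import numpy as np

segments = []                                           # accumulation buffer
for _ in range(1000):
    segments.append(np.zeros(24000, dtype=np.float32))  # 1 s of audio each

final_audio = np.concatenate(segments)                  # single O(N) copy
print(final_audio.shape)                                # (24000000,)
```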
Late-Start Handling
If a subtitle starts late (overlaps with previous audio), the tool automatically:
- Detects the overlap and issues a warning
- Forces maximum speed compression (up to the `--max_speed` factor), as sketched below
- Continues processing to maintain overall synchronization
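A simplified view of that decision, with hypothetical variable names:

```python
# Sketch: if the accumulated audio head is past a subtitle's start time,
# only the remaining slot is available and the segment must be compressed
# (up to max_speed) to fit it.
def effective_slot(head_seconds, start, end, max_speed=1.5):
    if head_seconds > start:                  # late start: overlap detected
        print(f"Warning: overlap at {start:.2f}s; compressing up to {max_speed}x")
        return max(end - head_seconds, 0.0)   # whatever time remains
    return end - start                        # normal case: full slot

print(effective_slot(10.4, 10.0, 13.0))  # 2.6 (instead of the full 3.0)
```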
Supported Voices (Selection)
| Name | Gender | Category |
|---|---|---|
| English (US) | | |
| `en-US-JennyNeural` | Female | General |
| `en-US-ChristopherNeural` | Male | News |
| `en-US-GuyNeural` | Male | News |
| Turkish | | |
| `tr-TR-AhmetNeural` | Male | General |
| `tr-TR-EmelNeural` | Female | General |
| Chinese | | |
| `zh-CN-XiaoxiaoNeural` | Female | Warm |
| `zh-CN-YunyangNeural` | Male | Professional |

(Run `edge-tts --list-voices` for the full list.)
License
This project is licensed under the MIT License. See the LICENSE file for details or visit the repository.