Merge branch 'main' into main

This commit is contained in:
RVC-Boss
2024-01-29 23:25:48 +08:00
committed by GitHub
51 changed files with 5441 additions and 1101 deletions

@@ -8,8 +8,9 @@ A Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.<br><br>
<img src="https://counter.seku.su/cmoe?name=gptsovits&theme=r34" /><br>
[![Open In Colab](https://img.shields.io/badge/Colab-F9AB00?style=for-the-badge&logo=googlecolab&color=525252)](https://colab.research.google.com/github/RVC-Boss/GPT-SoVITS/blob/main/colab_webui.ipynb)
[![Licence](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/RVC-Boss/GPT-SoVITS/blob/main/LICENSE)
[![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-yellow.svg?style=for-the-badge)](https://huggingface.co/lj1995/GPT-SoVITS/tree/main)
[![Huggingface](https://img.shields.io/badge/🤗%20-Models%20Repo-yellow.svg?style=for-the-badge)](https://huggingface.co/lj1995/GPT-SoVITS/tree/main)
[**English**](./README.md) | [**中文简体**](./docs/cn/README.md) | [**日本語**](./docs/ja/README.md)
@@ -22,8 +23,12 @@ A Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.<br><br>
> Check out our [demo video](https://www.bilibili.com/video/BV12g4y1m7Uw) here!
Unseen speakers few-shot fine-tuning demo:
https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
Users in China can try the full functionality online with AutoDL Cloud Docker: https://www.codewithgpu.com/i/RVC-Boss/GPT-SoVITS/GPT-SoVITS-Official
## Features:
1. **Zero-shot TTS:** Input a 5-second vocal sample and experience instant text-to-speech conversion.
@@ -41,6 +46,7 @@ If you are a Windows user (tested with win>=10) you can install directly via the
- Python 3.9, PyTorch 2.0.1, CUDA 11
- Python 3.10.13, PyTorch 2.1.2, CUDA 12.3
- Python 3.9, PyTorch 2.3.0.dev20240122, macOS 14.3 (Apple silicon, GPU)
_Note: numba==0.56.4 requires py<3.11_
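A quick way to confirm that the active interpreter falls inside the range in the note is sketched below. The helper name `py_supported` is hypothetical, and treating "py<3.11" as "Python 3.x with x < 11" is an assumption of this sketch rather than something the README states.

```bash
# py_supported MAJOR MINOR
# Succeeds when the version is 3.x with x < 11 (assumed reading of "py<3.11").
py_supported() {
  [ "$1" -eq 3 ] && [ "$2" -lt 11 ]
}

# Report on the interpreter found on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
  major=$(python3 -c 'import sys; print(sys.version_info[0])')
  minor=$(python3 -c 'import sys; print(sys.version_info[1])')
  if py_supported "$major" "$minor"; then
    echo "Python $major.$minor: within the range required by numba==0.56.4"
  else
    echo "Python $major.$minor: outside the py<3.11 range required by numba==0.56.4"
  fi
fi
```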
@@ -52,24 +58,11 @@ conda activate GPTSoVits
bash install.sh
```
### Install Manually
#### Make sure you have the distutils for python3.9 installed
```bash
sudo apt-get install python3.9-distutils
```
#### Pip Packages
```bash
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet transformers
```
#### Additional Requirements
If you need Chinese ASR (supported by FunASR), install:
```bash
pip install modelscope torchaudio sentencepiece "funasr>=1.0.0"
pip install -r requirements.txt
```
#### FFmpeg
@@ -106,6 +99,52 @@ For Chinese ASR (additionally), download models from [Damo ASR Model](https://mo
For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal, additionally), download models from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`.
### For Mac Users
Mac users need to meet the following conditions to train and run inference on the GPU:
- Mac computers with Apple silicon or AMD GPUs
- macOS 12.3 or later
- Xcode command-line tools installed by running `xcode-select --install`
_Other Macs can do inference with CPU only._
Then install with the following commands:
#### Create Environment
```bash
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
```
#### Install Requirements
```bash
pip install -r requirements.txt
pip uninstall torch torchaudio
pip3 install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```
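Before running the steps above, two of the listed conditions (macOS 12.3 or later, Xcode command-line tools) can be checked from a terminal. The sketch below is illustrative only: the function names `version_ge` and `check_mac_prereqs` are hypothetical, and it does not try to detect whether the GPU is Apple silicon or AMD.

```bash
# version_ge A B: succeed when dotted version A is >= B (numeric, per component).
# Uses plain POSIX shell, so the helper variables are global.
version_ge() {
  a=$1 b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}; y=${b%%.*}                  # leading component of each version
    [ "$a" = "$x" ] && a= || a=${a#*.}      # drop it from the remainder
    [ "$b" = "$y" ] && b= || b=${b#*.}
    x=${x:-0}; y=${y:-0}                    # missing components count as 0
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
  done
  return 0
}

check_mac_prereqs() {
  if [ "$(uname -s)" != "Darwin" ]; then
    echo "Not macOS; nothing to check."
    return 0
  fi
  mac_ver=$(sw_vers -productVersion)
  if version_ge "$mac_ver" 12.3; then
    echo "macOS $mac_ver: OK (12.3 or later)"
  else
    echo "macOS $mac_ver: older than 12.3, expect CPU-only inference"
  fi
  # xcode-select -p exits non-zero when no developer directory is set.
  if xcode-select -p >/dev/null 2>&1; then
    echo "Xcode command-line tools: installed"
  else
    echo "Xcode command-line tools: missing, run: xcode-select --install"
  fi
}

check_mac_prereqs
```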
### Using Docker
#### docker-compose.yaml configuration
0. Regarding image tags: The codebase changes quickly and packaging and testing images is slow, so check [Docker Hub](https://hub.docker.com/r/breakstring/gpt-sovits) for the latest packaged images and pick the one that suits your situation, or build locally from the Dockerfile according to your own needs.
1. Environment variables
- is_half: Selects half precision or full precision. A wrong setting is typically the cause when the content under the directories 4-cnhubert/5-wav32k is not generated correctly during the "SSL extracting" step. Set it to True or False based on your actual situation.
2. Volumes configuration: The application's root directory inside the container is set to /workspace. The default docker-compose.yaml lists some practical examples for uploading/downloading content.
3. shm_size: The default memory available to Docker Desktop on Windows is too small, which can cause abnormal behavior. Adjust it according to your own situation.
4. Under the deploy section, adjust the GPU-related settings cautiously according to your system and actual circumstances.
#### Running with docker compose
```
docker compose -f "docker-compose.yaml" up -d
```
#### Running with docker command
As above, modify the corresponding parameters based on your actual situation, then run the following command:
```
docker run --rm -it --gpus=all --env=is_half=False --volume=G:\GPT-SoVITS-DockerTest\output:/workspace/output --volume=G:\GPT-SoVITS-DockerTest\logs:/workspace/logs --volume=G:\GPT-SoVITS-DockerTest\SoVITS_weights:/workspace/SoVITS_weights --workdir=/workspace -p 9870:9870 -p 9871:9871 -p 9872:9872 -p 9873:9873 -p 9874:9874 --shm-size="16G" -d breakstring/gpt-sovits:xxxxx
```
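Because the one-line command is hard to adapt, it can help to assemble it from a few variables and print it for review before running it. The sketch below assumes a POSIX shell (Linux, macOS, or WSL) rather than the Windows paths shown above. The variable names `HOST_DIR`, `IMAGE_TAG` and `IS_HALF` and the function `build_docker_cmd` are hypothetical, and the image tag is left as the README's `xxxxx` placeholder.

```bash
# Hypothetical settings; override them in the environment as needed.
HOST_DIR=${HOST_DIR:-"$PWD"}      # host directory that holds output/, logs/, SoVITS_weights/
IMAGE_TAG=${IMAGE_TAG:-xxxxx}     # placeholder: pick a real tag from Docker Hub
IS_HALF=${IS_HALF:-False}

# Print the docker run command (dry run only; nothing is started).
# Paths containing spaces would need quoting before the output is pasted.
build_docker_cmd() {
  printf '%s ' docker run --rm -it --gpus=all \
    "--env=is_half=$IS_HALF" \
    "--volume=$HOST_DIR/output:/workspace/output" \
    "--volume=$HOST_DIR/logs:/workspace/logs" \
    "--volume=$HOST_DIR/SoVITS_weights:/workspace/SoVITS_weights" \
    --workdir=/workspace \
    -p 9870:9870 -p 9871:9871 -p 9872:9872 -p 9873:9873 -p 9874:9874 \
    --shm-size=16G -d \
    "breakstring/gpt-sovits:$IMAGE_TAG"
  printf '\n'
}

build_docker_cmd
```

Printing the command first keeps the flags from the original example visible in one place and makes it easy to confirm the volume paths before anything is launched.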
## Dataset Format
@@ -129,9 +168,9 @@ D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
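The sample line in the hunk header above has four pipe-separated fields. A small check like the following can catch malformed annotation lines before training. It is a sketch under assumptions: the function name `check_list` is hypothetical, and it assumes every line has exactly four non-empty fields and that the text itself contains no `|` character.

```bash
# check_list FILE
# Report lines that do not have exactly four non-empty '|'-separated fields.
# Pass "-" to read from standard input. Exits non-zero if any line is bad.
check_list() {
  awk -F'|' '
    NF != 4 {
      printf "line %d: expected 4 fields, found %d\n", NR, NF
      bad = 1
      next
    }
    {
      for (i = 1; i <= 4; i++)
        if ($i == "") {
          printf "line %d: field %d is empty\n", NR, i
          bad = 1
        }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Example: validate the sample line from the README.
printf '%s\n' 'D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.' \
  | check_list - && echo "list looks well-formed"
```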
## Todo List
- [ ] **High Priority:**
- [ ] Localization in Japanese and English.
- [x] Localization in Japanese and English.
- [ ] User guide.
- [ ] Japanese and English dataset fine tune training.
- [x] Japanese and English dataset fine tune training.
- [ ] **Features:**
- [ ] Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
@@ -140,7 +179,7 @@ D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
- [ ] Experiment with changing SoVITS token inputs to probability distribution of vocabs.
- [ ] Improve English and Japanese text frontend.
- [ ] Develop tiny and larger-sized TTS models.
- [ ] Colab scripts.
- [x] Colab scripts.
- [ ] Try expanding the training dataset (2k hours -> 10k hours).
- [ ] Better SoVITS base model (enhanced audio quality).
- [ ] Model mix.