Vovsoft Speech to Text Converter requires credentials for one of these providers:
- Microsoft Azure Cognitive Services API, which can convert up to 300 minutes per month for free
- IBM Cloud Speech to Text API, which can convert up to 500 minutes per month for free
⚠️ IBM Cloud and Microsoft Azure may require a valid credit card for registration and may not be available in some countries such as China and Taiwan.
How to get Microsoft Azure API Key and API Region
In order to get your Microsoft Azure API key and API Region, please follow these steps:
- Go to https://portal.azure.com and create your Microsoft Azure account for free.
- Click Create a resource.
- Choose Cognitive Services.
- After you have created your Speech Service, your credentials can be found on the Keys and Endpoint page: KEY1 and KEY2 (either key should work) and the Location/Region field.
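Before pasting the key and region into the application, it can help to check them independently. The following is a minimal sketch, assuming the standard Azure Speech token endpoint (`https://<region>.api.cognitive.microsoft.com/sts/v1.0/issueToken`) and a lowercase region identifier such as `westeurope`. The function names are hypothetical and are not part of Vovsoft Speech to Text Converter.

```python
import urllib.request


def build_azure_token_request(api_key: str, region: str) -> urllib.request.Request:
    """Build (but do not send) a token request for the Azure Speech service.

    Assumption: `region` is the lowercase identifier shown in the
    Location/Region field, for example "westeurope".
    """
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    return urllib.request.Request(
        url,
        data=b"",  # the token endpoint expects a POST with an empty body
        headers={"Ocp-Apim-Subscription-Key": api_key},
        method="POST",
    )


def azure_credentials_look_valid(api_key: str, region: str, timeout: float = 10.0) -> bool:
    """Return True if the service issues a token for this key and region."""
    request = build_azure_token_request(api_key.strip(), region.strip().lower())
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors (such as a rejected key), DNS failures for an
        # unknown region, and timeouts.
        return False
```

If the check returns `False`, the most likely causes are a mistyped key or a region that does not match the one shown next to the keys.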
How to get IBM Cloud API Key and API URL
In order to get your IBM Cloud API key and API URL, please follow these steps:
- Go to https://cloud.ibm.com/registration and create your IBM Cloud account for free.
Please note: If you get the error message "Your account cannot be created at this time", try registering with a Gmail address. IBM Cloud appears to reject some email providers, so if you run into this problem, use a different email address.
- Go to https://cloud.ibm.com/catalog/services/speech-to-text and create your Speech to Text Lite Plan instance.
- Go to https://cloud.ibm.com/resources and, under the AI / Machine Learning tab, click on your Speech to Text instance. Your credentials (API key and URL) are displayed on the Manage or Service credentials page.
Enter your API key and URL into the Settings panel inside "Vovsoft Speech to Text Converter". The software is now ready to convert audio to text.
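To illustrate how an API key and URL pair is typically used, here is a minimal sketch of a direct call to the IBM Speech to Text `/v1/recognize` endpoint. It assumes HTTP basic authentication with the literal user name `apikey`, WAV input, and an optional `model` query parameter. The function name and the example values are hypothetical, and this is not necessarily how the application itself sends requests.

```python
import base64
import urllib.parse
import urllib.request
from typing import Optional


def build_ibm_recognize_request(
    api_key: str,
    api_url: str,
    audio: bytes,
    content_type: str = "audio/wav",
    model: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a recognition request for IBM Speech to Text."""
    # The API URL from the credentials page is the base; the recognition
    # method lives under /v1/recognize.
    endpoint = api_url.rstrip("/") + "/v1/recognize"
    if model:
        endpoint += "?" + urllib.parse.urlencode({"model": model})

    # Basic auth: the user name is the literal string "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        endpoint,
        data=audio,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": content_type,
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(request)` should return a JSON document containing the transcript if the key and URL are correct.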
Languages and Models
English, Arabic, Chinese (Mandarin), Czech, Dutch, French, German, Hindi (Indian), Italian, Japanese, Korean, Portuguese (Brazilian) and Spanish are supported.
For most languages, the IBM Cloud service supports broadband, narrowband, telephony and multimedia models:
- Broadband models are for audio that is sampled at greater than or equal to 16 kHz.
- Narrowband models are for audio that is sampled at 8 kHz. Use narrowband models for offline decoding of telephone speech, which is the typical use for this sampling rate.
- Telephony models are intended specifically for audio that is communicated over a telephone. Like previous-generation narrowband models, telephony models are intended for audio that has a minimum sampling rate of 8 kHz.
- Multimedia models are intended for audio that is extracted from sources with a higher sampling rate, such as video. Use a multimedia model for any audio other than telephonic audio. Like previous-generation broadband models, multimedia models are intended for audio that has a minimum sampling rate of 16 kHz.
Choosing the correct model is important. Use the model that matches the sampling rate (and language) of your audio. The service automatically adjusts the sampling rate of your audio to match the model that you specify.
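As a rough illustration of the rule above, the sketch below reads the sampling rate of a WAV file with Python's standard `wave` module and suggests a model family. The thresholds follow the descriptions in this section (16 kHz and above for broadband or multimedia, 8 kHz and above for narrowband or telephony). The function names and returned labels are hypothetical, and the choice between previous-generation and newer models is left to the reader.

```python
import wave


def wav_sample_rate(path_or_file) -> int:
    """Return the sampling rate in Hz of a WAV file (path or file object)."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getframerate()


def suggest_model_family(sample_rate_hz: int) -> str:
    """Map a sampling rate to the model family described in this section."""
    if sample_rate_hz >= 16000:
        return "broadband or multimedia"
    if sample_rate_hz >= 8000:
        return "narrowband or telephony"
    raise ValueError(
        f"{sample_rate_hz} Hz is below the 8 kHz minimum for the listed models"
    )
```

For example, `suggest_model_family(wav_sample_rate("interview.wav"))` (with a hypothetical file name) would point a 48 kHz recording to the broadband or multimedia models.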
Approximate Conversion Time
Conversion times are listed in the table below. Please note that the specified times vary depending on the content of the file, its quality, language model, load of the AI servers and your computer's upload speed.
Audio Format || Approximate Conversion Time
48 kHz Stereo || 1 minute and 20 seconds
8 kHz Mono || 1 minute and 30 seconds
48 kHz Stereo ||
8 kHz Mono ||
Troubleshooting
HTTP/1.1 503 Service Unavailable
Your URL is wrong. Please enter the exact "API Key" and "API URL" that IBM Cloud supplied to you.
Error reading data: (12152)
Your audio is too long. Please try converting a shorter audio file.
"Please wait" hangs, nothing happens
Your audio file is too large. Please try uploading a smaller file. Converting stereo to mono may help.
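One way to convert stereo to mono without extra tools is sketched below. It assumes an uncompressed 16-bit PCM WAV file and averages the left and right channels, which roughly halves the file size. The function names are hypothetical, and any audio editor can do the same job.

```python
import struct
import wave


def downmix_pcm16(frames: bytes) -> bytes:
    """Average interleaved 16-bit little-endian stereo samples into mono."""
    count = len(frames) // 2  # number of 16-bit samples across both channels
    samples = struct.unpack(f"<{count}h", frames[: count * 2])
    # Samples are interleaved as left, right, left, right, ...
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count - 1, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


def stereo_wav_to_mono(src, dst) -> None:
    """Write a mono copy of a 16-bit stereo WAV (paths or file objects)."""
    with wave.open(src, "rb") as reader:
        if reader.getnchannels() != 2 or reader.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo PCM WAV file")
        rate = reader.getframerate()
        frames = reader.readframes(reader.getnframes())

    with wave.open(dst, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)  # the sampling rate is left unchanged
        writer.writeframes(downmix_pcm16(frames))
```

Usage would look like `stereo_wav_to_mono("input.wav", "input_mono.wav")`, with both file names being placeholders.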