Vovsoft Speech to Text Converter supports offline and online speech engines.
Offline speech recognition engines do not require internet connection and they run locally on your computer. However, they tend to be less accurate than online models, especially with complex speech or accents. Additionally, they can consume more local resources, potentially slowing down your system.
Vosk is a speech recognition toolkit that works offline, supporting 20+ languages. It performs speech-to-text conversion on your own computer. No file is sent to internet in any case.
Vosk requires a "models" directory, which includes language data. Vovsoft Speech to Text Converter embeds lightweight English and French models. If you need other languages or not satisfied with the results, please follow the steps:
👉 Please note that .NET Framework 4.8 is required to use this feature. This version of the .NET Framework comes preinstalled on most Windows 10 and Windows 11 systems.
Continuous Dictation requires the "Microsoft Speech Platform", which is preinstalled on most systems. This feature supports English, French, German, Japanese, Simplified Chinese, Spanish, and Traditional Chinese.
How to change speech recoginiton settings:
If no speech recognizer is installed on your system, please follow the steps:
If you want to perform speech-to-text conversion on cloud servers instead of your own computer and take advantage of the latest AI advancements, you will need credentials from at least one of these providers:
API Provider | Pricing | Free Tier | Credit Card |
---|---|---|---|
1. OpenAI | $0.0060 per minute | No | Required |
2. Deepgram | $0.0044 per minute | $200 free credit ✔️ | Not Required 😊 |
3. Microsoft Azure | $0.0167 per minute | 300 minutes per month ✔️ | Required |
4. IBM Cloud | $0.0100 per minute | 500 minutes per month ✔️ | Required |
In order to get your OpenAI API key, please follow these steps:
👉 What is the maximum file size?
OpenAI Whisper's file size limit is 26,214,400 bytes (25MB).
👉 What is API Temperature?
The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
In order to get your Deepgram API key, please follow these steps:
In order to get your Microsoft Azure API key and API Region, please follow these steps:
Credentials screen of Microsoft Azure
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Marathi, Mongolian, Nepali, Norwegian Bokmål, Pashto, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, and Zulu are supported in Microsoft Azure Cognitive Services.
In order to get your IBM Cloud API key and API URL, please follow these steps:
Credentials screen of IBM Cloud
Enter your API key and URL into the Settings panel inside "Vovsoft Speech to Text Converter". The software is now ready to convert audio to text.
English, Arabic, Chinese (Mandarin), Czech, Dutch, French, German, Hindi (Indian), Italian, Japanese, Korean, Portuguese (Brazilian) and Spanish are supported in IBM Cloud API.
For most languages, the IBM Cloud service supports broadband, narrowband, telephony and multimedia models:
Choosing the correct model is important. Use the model that matches the sampling rate (and language) of your audio. The service automatically adjusts the sampling rate of your audio to match the model that you specify. More information: https://cloud.ibm.com/apidocs/speech-to-text
Conversion times are listed in the table below. Please note that the specified times vary depending on the content of the file, its quality, language model, load of the AI servers and your computer's upload speed.
Audio Length | Audio Quality | Language Model | Approximate Conversion Time |
5 minutes | 48 kHz Stereo | English (Broadband) | 1 minute and 20 seconds |
5 minutes | 8 kHz Mono | English (Narrowband) | 1 minute and 30 seconds |
30 minutes | 48 kHz Stereo | English (Broadband) | 9 minutes |
30 minutes | 8 kHz Mono | English (Narrowband) | 10 minutes |
HTTP/1.1 503 Service Unavailable
Your URL is wrong. Please enter the exact "API Key" and "API URL" that was supplied for you by IBM Cloud.
Error sending data: (12030) The connection with the server was terminated abnormally
A firewall, proxy or antivirus interferes with the connection. Please try to disable them or use different internet connection.
Error reading data: (12152) The server returned an invalid or unrecognized response
Your audio is too long. Please try to convert a shorter audio.
"Please wait" hangs, nothing happens
The file size of your audio is too large. Please try to upload a smaller file. Converting stereo to mono may help.