Transcribe Video to Text with Speech To Text GCP and Sonix AI

A few days ago, I created a post about Convert Audio and Video to Text with Sonix AI, that is not a sponsored post anyway, I just shared what I experienced with 😁

After got Google Cloud Platform credits yesterday, i want to try different way to transcribe video to text, with Speech To Text on GCP.

First, we have to create a Compute Engine VM instance with scopes :

Allow full access to all Cloud APIs

Cloud API access scopes : Allow full access to all Cloud APIs

and then upload the video to your Compute Engine VM, its easy if we use SSH-in-browser, just click the ⬆️and then choose the files you want to upload.

Upload button ⬆️
Upload the video
Video is successfully uploaded and ready to convert to audio

Extract Audio (mp3) from Video

install ffmpeg first with command :

sudo apt install ffmpeg

then we start to extract video to audio by this command :

ffmpeg -i wego-head.mp4 wego-head.mp3

Results :

ariq@instance-1:~$ ffmpeg -i wego-head.mp4 wego-head.mp3
ffmpeg version 3.4.11-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'wego-head.mp4':
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf54.29.104
  Duration: 00:00:26.75, start: 0.000000, bitrate: 760 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, bt709), 1280x720 [SAR 1:1 DAR 16:9], 624 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
      handler_name    : SoundHandler
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'wego-head.mp3':
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    TSSE            : Lavf57.83.100
    Stream #0:0(und): Audio: mp3 (libmp3lame), 44100 Hz, stereo, fltp (default)
      handler_name    : SoundHandler
      encoder         : Lavc57.107.100 libmp3lame
size=     419kB time=00:00:26.75 bitrate= 128.2kbits/s speed=52.9x    
video:0kB audio:418kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.083098%

Send a Request

Before we use python3 to send a request, we should install python3-pip first on our Compute Engine VM

sudo apt install python3-pip

and install google-cloud-speech package too

sudo pip3 install google-cloud-speech

Create and edit file :


and use the following code to send a transcription request to Speech-to-Text (python codes are provided by google).

#!/usr/bin/env python

# Copyright 2017 Google Inc. All Rights Reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.

"""Google Cloud Speech API sample that demonstrates how to select the model
used for speech recognition.
Example usage:
    python \
        resources/Google_Gnome.wav --model video
    python \
        gs://cloud-samples-tests/speech/Google_Gnome.wav --model video

import argparse

# [START speech_transcribe_model_selection]
def transcribe_model_selection(speech_file, model):
    """Transcribe the given audio file synchronously with
    the selected model."""
    from import speech

    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content =

    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(

    response = client.recognize(config=config, audio=audio)

    for i, result in enumerate(response.results):
        alternative = result.alternatives[0]
        print("-" * 20)
        print("First alternative of result {}".format(i))
        print(u"Transcript: {}".format(alternative.transcript))

# [END speech_transcribe_model_selection]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    parser.add_argument("path", help="File or GCS path for audio file to be recognized")
        help="The speech recognition model to use",
        choices=["command_and_search", "phone_call", "video", "default"],

    args = parser.parse_args()

    if args.path.startswith("gs://"):
        transcribe_model_selection_gcs(args.path, args.model)
        transcribe_model_selection(args.path, args.model)

On this post, I want to transcribe an intro of japanese song (early 00:00 to 00:26), so I set language_code="ja-JP" on this case.

config = speech.RecognitionConfig(

Run to start transcribe audio to text with this command :

python3 wego-head.mp3


After we run that command, we can see the output :

Speech to text GCP
ariq@instance-1:~$ python3 wego-head.mp3
First alternative of result 0
Transcript: かつてこの世の全てを手に入れた伝説の海賊王ゴールドロジャー彼の死に際に放った一言は人々を海へと駆り立てた俺の財宝か欲しけりゃくれてやる探せこの世の全てをそこに置いてきた人々はロマンを追い求める

I also have transcribe it too with Sonix AI

wego-head.mp4: Video automatically transcribed by Sonix

wego-head.mp4: this mp4 video file was automatically transcribed by Sonix with the best speech-to-text algorithms. This transcript may contain errors.


Sonix is the world’s most advanced automated transcription, translation, and subtitling platform. Fast, accurate, and affordable.

Automatically convert your mp4 files to text (txt file), Microsoft Word (docx file), and SubRip Subtitle (srt file) in minutes.

Sonix has many features that you'd love including automated subtitles, advanced search, upload many different filetypes, automated transcription, and easily transcribe your Zoom meetings. Try Sonix for free today.

Results between Speech-to-Text GCP and Sonix AI.

Speech-to-Text GCP Sonix AI
かつてこの世の全てを手に入れた伝説の海賊王ゴールドロジャー彼の死に際に放った一言は人々を海へと駆り立てた俺の財宝か欲しけりゃくれてやる探せこの世の全てをそこに置いてきた人々はロマンを追い求める かつてこの世のすべてを手に入れた伝説の海賊を、ゴールドロジャーは彼の死に際に放った。一言は人々を海へと駆り立てた往年の財宝欲しけりゃくれてやるんだ探すこの世のすべてをそこに置いてきた人々は、ロマンを追い求める様はまさに大海賊時代。

That's it.

Anyway thanks Google for provide me free credits on GCP,

and thanks Sonix AI for free 30 minutes too 😊🤙

Source :