Tutorial: How to Install Whisper AI

Whisper AI is a powerful speech-to-text model by OpenAI that allows for high-quality transcription. This guide walks you through the step-by-step installation process.


Step 1: Install Python

Whisper AI requires Python to run.

  • Download Python from python.org.
  • Ensure you install Python 3.8 or later (Whisper supports up to Python 3.11).
  • During installation, check the box to add Python to PATH.
  • Verify installation by running: python --version (on macOS and Linux the command may be python3 --version)
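
The same check can be done from inside Python. The following is a minimal sketch that assumes the 3.8 to 3.11 range stated in this guide; the function name is_supported is illustrative and not part of Whisper.

```python
import sys

# Supported range as stated in this guide (assumption: 3.8 through 3.11 inclusive).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version falls inside the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


print("Python", sys.version.split()[0], "supported:", is_supported())
```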

Step 2: Install PyTorch

Whisper AI depends on PyTorch for deep learning functionalities.

  • Visit pytorch.org and follow the instructions for your system.
  • Example installation command: pip install torch torchvision torchaudio
  • Verify installation: python -c "import torch; print(torch.__version__)"

Step 3: Install Chocolatey (Windows Users Only)

Chocolatey is a package manager for Windows.

  • Open PowerShell as an administrator and run: Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
  • Verify installation: choco --version

Step 4: Install FFmpeg

FFmpeg is required for handling audio files.

  • Windows (via Chocolatey): choco install ffmpeg
  • macOS: brew install ffmpeg
  • Linux (Ubuntu/Debian): sudo apt update && sudo apt install ffmpeg
  • Verify installation: ffmpeg -version
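
If you prefer to check for FFmpeg from a script, the sketch below looks the tool up on PATH and reads the first line of its version output. The helper name tool_version is hypothetical; it uses only the Python standard library.

```python
import shutil
import subprocess


def tool_version(name, flag="-version"):
    """Return the first line of `<name> <flag>` output, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True)
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


version = tool_version("ffmpeg")
print(version if version is not None else "ffmpeg not found on PATH")
```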

Step 5: Install Whisper AI

  • Run the installation command: pip install -U openai-whisper
  • Alternatively, install directly from GitHub for the latest version: pip install git+https://github.com/openai/whisper.git
  • To update Whisper AI: pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
  • Verify installation: whisper --help
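
To confirm which versions pip actually installed, a short sketch using the standard importlib.metadata module may help. The helper name installed_version is illustrative; the distribution names are the ones used in the pip commands in this guide.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for dist in ("torch", "openai-whisper"):
    print(dist, installed_version(dist) or "not installed")
```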

Step 6: Run a Test Transcription

To check if Whisper AI is working, run:

whisper example.mp3 --model small

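This prints the transcript of example.mp3 to the terminal and, by default, also writes it to the current directory in several formats (such as .txt, .srt, and .vtt). The first run downloads the selected model, so it can take longer.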
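
Whisper can also be called from Python instead of the command line. The sketch below follows the usage shown in the Whisper README (load_model, then transcribe); the wrapper function transcribe_file is a hypothetical name, and example.mp3 is assumed to exist in the current directory.

```python
def transcribe_file(path, model_name="small"):
    """Transcribe one audio file and return the recognized text."""
    import whisper  # imported lazily so this module loads even without Whisper installed

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # a dict; "text" holds the full transcript
    return result["text"]
```

Calling print(transcribe_file("example.mp3")) should print the transcript once Whisper and FFmpeg are installed.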


Additional Features

  • Use Different Models: Whisper supports multiple model sizes (tiny, base, small, medium, large). Larger models are generally more accurate but slower and need more memory. Example: whisper example.mp3 --model medium
  • Transcribe Multiple Files: whisper file1.mp3 file2.mp3
  • Specify Language: whisper example.mp3 --language English
  • Translate Non-English Speech to English: whisper example.mp3 --task translate
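
When these options are combined in a script, it can help to build the command as a list and run it with subprocess. This is a rough sketch: the function names are hypothetical, and only the flags listed above (--model, --language, --task) are used.

```python
import subprocess


def build_whisper_command(files, model="small", language=None, task=None):
    """Assemble the argument list for the whisper CLI from the options shown above."""
    if not files:
        raise ValueError("at least one audio file is required")
    command = ["whisper", *files, "--model", model]
    if language:
        command += ["--language", language]
    if task:
        command += ["--task", task]
    return command


def run_whisper(files, **options):
    """Run the whisper CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_whisper_command(files, **options), check=True)


print(build_whisper_command(["file1.mp3", "file2.mp3"], model="medium", task="translate"))
```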

CUDA Compatibility for GPU Acceleration

If you have an NVIDIA GPU, you can speed up transcription with CUDA:

  • Install CUDA-compatible PyTorch by selecting the correct version from pytorch.org.
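  • Install a current NVIDIA driver from NVIDIA. The CUDA-enabled PyTorch builds from pytorch.org typically bundle the CUDA runtime they need, so a separate CUDA Toolkit install is usually only required for other tools.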
  • Run Whisper using CUDA: whisper example.mp3 --model large --device cuda
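
Before passing --device cuda, it is worth checking that PyTorch can actually see the GPU. A minimal sketch, assuming PyTorch was installed as in Step 2; pick_device is an illustrative helper name.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is missing, so there is no GPU support to use
    return "cuda" if torch.cuda.is_available() else "cpu"


print("whisper example.mp3 --model large --device", pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed PyTorch build is probably a CPU-only one.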

Congratulations! 🎉 You have successfully installed and set up Whisper AI for transcription. For further details, visit the Whisper GitHub repository.

FFmpeg Command Breakdown for Extracting Slides from a PowerPoint Video

Command:

ffmpeg -i "PresentationVideo.mp4" -filter_complex "select=gt(scene\,0.2)" -vsync vfr "slides/%04d.jpg"

Explanation of Each Component:

1. ffmpeg

  • The command-line tool used for video and audio processing.

2. -i "PresentationVideo.mp4"

  • Specifies the input video file: PresentationVideo.mp4 (the recorded PowerPoint or slide presentation).

3. -filter_complex "select=gt(scene\,0.2)"

  • -filter_complex: Defines a filtergraph. This graph has a single video input and a single output, so the simpler -vf option would work as well.
  • select=gt(scene,0.2):
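    • Uses the select filter to keep only the frames where a significant slide transition occurs.
    • scene is a value between 0 and 1 that the select filter computes for each frame; a higher value means the frame differs more from the previous one and is more likely to start a new scene.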
    • gt(scene,0.2):
      • gt() (greater than) selects frames where the scene change metric exceeds 20% (0.2).
      • This ensures that only major slide transitions are captured, avoiding minor visual changes.

4. "slides/%04d.jpg"

  • Saves the extracted slides as .jpg images in the slides/ directory. Create this directory before running the command; ffmpeg does not create it and will fail with an error if it is missing.
  • %04d ensures images are numbered sequentially (e.g., 0001.jpg, 0002.jpg).

5. -vsync vfr

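  • Passes the selected frames through with their original timestamps (variable frame rate, VFR) instead of duplicating frames to hold a constant frame rate. Without it, ffmpeg may fill the gaps between selected frames with duplicates, producing many identical images. Newer FFmpeg releases deprecate -vsync in favor of -fps_mode vfr.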
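
The components above can be put together in a small script that also creates the output directory. This is a sketch under stated assumptions: the function names are hypothetical, and -vsync vfr is placed before the output path because ffmpeg applies each option to the next file named on the command line.

```python
import os
import subprocess


def build_slide_command(video, out_dir="slides", threshold=0.2):
    """Build the ffmpeg argument list for scene-change frame extraction."""
    if not 0 < threshold < 1:
        raise ValueError("threshold should be between 0 and 1")
    pattern = os.path.join(out_dir, "%04d.jpg")
    return [
        "ffmpeg", "-i", video,
        # The comma is escaped so ffmpeg does not read it as a filter separator.
        "-filter_complex", f"select=gt(scene\\,{threshold})",
        "-vsync", "vfr",  # output option, so it goes before the output path
        pattern,
    ]


def extract_slides(video, out_dir="slides", threshold=0.2):
    """Create the output directory if needed, then run ffmpeg."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_slide_command(video, out_dir, threshold), check=True)
```

Calling extract_slides("PresentationVideo.mp4") would run the same extraction as the command above, assuming ffmpeg is on PATH.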

Effect of Changing the Scene Threshold (scene value)

Scene Change Threshold | Effect
0.05 (5%)              | Many frames extracted, including minor slide changes (animations, small transitions).
0.1 (10%)              | Fewer frames, capturing most major slide changes.
0.2 (20%)              | Extracts only clear slide transitions, ignoring minor visual changes.
0.3 (30%)              | Very few frames, capturing only the most drastic slide transitions.

Example with a Lower Scene Change Value

ffmpeg -i "PresentationVideo.mp4" -filter_complex "select=gt(scene\,0.1)" -vsync vfr "slides/%04d.jpg"
  • This command captures more frequent slide changes, useful if the slides have animations or frequent minor transitions.

Key Takeaways:

  • Lower scene values (e.g., 0.05–0.1) → Extract more frames, including small slide changes.
  • Higher scene values (e.g., 0.2–0.3) → Extract fewer frames, focusing on major slide transitions.
  • Adjust the value based on the type of PowerPoint presentation (animated vs. static slides).
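
To see how the threshold changes the number of extracted frames, the toy sketch below applies the same greater-than test to a list of scene scores. The scores are invented for illustration and do not come from a real video; count_selected is a hypothetical helper, not an FFmpeg function.

```python
def count_selected(scores, threshold):
    """Count frames whose scene score is strictly greater than the threshold, like gt()."""
    return sum(1 for score in scores if score > threshold)


# Hypothetical per-frame scene scores, invented for illustration only.
scores = [0.01, 0.07, 0.02, 0.15, 0.03, 0.45, 0.12, 0.25, 0.04, 0.31]

for threshold in (0.05, 0.1, 0.2, 0.3):
    print(f"threshold {threshold}: {count_selected(scores, threshold)} frames kept")
```

With these made-up scores, raising the threshold from 0.05 to 0.3 reduces the kept frames from 6 to 2.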