Audio Stem Separator V 1.5

Drag, Drop, Separate. Professional AI separation.

Unlimited & 100% Free

This is a custom open-source frontend powered by Demucs.

No Paywalls • No File Limits • Professional Quality
📂

Drag & Drop Audio File

or click here to browse files

Preparing...
⚠️ Note: Processing time depends on your quality selection. Please keep this tab open until the download appears.

Separated Stems

⚠️ Note: Please wait 1-2 minutes for waveforms to load before downloading.

Behind the Technology

The journey to build this tool started out of frustration with existing services. Most free stem separators put limits on file size, force you to pay for high-quality downloads, or throttle your speed. I wanted to democratize access to state-of-the-art (SOTA) audio source separation by leveraging open-source technology and running it on my own hardware: an Intel iMac acting as a personal server.

At the core of this platform is Demucs, a deep learning architecture developed by Meta Research. Unlike traditional EQ-based isolation, Demucs doesn't just cut frequencies; it reconstructs ("dreams") the missing parts of each source's waveform. It uses a U-Net architecture combined with BiLSTM (Bidirectional Long Short-Term Memory) layers. The model has been trained on a large library of studio-quality stems, allowing it to tell the sonic texture of a voice from that of a violin, even when they occupy the same frequency range.

```mermaid
graph LR
    A[Input Audio Mix] -->|Encoder| B(Spectral Analysis)
    B --> C{U-Net Core}
    C -->|BiLSTM Layers| D[Pattern Recognition]
    D -->|Decoder| E[Waveform Synthesis]
    E --> F((Vocals))
    E --> G((Drums))
    E --> H((Bass))
    E --> I((Other))
    style A fill:#252a40,stroke:#00f2ff,color:#fff
    style C fill:#0f111a,stroke:#ff0055,color:#fff
    style F fill:#252a40,stroke:#00ff9d,color:#fff
    style G fill:#252a40,stroke:#00ff9d,color:#fff
    style H fill:#252a40,stroke:#00ff9d,color:#fff
    style I fill:#252a40,stroke:#00ff9d,color:#fff
```
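If you want to reproduce this step on your own machine, the sketch below shows one way to drive Demucs from Python by shelling out to its command-line tool. The `separate` helper and the output-path layout are illustrative assumptions; check the Demucs README for the exact flags and folder structure of the version you install.

```python
import subprocess
from pathlib import Path

def separate(input_path: str, out_dir: str = "separated", model: str = "htdemucs") -> list[Path]:
    """Run the Demucs CLI on one file and return the paths of the separated stems."""
    subprocess.run(
        ["demucs", "-n", model, "-o", out_dir, input_path],
        check=True,                      # raise if separation fails
    )
    # Demucs writes <out_dir>/<model>/<track name>/<stem>.wav by default.
    stem_dir = Path(out_dir) / model / Path(input_path).stem
    return sorted(stem_dir.glob("*.wav"))

# "Ultra Quality" simply swaps in the fine-tuned weights:
# stems = separate("song.mp3", model="htdemucs_ft")
```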

We offer two distinct models. "Speed Mode" uses the standard htdemucs (Hybrid Transformer Demucs) model. It balances speed and accuracy by using a transformer attention mechanism to handle long-range dependencies in the music, like a consistent drum beat or a repetitive bassline. For audiophiles, however, we also implemented the htdemucs_ft (Fine-Tuned) model.

The Fine-Tuned model is significantly heavier computationally. It takes the base knowledge of the standard model and refines it with an extended training dataset focused on minimizing "bleed" (e.g., hearing the hi-hats in the vocal track). When you select "Ultra Quality," the backend switches to this heavier neural network, utilizing 32-bit floating point precision internally before encoding down to CD-quality 16-bit WAV files for your download.
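As a rough illustration of that last encoding step, here is a sketch of writing a float32 stem out as a 16-bit WAV with the soundfile library; `encode_stem` is a hypothetical helper name, not part of Demucs itself.

```python
import numpy as np
import soundfile as sf

def encode_stem(stem: np.ndarray, path: str, sample_rate: int = 44100) -> None:
    """stem: float32 array shaped (samples, channels), nominally in [-1.0, 1.0]."""
    stem = np.clip(stem, -1.0, 1.0)                       # guard against inter-sample overs
    sf.write(path, stem, sample_rate, subtype="PCM_16")   # CD-quality 16-bit WAV

# encode_stem(vocals_float32, "vocals.wav")
```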

To make this available securely over the public internet without paying for cloud GPUs, I use Tailscale Funnel. It creates an encrypted tunnel from your device to the Python Flask server running on the iMac. Your audio travels through this secure pipeline, is processed in RAM to protect your privacy (no files are permanently stored), and the separated stems are zipped and sent back to you as soon as processing finishes.
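Below is a minimal sketch of what such an endpoint can look like with Flask: the upload is kept in memory, handed to the separator, and the stems are zipped into an in-memory buffer before being returned. The route name, form fields, and the `run_separation` helper are assumptions for illustration, not the site's actual code.

```python
import io
import zipfile
from flask import Flask, request, send_file

app = Flask(__name__)

def run_separation(audio_bytes: bytes, model: str) -> dict[str, bytes]:
    """Placeholder for the Demucs call sketched earlier; maps stem name -> WAV bytes."""
    raise NotImplementedError

@app.route("/separate", methods=["POST"])
def separate_endpoint():
    upload = request.files["audio"]                        # the uploaded mix
    quality = request.form.get("quality", "speed")
    model = "htdemucs_ft" if quality == "ultra" else "htdemucs"

    stems = run_separation(upload.read(), model)

    buffer = io.BytesIO()                                  # zip lives in RAM, never on disk
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, wav_bytes in stems.items():
            archive.writestr(f"{name}.wav", wav_bytes)
    buffer.seek(0)
    return send_file(buffer, mimetype="application/zip",
                     as_attachment=True, download_name="stems.zip")
```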

```mermaid
sequenceDiagram
    participant User as User Device
    participant Cloud as Tailscale Tunnel
    participant Server as iMac Backend
    participant AI as Demucs Engine
    User->>Cloud: Upload Audio (Encrypted)
    Cloud->>Server: Route to Flask
    Server->>AI: Load Audio to RAM
    AI->>AI: Inference (htdemucs_ft)
    AI-->>Server: 4 Raw Stem Streams
    Server->>Server: Encode WAV & Zip
    Server-->>User: Download Final Zip
```
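For completeness, a matching client-side sketch: it posts a file to the hypothetical endpoint from the server sketch above and saves the returned zip. The URL and field names are assumptions carried over from that sketch, not the live site's API.

```python
import requests

FUNNEL_URL = "https://imac.example.ts.net/separate"   # hypothetical Tailscale Funnel address

with open("song.mp3", "rb") as f:
    response = requests.post(
        FUNNEL_URL,
        files={"audio": f},
        data={"quality": "ultra"},
        timeout=600,   # the fine-tuned model can take several minutes
    )
response.raise_for_status()

with open("stems.zip", "wb") as out:
    out.write(response.content)
```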

Deep Learning Architecture: The U-Net & BiLSTM

To achieve such clean separation, Demucs employs a specialized U-Net topology. Imagine a "U" shape: on the left side (the Encoder), the audio spectrogram is progressively downsampled, compressing the complex waveform into high-level features. On the right side (the Decoder), it attempts to reconstruct the specific stems from those features. The magic lies in the Skip Connections (represented by dotted lines below). These connections pass high-resolution details from the start of the process directly to the end, bypassing the compression bottleneck. This ensures that the output vocals remain crisp and the drums retain their transient punch, rather than sounding "muddy" or over-processed.
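To make the skip-connection idea concrete, here is a toy PyTorch encoder/decoder, far smaller than the real Demucs network, in which each decoder stage receives the matching encoder output in addition to the upsampled features.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net illustrating skip connections; NOT the real Demucs architecture."""

    def __init__(self, channels=(2, 16, 32, 64)):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(channels[i], channels[i + 1], kernel_size=8, stride=4, padding=2)
            for i in range(len(channels) - 1)
        )
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(channels[i + 1], channels[i], kernel_size=8, stride=4, padding=2)
            for i in reversed(range(len(channels) - 1))
        )

    def forward(self, x):
        skips = []
        for down in self.down:                  # encoder: compress, remember each level
            x = torch.relu(down(x))
            skips.append(x)
        # (in the real model, the BiLSTM bottleneck would process x here; see below)
        for i, up in enumerate(self.up):        # decoder: reinject encoder detail, then upsample
            x = x + skips.pop()                 # skip connection (Demucs sums; other U-Nets concatenate)
            x = up(x)
            if i < len(self.up) - 1:
                x = torch.relu(x)
        return x

# mix = torch.randn(1, 2, 4096)   # (batch, stereo channels, samples)
# print(TinyUNet()(mix).shape)    # torch.Size([1, 2, 4096])
```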

Sandwiched between the encoder and decoder is the BiLSTM (Bidirectional Long Short-Term Memory) core. While standard convolutional layers are great at analyzing sound textures (like the timbre of a guitar), they struggle with time and rhythm. The BiLSTM analyzes the song chronologically—scanning both forwards and backwards simultaneously. This allows the AI to understand musical context, keeping the tempo consistent and distinguishing between a rhythmic drum beat and a random noise artifact, effectively "listening" to the song structure before making separation decisions.

```mermaid
graph TD
    subgraph Encoder ["📉 Encoder (Compression)"]
        I[Input Mix] --> C1[Conv Layer 1]
        C1 --> C2[Conv Layer 2]
        C2 --> C3[Conv Layer 3]
        C3 --> C4[Conv Layer 4]
    end
    subgraph Bottleneck ["🧠 BiLSTM Core (Time Context)"]
        C4 --> B[Forward/Backward Analysis]
    end
    subgraph Decoder ["📈 Decoder (Reconstruction)"]
        B --> U4[UpSample 4]
        U4 --> U3[UpSample 3]
        U3 --> U2[UpSample 2]
        U2 --> U1[Output Stems]
    end
    %% Skip Connections (The Secret Sauce)
    C1 -.->|Skip Connection| U1
    C2 -.->|Skip Connection| U2
    C3 -.->|Skip Connection| U3
    C4 -.->|Skip Connection| U4
    style Encoder fill:#1a1d2d,stroke:#ff0055,stroke-width:2px
    style Decoder fill:#1a1d2d,stroke:#00f2ff,stroke-width:2px
    style Bottleneck fill:#252a40,stroke:#ffb700,stroke-width:2px
    style I fill:#fff,color:#000
    style U1 fill:#fff,color:#000
```
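And a sketch of the bottleneck itself: a bidirectional LSTM reads the compressed feature sequence forwards and backwards, and a linear layer folds the two directions back to the original channel count so the decoder sees the same shape the encoder produced. Again, this is an illustrative toy, not the actual Demucs module.

```python
import torch
import torch.nn as nn

class BiLSTMBottleneck(nn.Module):
    """Toy time-context core: bidirectional LSTM over the encoder's feature sequence."""

    def __init__(self, channels: int, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=channels, hidden_size=channels,
            num_layers=layers, bidirectional=True, batch_first=True,
        )
        self.proj = nn.Linear(2 * channels, channels)   # merge forward + backward passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) as produced by the encoder
        seq = x.permute(0, 2, 1)          # -> (batch, time, channels)
        seq, _ = self.lstm(seq)           # -> (batch, time, 2 * channels)
        seq = self.proj(seq)              # -> (batch, time, channels)
        return seq.permute(0, 2, 1)       # -> (batch, channels, time)

# features = torch.randn(1, 64, 64)              # e.g. TinyUNet's deepest features
# print(BiLSTMBottleneck(64)(features).shape)    # torch.Size([1, 64, 64])
```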