Audio Stem Separator v1.5
Drag, Drop, Separate. Professional AI separation.
Unlimited & 100% Free
This is a custom open-source frontend powered by Demucs.
No Paywalls • No File Limits • Professional Quality
Drag & Drop Audio File
or click here to browse files
Separated Stems
⚠️ Note: Please wait 1-2 minutes for waveforms to load before downloading.
TIP: Add to Home Screen to install as an app.
Behind the Technology
The journey to build this tool started out of frustration with existing services. Most free stem separators cap file sizes, charge for high-quality downloads, or throttle processing speed. I wanted to democratize access to state-of-the-art (SOTA) audio separation by leveraging open-source technology and running it on my own hardware: an Intel iMac acting as a personal server.
At the core of this platform is Demucs, a deep learning architecture developed by Meta AI Research. Unlike traditional EQ-based isolation, Demucs doesn't just cut frequencies; it "dreams" the missing parts of the waveform. It uses a U-Net architecture combined with BiLSTM (Bidirectional Long Short-Term Memory) layers, and it has been trained on large datasets of studio-quality stems, allowing it to understand the sonic texture of a voice versus a violin, even when they occupy the same frequency range.
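For the technically curious, here is a minimal sketch of what one separation pass looks like using the demucs Python package's helpers (the filename is hypothetical, and stereo input is assumed):

```python
# Minimal sketch: separating a track with the demucs package.
# Assumes `pip install demucs torchaudio`; "song.wav" is a placeholder.
import torchaudio
from demucs.apply import apply_model
from demucs.pretrained import get_model

model = get_model("htdemucs")            # standard hybrid transformer model
wav, sr = torchaudio.load("song.wav")    # (channels, samples), stereo expected

# Demucs models run at a fixed sample rate (44.1 kHz); resample if needed.
if sr != model.samplerate:
    wav = torchaudio.functional.resample(wav, sr, model.samplerate)

# apply_model takes (batch, channels, samples) and returns
# (batch, sources, channels, samples).
sources = apply_model(model, wav[None])[0]

# model.sources is typically ["drums", "bass", "other", "vocals"].
for name, stem in zip(model.sources, sources):
    torchaudio.save(f"{name}.wav", stem, model.samplerate)
```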
We offer two distinct models. "Speed Mode" uses the standard htdemucs (Hybrid Transformer) model, which balances speed and accuracy by using transformer attention to handle long-range dependencies in the music, like a consistent drum beat or a repetitive bassline. For audiophiles, however, we also offer the htdemucs_ft (fine-tuned) model.
The Fine-Tuned model is significantly heavier computationally. It takes the base knowledge of the standard model and refines it with an extended training dataset focused on minimizing "bleed" (e.g., hearing the hi-hats in the vocal track). When you select "Ultra Quality," the backend switches to this heavier neural network, utilizing 32-bit floating point precision internally before encoding down to CD-quality 16-bit WAV files for your download.
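Concretely, switching quality modes comes down to loading a different checkpoint and choosing the output encoding. Here is a hedged sketch of how a backend might do this (the function names and the ultra_quality flag are illustrative, not the actual server code):

```python
import torchaudio
from demucs.pretrained import get_model

def load_separator(ultra_quality: bool):
    # "Speed Mode" -> htdemucs; "Ultra Quality" -> htdemucs_ft.
    # htdemucs_ft bags four per-stem fine-tuned models, so separation
    # takes roughly four times longer than the standard model.
    return get_model("htdemucs_ft" if ultra_quality else "htdemucs")

def export_stem(path: str, stem, samplerate: int = 44100):
    # Stems are float32 tensors internally; encode down to
    # CD-quality 16-bit PCM WAV for the download bundle.
    torchaudio.save(path, stem, samplerate,
                    encoding="PCM_S", bits_per_sample=16)
```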
To make this available securely over the public internet without expensive cloud GPU costs, I utilized Tailscale Funnel. This creates an encrypted tunnel from your device directly to my local Python Flask server. Your audio travels through this secure pipeline, gets processed in RAM to protect your privacy (no files are permanently stored), and the separated stems are zipped and sent back to you instantly.
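In outline, the server side can stay remarkably small. The sketch below is hypothetical (the route, port, and separate_to_stems helper are placeholders, not the real backend), but it shows the in-RAM zip-and-return pattern with Flask:

```python
# Hypothetical sketch of the in-memory pipeline; the route and the
# separate_to_stems() helper are placeholders, not the real backend.
import io
import zipfile
from flask import Flask, request, send_file

app = Flask(__name__)

def separate_to_stems(audio: io.BytesIO) -> dict[str, bytes]:
    """Placeholder for the Demucs call sketched earlier; would return
    {"vocals.wav": b"...", "drums.wav": b"...", ...}."""
    raise NotImplementedError

@app.route("/separate", methods=["POST"])
def separate():
    upload = request.files["audio"]
    audio = io.BytesIO(upload.read())      # kept in RAM, never written to disk

    stems = separate_to_stems(audio)

    # Zip the stems entirely in memory and stream the archive back.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for filename, data in stems.items():
            zf.writestr(filename, data)
    buffer.seek(0)
    return send_file(buffer, mimetype="application/zip",
                     as_attachment=True, download_name="stems.zip")

# Publishing the local server is then a single command on the host,
# roughly: `tailscale funnel 5000`, which proxies encrypted public
# traffic to localhost:5000.
```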
Deep Learning Architecture: The U-Net & BiLSTM
To achieve such clean separation, Demucs employs a specialized U-Net topology. Imagine a "U" shape: on the left side (the encoder), the audio spectrogram is progressively downsampled, compressing the complex waveform into high-level features. On the right side (the decoder), the network reconstructs the individual stems from those features. The magic lies in the skip connections, which pass high-resolution details from the start of the process directly to the end, bypassing the compression bottleneck. This ensures that the output vocals remain crisp and the drums retain their transient punch, rather than sounding "muddy" or over-processed.
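To make the topology concrete, here is a toy PyTorch sketch (not Demucs's actual code; layer sizes are illustrative). It shows the encoder/decoder mirror, the skip connections, and the BiLSTM bottleneck described in the next paragraph:

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Toy illustration of the Demucs topology: a 1-D convolutional
    encoder/decoder with skip connections and a BiLSTM bottleneck.
    Channel sizes are illustrative, not the real model's."""

    def __init__(self, channels=(2, 16, 32, 64)):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        self.encoder = nn.ModuleList(
            [nn.Conv1d(cin, cout, kernel_size=8, stride=4) for cin, cout in pairs])
        self.decoder = nn.ModuleList(
            [nn.ConvTranspose1d(cout, cin, kernel_size=8, stride=4)
             for cin, cout in reversed(pairs)])
        hidden = channels[-1]
        # The BiLSTM scans the compressed sequence forwards and backwards
        # to capture long-range structure (tempo, repeating patterns).
        self.lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, x):                      # x: (batch, 2, samples)
        skips = []
        for enc in self.encoder:
            x = torch.relu(enc(x))
            skips.append(x)                    # saved for the matching decoder layer
        y, _ = self.lstm(x.permute(0, 2, 1))   # (batch, time, channels)
        x = self.proj(y).permute(0, 2, 1)
        for i, dec in enumerate(self.decoder):
            skip = skips.pop()
            n = min(x.shape[-1], skip.shape[-1])
            x = x[..., :n] + skip[..., :n]     # skip connection: reinject detail
            x = dec(x)
            if i < len(self.decoder) - 1:      # no activation on the final waveform
                x = torch.relu(x)
        return x                               # a real model emits one waveform per stem
```

Feeding torch.rand(1, 2, 44100) through ToyUNet() returns a slightly shorter waveform; the real Demucs pads its input so that encoder and decoder lengths line up exactly.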
Sandwiched between the encoder and decoder is the BiLSTM (Bidirectional Long Short-Term Memory) core. While standard convolutional layers are great at analyzing sound textures (like the timbre of a guitar), they struggle with time and rhythm. The BiLSTM analyzes the song chronologically, scanning both forwards and backwards simultaneously. This allows the AI to understand musical context, keeping the tempo consistent and distinguishing between a rhythmic drum beat and a random noise artifact, effectively "listening" to the song structure before making separation decisions.
Note: To use this service, please disable ad blockers. Future updates will restrict access for browsers using ad blockers.