Anyone who has attempted to build a web-based music or karaoke player quickly discovers a brutal engineering truth: displaying text on a screen is easy; displaying text in perfect, sub-millisecond synchronization with a dynamic audio track is an absolute nightmare.
At Tristan's Digital Lab, we encountered this exact challenge while developing the TYF MegaOke engine. We needed lyrics to highlight syllable-by-syllable, reacting instantly to tempo changes, file jumps, and browser lag. The solution didn't lie in modern JavaScript trickery, but rather in deeply understanding a protocol created over forty years ago: MIDI.
The setTimeout Fallacy
When junior developers first attempt audio synchronization, they almost always reach for JavaScript's native setTimeout or setInterval functions. The logic seems sound on the surface: "If the singer says 'Hello' at precisely 12.5 seconds into the audio track, I will just set a timeout for 12,500 milliseconds to trigger the CSS highlight."
In a controlled vacuum, this might work. In a web browser, it fails miserably for several reasons:
- The Event Loop: setTimeout does not guarantee execution at the specified time; it only guarantees the minimum delay before the callback is added to the event queue. If the browser is busy rendering a complex CSS animation or waiting on a fetch request, your "12.5-second" timeout might execute at 12.8 seconds.
- Cumulative Drift: If you chain timeouts to track timing across a 5-minute song, those tiny 5-millisecond delays compound. By the final chorus, your lyrics will be visually lagging an entire second behind the audio.
- Dynamic Playback: What if the user pauses the song? What if they seek backward 30 seconds? Managing hundreds of floating timeout instances and calculating offsets becomes a logistical nightmare.
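To see the scale of the drift problem, here is a back-of-the-envelope sketch. The 5 ms lag figure is a hypothetical average per chained callback, not a measured browser constant:

```javascript
// Sketch: how timer error compounds when each setTimeout callback
// schedules the next one relative to its own (late) execution time.
function cumulativeDriftMs(intervalMs, lagPerCallbackMs, songSeconds) {
  const callbacks = Math.floor((songSeconds * 1000) / intervalMs);
  return callbacks * lagPerCallbackMs;
}

// A 5-minute song driven by a chained 100 ms timer:
cumulativeDriftMs(100, 5, 300); // 3000 callbacks × 5 ms = 15000 ms adrift
```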
Understanding the MIDI Protocol and Ticks
To solve the timing problem, we have to look at the structure of the files we are playing. Standard MIDI files (.mid or karaoke .kar variants) do not contain recorded audio. They contain instructions: "Press Note C4", "Wait", "Release Note C4".
Timing in MIDI is not measured in milliseconds; it is measured in Ticks (or Pulses). Every MIDI file has a header that defines its PPQN (Pulses Per Quarter Note). If a file has a PPQN of 480, it means there are exactly 480 ticks in one quarter note beat. This is crucial because it makes the timing relative to the musical tempo, not absolute time.
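As a concrete illustration, converting ticks to wall-clock time needs only the PPQN and the current tempo. MIDI stores tempo as microseconds per quarter note (500,000 µs = 120 BPM); the function below is a minimal sketch of that arithmetic:

```javascript
// Sketch: one tick's duration is (tempo in µs per quarter) / PPQN.
function ticksToSeconds(ticks, ppqn, microsecondsPerQuarter) {
  const secondsPerTick = microsecondsPerQuarter / 1e6 / ppqn;
  return ticks * secondsPerTick;
}

// At PPQN 480 and 120 BPM, one quarter note (480 ticks) lasts ≈ 0.5 s:
ticksToSeconds(480, 480, 500000); // ≈ 0.5
```

Note that the same 480 ticks would last only 0.25 s at 240 BPM — the tick count is stable while the wall-clock mapping follows the tempo, which is exactly what makes tempo changes painless.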
Where the Lyrics Hide: Sysex and Meta Events
If MIDI files only contain instructions for playing notes, where do the lyrics come from? This is where the brilliance of the MIDI standard shines.
Sequencers and karaoke file creators embed textual data directly into the timeline using Meta Text events (and the related Meta Lyric event). The standard also allows System Exclusive (Sysex) messages, which let manufacturers inject custom, non-standard binary data into the MIDI stream. When creating a `.kar` file, every single syllable of a word is packaged into one of these text events and assigned a specific MIDI Tick.
For example, the data stream might look like this:
- Tick 0: [Meta Text] "Hel"
- Tick 240: [Note On] Play C4
- Tick 480: [Meta Text] "lo "
- Tick 480: [Note Off] Release C4
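Once parsed into JavaScript, that stream might be represented like this. The field names are illustrative, not any particular parser's schema:

```javascript
// A sketch of the parsed event timeline for the example above.
const events = [
  { tick: 0,   type: "text",    text: "Hel" },
  { tick: 240, type: "noteOn",  note: 60 }, // C4 = MIDI note 60
  { tick: 480, type: "text",    text: "lo " },
  { tick: 480, type: "noteOff", note: 60 },
];

// The karaoke layer only cares about the text events:
const syllables = events.filter((e) => e.type === "text");
// → [{ tick: 0, …, text: "Hel" }, { tick: 480, …, text: "lo " }]
```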
Enter SpessaSynth: Bridging the Gap
To extract this embedded data in the browser, we utilize the SpessaSynth parser. This engine does not just play the notes; it acts as a forensic tool, scanning the binary array of the file and extracting every single Sysex and Meta Text event, mapping them to their precise execution ticks.
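SpessaSynth's internals are its own, but the core idea can be sketched in a few dozen lines: walk a track chunk byte by byte, accumulate delta-time ticks, and collect Meta Text (0x01) and Lyric (0x05) events. This is a simplified stand-alone sketch, not SpessaSynth's actual code; running status is omitted for brevity:

```javascript
// Extract lyric/text meta events with absolute ticks from raw track bytes.
function extractLyricEvents(trackBytes) {
  const out = [];
  let i = 0;
  let tick = 0;

  // Delta times and meta lengths are variable-length quantities (VLQ):
  // 7 data bits per byte, high bit set on every byte except the last.
  function readVLQ() {
    let value = 0;
    let byte;
    do {
      byte = trackBytes[i++];
      value = (value << 7) | (byte & 0x7f);
    } while (byte & 0x80);
    return value;
  }

  while (i < trackBytes.length) {
    tick += readVLQ();
    const status = trackBytes[i++];

    if (status === 0xff) {
      // Meta event: type byte, VLQ length, then payload.
      const type = trackBytes[i++];
      const length = readVLQ();
      if (type === 0x01 || type === 0x05) {
        const text = String.fromCharCode(...trackBytes.slice(i, i + length));
        out.push({ tick, text });
      }
      if (type === 0x2f) break; // End of Track
      i += length;
    } else if (status === 0xf0 || status === 0xf7) {
      // Sysex: VLQ length, then payload — skipped here.
      i += readVLQ();
    } else {
      // Channel voice message: Program Change (0xC0) and Channel
      // Pressure (0xD0) carry one data byte, the rest carry two.
      const high = status & 0xf0;
      i += high === 0xc0 || high === 0xd0 ? 1 : 2;
    }
  }
  return out;
}

// The "Hel" / "lo " example from earlier, hand-encoded as track bytes:
const track = [
  0x00, 0xff, 0x01, 0x03, 0x48, 0x65, 0x6c,       // Δ0, Text "Hel"
  0x81, 0x70, 0x90, 0x3c, 0x64,                   // Δ240, Note On C4
  0x81, 0x70, 0xff, 0x01, 0x03, 0x6c, 0x6f, 0x20, // Δ240, Text "lo "
  0x00, 0x80, 0x3c, 0x40,                         // Δ0, Note Off C4
  0x00, 0xff, 0x2f, 0x00,                         // Δ0, End of Track
];
const lyricEvents = extractLyricEvents(track);
// → [{ tick: 0, text: "Hel" }, { tick: 480, text: "lo " }]
```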
Instead of relying on unstable JavaScript timers, our web application enters a highly controlled synchronization loop based on the Web Audio API.
The Synchronization Engine
The Web Audio API features its own hardware-backed clock, accessible via audioContext.currentTime. This clock is immune to main-thread JavaScript lag. It keeps marching forward based on the physical audio hardware of the device.
Our SpessaSynth integration constantly monitors this hardware clock and converts it back into MIDI Ticks based on the current tempo. In a high-speed requestAnimationFrame loop, the application asks a simple question: "What is the current MIDI Tick?"
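A minimal sketch of that clock-to-tick conversion, assuming a pre-extracted tempo map (entries sorted by start time, each giving the tempo in effect from that point on — the field names are illustrative):

```javascript
// Sketch: map elapsed audio seconds to an absolute MIDI tick,
// honoring tempo changes segment by segment.
function secondsToTicks(seconds, ppqn, tempoMap) {
  let ticks = 0;
  for (let k = 0; k < tempoMap.length; k++) {
    const { startSeconds, microsecondsPerQuarter } = tempoMap[k];
    const end = k + 1 < tempoMap.length ? tempoMap[k + 1].startSeconds : Infinity;
    const span = Math.min(seconds, end) - startSeconds;
    if (span <= 0) break;
    // Quarter notes elapsed in this segment × ticks per quarter note.
    ticks += (span * 1e6 / microsecondsPerQuarter) * ppqn;
  }
  return Math.floor(ticks);
}

// 120 BPM throughout: 2 quarter notes per second → 960 ticks/s at PPQN 480.
secondsToTicks(4.6875, 480, [
  { startSeconds: 0, microsecondsPerQuarter: 500000 },
]); // → 4500
```

In the browser, `seconds` would come from `audioContext.currentTime` minus the time at which playback started.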
If the engine says we are at Tick 4500, we simply look up our pre-parsed array of syllables. If the syllable "world" is scheduled for Tick 4500, we highlight it. If the user seeks the audio player to Tick 8000, the parser instantly highlights everything up to Tick 8000. There are no timeouts to cancel, no drift to calculate, and no asynchronous nightmares.
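The per-frame lookup itself reduces to a binary search over the pre-parsed syllable array. This sketch shows the idea, with the browser-side loop indicated in comments (`audioContextToTick` and `highlightUpTo` are illustrative names, not a real API):

```javascript
// Find the index of the last syllable whose tick is <= currentTick.
// The answer is the same whether playback advanced one frame or the
// user just seeked half the song — no timers to cancel or reschedule.
function activeSyllableIndex(syllables, currentTick) {
  let lo = 0, hi = syllables.length - 1, best = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (syllables[mid].tick <= currentTick) {
      best = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best; // -1 before the first syllable
}

// In the browser, the loop would look roughly like this:
// function frame() {
//   const tick = audioContextToTick(audioContext.currentTime);
//   highlightUpTo(activeSyllableIndex(syllables, tick));
//   requestAnimationFrame(frame);
// }
// requestAnimationFrame(frame);
```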
The Client-Side Advantage
This strict, tick-based architecture is what makes TYF MegaOke feel like a piece of native hardware rather than a clunky web page. By understanding the low-level binary structure of MIDI files and syncing them against hardware-level audio clocks, we completely bypass the limitations of the JavaScript event loop.
Real-time engineering in the browser is entirely possible, provided you respect the constraints of the environment and build your architecture around deterministic data, not floating timers.