KitchenSink Audio Documentation

Welcome to the official documentation for KitchenSink Audio. This library provides a simple, modular framework for building audio processing pipelines.

Core Concepts

The library is built around two fundamental concepts: Sources and Sinks.

  • A Source is where audio comes from. This could be a local microphone (LineInAudioSource), a network connection (like TCPServerAudioSource), or any other origin of audio data.

  • A Sink is where audio goes to. This could be your speakers (AudioPlayerSink), a network connection (like TCPClientAudioSink), or a file on disk.

You connect them by passing a sink’s push_chunk method as the sink argument when creating a source. The source then calls this method to send its audio data to the sink, creating a pipeline.
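For example, a microphone could be fed directly to the speakers. The following is a minimal sketch: the class names, the sink argument, and the push_chunk method follow the conventions above, but any other constructor arguments (device selection, sample rate, and so on) are omitted, and starting or running the source is not shown.

# The sink is created first; its push_chunk method becomes the source's sink.
speaker = AudioPlayerSink()
mic = LineInAudioSource(sink=speaker.push_chunk)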

Middleware and Pipelines

To process or analyze audio between a source and a final sink, you can create “middleware” components. A middleware component is simply a class that acts as both a sink and a source.

  1. It accepts a sink in its constructor, just like a real source.

  2. It has a push_chunk method, just like a real sink.

When its push_chunk method is called, it can perform an action on the audio data (e.g., measure volume, apply an effect, log data) and then pass the chunk along to the next sink in the chain.

This allows you to build complex pipelines:

[Mic Source] -> [Volume Monitor Middleware] -> [Network Sink]
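A minimal sketch of the volume-monitor middleware from this diagram is shown below. The VolumeMonitor class name, the use of NumPy, and the assumption that chunks arrive as NumPy float arrays are purely illustrative; the push_chunk method and the sink constructor argument follow the conventions described above, and the network sink's own constructor arguments are omitted.

import numpy as np

class VolumeMonitor:
    """Illustrative middleware: measures each chunk's volume, then forwards it."""

    def __init__(self, sink):
        # The next stage in the pipeline, e.g. a real sink's push_chunk method.
        self._sink = sink

    def push_chunk(self, audio_chunk):
        # Measure the RMS level of the chunk (assumes a NumPy float array).
        rms = np.sqrt(np.mean(np.square(audio_chunk)))
        print(f"RMS level: {rms:.4f}")
        # Pass the unmodified chunk along to the next sink in the chain.
        self._sink(audio_chunk)

# Wiring the pipeline from the diagram (network sink arguments omitted):
network_sink = TCPClientAudioSink()
monitor = VolumeMonitor(sink=network_sink.push_chunk)
mic = LineInAudioSource(sink=monitor.push_chunk)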

Consuming Chunks

The callable you provide as the sink does not have to be the push_chunk method of an actual BaseAudioSink object. It can be any function or method that can process a chunk of audio data.

This is useful when the audio stream is not destined for another output, but is instead consumed for analysis. For example, you could have a WebSocket source feed audio chunks directly to a speech-to-text engine:

def speech_to_text_engine(audio_chunk):
    # Process the audio, get transcription...
    transcription = my_stt_library.process(audio_chunk)
    # ...then do something with the text.
    if transcription:
        print(f"Heard: {transcription}")

# The STT function is the "sink" for the audio source.
ws_source = TypedWebSocketServerAudioSource(sink=speech_to_text_engine)

This pattern allows you to use the sources in this library as a generic way to receive audio for any purpose.

Blocksize and Latency

The blocksize parameter, available in most sources and sinks, defines the number of audio frames per chunk. This is a key parameter for controlling latency and performance.

  • In Sources: It determines how frequently the source will generate and push audio chunks to the sink.

  • In Sinks: It serves as a hint to the source about the preferred chunk size for optimal processing (e.g., matching the buffer size of the audio output device).

A smaller blocksize reduces latency but increases the overhead of function calls and network packets. A larger blocksize is more efficient but introduces more delay. The ideal value depends on the application’s requirements.
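As a rough guide, one chunk covers blocksize / samplerate seconds of audio, which is the minimum delay added by buffering a single chunk. The short calculation below illustrates the trade-off; the 48 kHz sample rate is assumed purely for the example.

# Duration of audio covered by one chunk at an assumed 48 kHz sample rate.
samplerate = 48_000
for blocksize in (128, 1024, 4096):
    chunk_ms = blocksize / samplerate * 1000
    print(f"blocksize={blocksize:5d} -> {chunk_ms:5.1f} ms per chunk")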
