Streaming Architecture

ApiTap is designed for high-performance, memory-efficient streaming. This document explains the internal Factory Pattern that allows DataFusion to execute SQL over HTTP streams without redundant requests.

Architecture Overview

[Diagram: pipeline stages labeled HTTP, Source, Buffer, Factory, Exec, Engine, Store, Postgres]

HTTP Layer

  1. Single Request: Initiates one persistent connection to the API, avoiding redundant handshakes.
  2. Bounded Buffer: Stores data in a memory-efficient channel that acts as a stream factory.
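
To illustrate what "bounded" means here, the following is a minimal sketch using the standard library's `sync_channel`. It is not ApiTap's actual buffer, and the helper name `fill_until_full` is hypothetical. It shows that a bounded channel accepts at most its configured capacity of items before the producer has to wait for the consumer.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Hypothetical helper: pushes items into a bounded channel without ever
/// draining it, and returns how many items were accepted before it was full.
pub fn fill_until_full(capacity: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected while we send.
    let (tx, _rx) = sync_channel::<u64>(capacity);
    let mut accepted = 0;
    loop {
        match tx.try_send(accepted as u64) {
            // The buffer still had room for this item.
            Ok(()) => accepted += 1,
            // The buffer is full: a blocking `send` would wait here,
            // which is how the producer is throttled.
            Err(TrySendError::Full(_)) => return accepted,
            // The receiver is gone; nothing more can be sent.
            Err(TrySendError::Disconnected(_)) => return accepted,
        }
    }
}
```

Calling `fill_until_full(256)` accepts exactly 256 items, so the number of items held in memory never exceeds the chosen capacity, however large the upstream response is.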

DataFusion Engine

  1. Multi-Pass Execution: Reads the same in-memory buffer multiple times (schema inference, then execution) without re-fetching.
  2. Zero-Copy Write: Streams transformed Arrow batches directly to Postgres or Parquet.

The Factory Pattern

A stream factory is a function that creates streams on demand, not the stream itself. This allows DataFusion to re-scan data (e.g., for schema inference or multiple passes) without re-triggering the HTTP request.

Key Concept: The HTTP request happens ONCE. The response is buffered in a bounded channel. The Factory simply creates readers for that buffer.
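
The idea can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not ApiTap's implementation: it uses synchronous iterators instead of async streams, a fully materialized `Vec` instead of a bounded channel, and the names `RecordFactory`, `fetch_once`, and `make_factory` are hypothetical. A counter stands in for the HTTP request so the "fetch once, read many times" behavior is visible.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Synchronous stand-in for a stream factory: each call yields a fresh reader.
pub type RecordFactory =
    Arc<dyn Fn() -> Box<dyn Iterator<Item = String> + Send> + Send + Sync>;

/// Hypothetical stand-in for the single HTTP request; counts its invocations.
pub fn fetch_once(fetches: &AtomicUsize) -> Vec<String> {
    fetches.fetch_add(1, Ordering::SeqCst);
    vec![r#"{"id":1}"#.to_string(), r#"{"id":2}"#.to_string()]
}

/// Fetches once, then returns a factory whose readers share the same buffer.
pub fn make_factory(fetches: &AtomicUsize) -> RecordFactory {
    let buffer: Arc<Vec<String>> = Arc::new(fetch_once(fetches));
    Arc::new(move || {
        let shared = Arc::clone(&buffer);
        // Each reader walks the shared buffer from the start; no re-fetch.
        Box::new((0..shared.len()).map(move |i| shared[i].clone()))
            as Box<dyn Iterator<Item = String> + Send>
    })
}
```

Calling the returned factory twice produces two independent readers over identical data, while the fetch counter stays at 1. That is the property that lets a query engine run an inference pass and an execution pass without a second request.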

Code Example

rust
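// Type definition: a shareable, thread-safe closure that returns a fresh
// boxed stream of JSON values each time it is called.
// Assumes `Stream` (e.g. from the `futures` crate), `Value` (e.g. from
// `serde_json`), and a single-parameter `Result<T>` alias are in scope.
use std::{pin::Pin, sync::Arc};
pub type JsonStreamFactory = Arc<
    dyn Fn() -> Pin<Box<dyn Stream<Item = Result<Value>> + Send>>
    + Send + Sync
>;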

Memory Efficiency

Because the channel is bounded, each pipeline's buffer memory is capped by the channel capacity, regardless of the API response size.

Buffer Size                 Memory / Pipeline   1000 Pipelines
8192 items                  ~8 MB               8 GB
256 items (✓ Recommended)   ~256 KB             256 MB
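
The figures in the table are consistent with an average item size of roughly 1 KB (8192 items at about 8 MB). That per-item size is an inference from the table, not a documented constant, and the names below are hypothetical. Under that assumption, the arithmetic can be sketched as:

```rust
/// Assumed average item size in bytes, inferred from the table
/// (8192 items is listed as roughly 8 MB). Not a documented value.
pub const ASSUMED_ITEM_BYTES: u64 = 1024;

/// Upper bound on buffered bytes: capacity x item size x number of pipelines.
pub fn buffer_bytes(capacity_items: u64, pipelines: u64) -> u64 {
    capacity_items * ASSUMED_ITEM_BYTES * pipelines
}
```

For example, `buffer_bytes(8192, 1)` gives 8,388,608 bytes (8 MiB) and `buffer_bytes(256, 1000)` gives 262,144,000 bytes, in line with the table's rounded "256 MB". Real usage depends on the actual size of each buffered item.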