Table Providers

The JsonStreamTableProvider bridges the gap between streaming HTTP data and DataFusion's SQL engine.

The Concept

DataFusion typically reads from files (CSV, Parquet). To read from an API stream, ApiTap implements a custom TableProvider.

HTTP Stream
Data Source
Stream Factory
Buffer
TableProvider
Bridge
DataFusion SQL
Query Engine

Zero-Copy Streaming

Process Data In-Flight

Unlike other ETL tools that download to disk first

Memory Efficient

Only holds a small buffer (e.g., 256 items) in memory

Lazy Evaluation

Stream is only consumed when SQL query requests data

The Stream Factory

Because SQL engines might need to restart a scan (e.g., for self-joins or multi-pass algorithms), we use a Factory Pattern.
rust
type JsonStreamFactory = Arc<
  dyn Fn() -> Pin<Box<dyn Stream<Item = Result<Value>> + Send>> 
>;

This factory allows DataFusion to "replay" the stream from the in-memory buffer without re-triggering the HTTP request.

Usage Example

Register API as Table

Internal process for creating queryable tables

rust
// 1. Create the provider
let provider = JsonStreamTableProvider::new(factory, schema);

// 2. Register it as a table named "users"
ctx.register_table("users", Arc::new(provider))?;

// 3. Now you can query it!
ctx.sql("SELECT * FROM users WHERE active = true").await?;