Table Providers
The JsonStreamTableProvider bridges the gap between streaming HTTP data and DataFusion's SQL engine.
The Concept
DataFusion typically reads from files (CSV, Parquet). To read from an API stream, ApiTap implements a custom TableProvider.
HTTP Stream
Data Source
Stream Factory
Buffer
TableProvider
Bridge
DataFusion SQL
Query Engine
Zero-Copy Streaming
Process Data In-Flight
Unlike other ETL tools that download to disk first
Memory Efficient
Only holds a small buffer (e.g., 256 items) in memory
Lazy Evaluation
Stream is only consumed when SQL query requests data
The Stream Factory
Because SQL engines might need to restart a scan (e.g., for self-joins or multi-pass algorithms), we use a Factory Pattern.
rust
type JsonStreamFactory = Arc<
dyn Fn() -> Pin<Box<dyn Stream<Item = Result<Value>> + Send>>
>;This factory allows DataFusion to "replay" the stream from the in-memory buffer without re-triggering the HTTP request.
Usage Example
Register API as Table
Internal process for creating queryable tables
rust
// 1. Create the provider
let provider = JsonStreamTableProvider::new(factory, schema);
// 2. Register it as a table named "users"
ctx.register_table("users", Arc::new(provider))?;
// 3. Now you can query it!
ctx.sql("SELECT * FROM users WHERE active = true").await?;