What is Syda?
Syda seamlessly generates realistic synthetic test data (structured, unstructured, PDF, and HTML) using AI and large language models from OpenAI, Anthropic, Gemini, and Grok. It preserves referential integrity, maintains privacy compliance, and accelerates development workflows.
Key Features
Multi-Provider AI Integration:
- Seamless integration with multiple AI providers
- Support for OpenAI (GPT), Anthropic (Claude), Google (Gemini), and xAI (Grok)
- The default model is Anthropic's Claude model claude-3-5-haiku-20241022
- Consistent interface across different providers
- Provider-specific parameter optimization
LLM-based Data Generation:
- AI-powered schema understanding and data creation
- Contextually-aware synthetic records
- Natural language prompt customization
- Intelligent schema inference
Multiple Schema Formats:
- YAML/JSON schema file support with full foreign key relationship handling
- SQLAlchemy model integration with automatic metadata extraction
- Python dictionary-based schema definitions
Referential Integrity
- Automatic foreign key detection and resolution
- Multi-model dependency analysis through topological sorting
- Robust handling of related data with referential constraints
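The dependency analysis described above can be illustrated with a small, self-contained sketch. It does not use Syda's own API: the schema dictionary shape, the `references` key, and the `generation_order` helper are assumptions made for illustration. It only shows how topological sorting yields an order in which parent tables are generated before the tables that reference them.

```python
from graphlib import TopologicalSorter

# Hypothetical schema shape (not Syda's format): table -> column -> spec.
# A "references" entry of the form "table.column" marks a foreign key.
schemas = {
    "order_items": {
        "id": {"type": "integer"},
        "order_id": {"type": "integer", "references": "orders.id"},
    },
    "orders": {
        "id": {"type": "integer"},
        "customer_id": {"type": "integer", "references": "customers.id"},
    },
    "customers": {"id": {"type": "integer"}},
}


def generation_order(schemas):
    """Return table names so every referenced table precedes its dependents."""
    graph = {}
    for table, columns in schemas.items():
        parents = set()
        for spec in columns.values():
            ref = spec.get("references")
            if ref:
                # Keep only the table part of "table.column".
                parents.add(ref.split(".")[0])
        graph[table] = parents
    # TopologicalSorter expects node -> predecessors and emits predecessors first.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(schemas)
print(order)
```

With this chain of dependencies there is only one valid order: `customers`, then `orders`, then `order_items`. A circular reference would raise `graphlib.CycleError`, which a real implementation would need to handle explicitly.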
Custom Generators
- Register column- or type-specific functions for domain-specific data
- Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)
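As a rough sketch of the idea, the snippet below registers column-specific generator functions and lets one of them read fields that were filled in earlier in the same row. The registry, the `register_generator` decorator, the `build_row` helper, and the function signatures are all hypothetical and are not Syda's registration API. The ICD-10 codes are a tiny illustrative sample, not a clinically meaningful mapping.

```python
import random

# Hypothetical registry: (table, column) -> function(row, rng) returning a value.
GENERATORS = {}

# Small illustrative code pools keyed by age group.
PEDIATRIC_CODES = ["J45.909", "J06.9"]  # asthma, upper respiratory infection
ADULT_CODES = ["I10", "E11.9"]          # hypertension, type 2 diabetes


def register_generator(table, column):
    """Decorator that stores a generator function for one column."""
    def decorator(func):
        GENERATORS[(table, column)] = func
        return func
    return decorator


@register_generator("patients", "age")
def generate_age(row, rng):
    return rng.randint(0, 95)


@register_generator("patients", "diagnosis_code")
def generate_diagnosis_code(row, rng):
    # Contextual generator: the code pool depends on the age already in the row.
    pool = PEDIATRIC_CODES if row["age"] < 18 else ADULT_CODES
    return rng.choice(pool)


def build_row(table, columns, rng):
    """Fill columns in order so later generators can read earlier fields."""
    row = {}
    for column in columns:
        row[column] = GENERATORS[(table, column)](row, rng)
    return row


rng = random.Random(42)
print(build_row("patients", ["age", "diagnosis_code"], rng))
```

Column order matters in this sketch: `age` must be generated before `diagnosis_code`, because the contextual generator reads it from the partially built row.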
Large Dataset Generation
- Chunked direct mode: splits large row counts into `batch_size` LLM calls with exponential-backoff retry
- Code-gen mode: LLM writes Python generator functions for simple columns; only semantic/narrative columns call the LLM at runtime, which means dramatically fewer API calls at scale
- Auto-select: `direct` for ≤ 500 rows, `codegen` for > 500 rows (configurable)
- Code-gen cache: generated `.py` files saved under `output_dir/.syda_cache/`; re-runs are instant on schema cache hits
- `force_llm: true` column flag forces specific columns to always use LLM generation, even in code-gen mode
- CLI flags: `--batch-size N` and `--large-dataset` for terminal-based large dataset workflows
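The chunking, retry, and auto-select behaviour listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Syda's implementation: the function names, the retry count, the base delay, and the way a batch is requested (`generate_batch(size)`) are all hypothetical. Only the 500-row threshold and the mode names `direct` and `codegen` come from the feature list.

```python
import time


def choose_mode(num_rows, threshold=500):
    """Pick "direct" for small runs and "codegen" above the threshold."""
    return "direct" if num_rows <= threshold else "codegen"


def chunk_sizes(num_rows, batch_size):
    """Split a row count into full batches plus one smaller remainder batch."""
    full, rest = divmod(num_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def call_with_backoff(func, *args, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func, doubling the wait after each failure (1x, 2x, 4x, ...)."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            sleep(base_delay * 2 ** attempt)


def generate_chunked(num_rows, batch_size, generate_batch, **retry_options):
    """Collect rows batch by batch, retrying each batch independently."""
    rows = []
    for size in chunk_sizes(num_rows, batch_size):
        rows.extend(call_with_backoff(generate_batch, size, **retry_options))
    return rows


def fake_llm_batch(size):
    # Stand-in for an LLM call; returns placeholder rows.
    return [{"value": i} for i in range(size)]


print(choose_mode(1200), chunk_sizes(1200, 500))
print(len(generate_chunked(1200, 500, fake_llm_batch)))
```

Retrying per batch rather than per run means a transient failure costs one repeated call instead of restarting the whole dataset.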
Observability & Cost Tracking
- Per-run `RunReport` with per-table and per-column breakdown of strategy, token counts, and cost
- Cost calculated automatically for all supported providers and models
- HTML run report auto-saved to `output_dir/run_report_<timestamp>.html` after every run
- Access programmatically via `generator.last_report` after any generation
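To make the cost breakdown concrete, here is a minimal sketch of how a per-table cost could be derived from token counts. It is not Syda's `RunReport` class: the `TableUsage` dataclass, the `summarize_costs` helper, and especially the per-million-token prices are placeholders chosen for illustration, not real provider rates.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    """Hypothetical per-table usage record (not Syda's RunReport structure)."""
    strategy: str
    input_tokens: int
    output_tokens: int


def cost_usd(usage, input_price_per_mtok, output_price_per_mtok):
    """Cost in USD given prices per one million input/output tokens."""
    total = (usage.input_tokens * input_price_per_mtok
             + usage.output_tokens * output_price_per_mtok)
    return total / 1_000_000


def summarize_costs(usages, input_price_per_mtok, output_price_per_mtok):
    """Return a per-table cost mapping plus a "total" entry."""
    costs = {
        table: cost_usd(usage, input_price_per_mtok, output_price_per_mtok)
        for table, usage in usages.items()
    }
    costs["total"] = sum(costs.values())
    return costs


# Placeholder prices (USD per million tokens); real rates vary by provider and model.
usages = {
    "customers": TableUsage("direct", input_tokens=12_000, output_tokens=30_000),
    "orders": TableUsage("codegen", input_tokens=4_000, output_tokens=6_000),
}
for table, cost in summarize_costs(usages, 1.0, 5.0).items():
    print(f"{table}: ${cost:.4f}")
```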