What is Syda?

Syda generates realistic synthetic test data with AI and large language models. It covers structured, unstructured, PDF, and HTML data, while preserving referential integrity, maintaining privacy compliance, and accelerating development workflows. It works with OpenAI, Anthropic, Gemini, and Grok models.

Key Features

Multi-Provider AI Integration:
- Seamless integration with multiple AI providers
- Support for OpenAI (GPT), Anthropic (Claude), Google (Gemini), and xAI (Grok)
- Default model: Anthropic Claude (claude-3-5-haiku-20241022)
- Consistent interface across different providers
- Provider-specific parameter optimization
LLM-based Data Generation:
- AI-powered schema understanding and data creation
- Contextually aware synthetic records
- Natural language prompt customization
- Intelligent schema inference
Multiple Schema Formats:
- YAML/JSON schema file support with full foreign key relationship handling
- SQLAlchemy model integration with automatic metadata extraction
- Python dictionary-based schema definitions
Referential Integrity:
- Automatic foreign key detection and resolution
- Multi-model dependency analysis through topological sorting
- Robust handling of related data with referential constraints
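
The dependency analysis described above can be sketched with the standard library. This is a minimal illustration of topological sorting over foreign keys, not Syda's own implementation: the `foreign_keys` mapping, its table names, and the `generation_order` helper are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical schema summary: table -> {foreign key column: referenced table}
foreign_keys = {
    "customers": {},
    "products": {},
    "orders": {"customer_id": "customers"},
    "order_items": {"order_id": "orders", "product_id": "products"},
}


def generation_order(foreign_keys):
    """Return table names ordered so every parent precedes its children."""
    # TopologicalSorter expects node -> set of nodes that must come first.
    # Self-references are dropped because they would count as a cycle.
    graph = {
        table: {ref for ref in refs.values() if ref != table}
        for table, refs in foreign_keys.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = generation_order(foreign_keys)
print(order)
```

Generating tables in this order means each foreign key column can draw its values from parent rows that already exist. A cycle between tables raises `graphlib.CycleError`, which is a reasonable point to ask the user to break the loop.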
Custom Generators:
- Register column- or type-specific functions for domain-specific data
- Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)
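
To make the idea of a contextual generator concrete, here is a small sketch of a function that picks an ICD-10 code from another field in the same row. It is plain Python and deliberately does not show how the function is registered with Syda, since that call is not described here. The function name, the row layout, and the age bands are assumptions, and the code lists are illustrative only.

```python
import random

# Illustrative ICD-10 codes per age band; a real generator would use a fuller set.
CODES_BY_AGE_BAND = {
    "child": ["J45.909", "H66.90"],  # asthma, otitis media
    "adult": ["I10", "E11.9"],       # hypertension, type 2 diabetes
    "senior": ["I10", "M19.90"],     # hypertension, osteoarthritis
}


def age_band(age):
    """Map an age in years to one of the bands used above."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def icd10_for_patient(row, rng=random):
    """Contextual generator: choose a diagnosis code from the row's age field."""
    return rng.choice(CODES_BY_AGE_BAND[age_band(row["age"])])


# Seeded generator so the example is repeatable.
rng = random.Random(42)
patient = {"name": "Example Patient", "age": 7}
patient["diagnosis_code"] = icd10_for_patient(patient, rng)
print(patient)
```

The key design point is that the generator receives the partially built row, so its output can depend on fields that were generated earlier.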
Large Dataset Generation:
- Chunked direct mode: splits large row counts into LLM calls of batch_size rows each, retrying failed calls with exponential backoff
- Code-gen mode: the LLM writes Python generator functions for simple columns, and only semantic/narrative columns call the LLM at runtime, so far fewer API calls are needed at scale
- Auto-select: direct for ≤ 500 rows, codegen for > 500 rows (configurable)
- Code-gen cache: generated .py files are saved under output_dir/.syda_cache/, so re-runs are near-instant on a schema cache hit
- force_llm: true column flag forces specific columns to always use LLM generation, even in code-gen mode
- CLI flags: --batch-size N and --large-dataset for terminal-based large dataset workflows
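
The chunking, retry, and auto-select behaviour listed above can be sketched as follows. This is an illustration of the general technique under stated assumptions, not Syda's internal code: every function name here is hypothetical, the retry limits are made up, and `flaky_batch` is a stub standing in for a real LLM call. Only the 500-row threshold comes from the list above.

```python
import time


def choose_mode(row_count, threshold=500):
    """Pick a strategy from the row count (threshold assumed configurable)."""
    return "direct" if row_count <= threshold else "codegen"


def batch_sizes(total_rows, batch_size):
    """Split total_rows into full batches plus one smaller remainder batch."""
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def call_with_backoff(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def generate_rows(total_rows, batch_size, generate_batch, **retry_options):
    """Generate total_rows rows by calling generate_batch once per chunk."""
    rows = []
    for size in batch_sizes(total_rows, batch_size):
        rows.extend(call_with_backoff(generate_batch, size, **retry_options))
    return rows


# Demo with a stub that fails once, standing in for a rate-limited LLM call.
calls = {"count": 0}
waits = []  # records requested delays instead of actually sleeping


def flaky_batch(size):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated rate limit")
    return [{"id": i} for i in range(size)]


rows = generate_rows(1050, 500, flaky_batch, sleep=waits.append)
print(choose_mode(1050), len(rows), waits)
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays. In production the default `time.sleep` applies, and adding random jitter to the delay is a common refinement.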
Observability & Cost Tracking:
- Per-run RunReport with per-table and per-column breakdown of strategy, token counts, and cost
- Cost calculated automatically for all supported providers and models
- HTML run report auto-saved to output_dir/run_report_<timestamp>.html after every run
- Access programmatically via generator.last_report after any generation
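
As a rough sketch of the arithmetic behind per-table cost tracking, the example below turns token counts into a dollar figure. It does not reproduce the structure of RunReport, which is not specified here. The `Usage` class, the `example-model` name, and the prices are placeholders invented for illustration, so substitute your provider's published rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per million (input, output) tokens; not real rates.
PLACEHOLDER_PRICES = {"example-model": (1.00, 5.00)}


def cost_usd(usage, model, prices=PLACEHOLDER_PRICES):
    """Cost = input tokens * input rate + output tokens * output rate."""
    input_rate, output_rate = prices[model]
    total = usage.input_tokens * input_rate + usage.output_tokens * output_rate
    return total / 1_000_000


# Hypothetical per-table token counts from one run.
usage_by_table = {
    "customers": Usage(input_tokens=12_000, output_tokens=30_000),
    "orders": Usage(input_tokens=20_000, output_tokens=55_000),
}

cost_by_table = {
    table: cost_usd(usage, "example-model")
    for table, usage in usage_by_table.items()
}
run_total = sum(cost_by_table.values())
print(cost_by_table, run_total)
```

The same function applies at column level by keeping one `Usage` record per column instead of per table.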