What is Syda?

Syda generates realistic synthetic test data (structured, unstructured, PDF, and HTML) using AI and large language models from OpenAI, Anthropic, Gemini, and Grok. It preserves referential integrity, maintains privacy compliance, and accelerates development workflows.

Key Features

  • Multi-Provider AI Integration

    • Seamless integration with multiple AI providers
    • Support for OpenAI (GPT), Anthropic (Claude), Google (Gemini), and xAI (Grok)
    • Default model: Anthropic Claude (claude-3-5-haiku-20241022)
    • Consistent interface across different providers
    • Provider-specific parameter optimization
  • LLM-based Data Generation

    • AI-powered schema understanding and data creation
    • Contextually-aware synthetic records
    • Natural language prompt customization
    • Intelligent schema inference
  • Multiple Schema Formats

    • YAML/JSON schema file support with full foreign key relationship handling
    • SQLAlchemy model integration with automatic metadata extraction
    • Python dictionary-based schema definitions
  • Referential Integrity

    • Automatic foreign key detection and resolution
    • Multi-model dependency analysis through topological sorting
    • Robust handling of related data with referential constraints
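To illustrate the dependency analysis described above, here is a minimal sketch of ordering tables by their foreign keys with a topological sort. It uses Python's standard `graphlib` module; the table names and the `foreign_key_deps` mapping are hypothetical and are not Syda's schema format or its internal implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical example: each table maps to the set of tables it references
# through foreign keys (its "parents"). Not Syda's actual schema format.
foreign_key_deps = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}


def generation_order(deps):
    """Return table names so that every parent table precedes its children.

    graphlib raises CycleError if the foreign keys form a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = generation_order(foreign_key_deps)
print(order)
```

Generating tables in this order means the parent rows already exist when a child table needs valid foreign key values to reference.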
  • Custom Generators

    • Register column- or type-specific functions for domain-specific data
    • Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)
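As a rough sketch of the idea behind contextual generators, the example below keeps a registry of per-column functions, each of which receives the partially built row so it can adapt to fields generated earlier. The registry, the function signatures, and the column names are assumptions made for illustration; they do not reflect Syda's registration API.

```python
import random


def make_age(row, rng):
    """Plain column generator: ignores the rest of the row."""
    return rng.randint(0, 90)


def make_age_group(row, rng):
    """Contextual generator: derives its value from the 'age' field."""
    age = row["age"]
    if age < 18:
        return "pediatric"
    if age < 65:
        return "adult"
    return "senior"


# Hypothetical registry keyed by column name. Dicts preserve insertion
# order, so 'age' is generated before 'age_group' reads it.
COLUMN_GENERATORS = {"age": make_age, "age_group": make_age_group}


def build_row(generators, seed=None):
    """Build one row by calling each column generator in order."""
    rng = random.Random(seed)
    row = {}
    for column, generate in generators.items():
        row[column] = generate(row, rng)
    return row


row = build_row(COLUMN_GENERATORS, seed=42)
print(row)
```

The same pattern would extend to domain-specific values such as diagnosis codes chosen from demographics, provided the dependent column is generated after the fields it reads.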
  • Large Dataset Generation

    • Chunked direct mode: splits large row counts into LLM calls of batch_size rows each, retried with exponential backoff
    • Code-gen mode: LLM writes Python generator functions for simple columns; only semantic/narrative columns call the LLM at runtime, which greatly reduces API calls at scale
    • Auto-select: direct for ≤ 500 rows, codegen for > 500 rows (configurable)
    • Code-gen cache: generated .py files are saved under output_dir/.syda_cache/, so re-runs are instant on schema cache hits
    • force_llm: true column flag forces specific columns to always use LLM generation, even in code-gen mode
    • CLI flags: --batch-size N and --large-dataset for terminal-based large dataset workflows
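The chunking, retry, and auto-select behaviour listed above can be sketched in a few lines. This is an illustrative approximation only: the function names, the retry limits, and the delay schedule are assumptions, and only the 500-row threshold and the `batch_size` concept come from the feature list.

```python
import time


def choose_mode(row_count, threshold=500):
    """Pick 'direct' at or below the threshold, 'codegen' above it."""
    return "direct" if row_count <= threshold else "codegen"


def batch_sizes(row_count, batch_size):
    """Split a row count into chunks of at most batch_size rows."""
    full, rest = divmod(row_count, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, doubling the wait after each failure (1s, 2s, 4s, ...).

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


print(choose_mode(300), choose_mode(1200))
print(batch_sizes(1200, 500))
```

In a real pipeline each entry of `batch_sizes(...)` would become one LLM request wrapped in `call_with_backoff`, so a transient provider error costs one retry of a single chunk rather than the whole table.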
  • Observability & Cost Tracking

    • Per-run RunReport with per-table and per-column breakdown of strategy, token counts, and cost
    • Cost calculated automatically for all supported providers and models
    • HTML run report auto-saved to output_dir/run_report_<timestamp>.html after every run
    • Access programmatically via generator.last_report after any generation
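For intuition about how a per-table cost breakdown can be derived from token counts, here is a small sketch. The `TableUsage` class, the model name, and the prices are placeholders invented for the example; they are not the structure of Syda's RunReport and not real provider pricing.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; NOT real provider pricing.
PRICES = {"example-model": {"input": 1.00, "output": 5.00}}


@dataclass
class TableUsage:
    """Hypothetical per-table token usage record."""
    table: str
    model: str
    input_tokens: int
    output_tokens: int


def table_cost(usage):
    """Cost in USD: tokens times the per-million price, per direction."""
    price = PRICES[usage.model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


usages = [
    TableUsage("customers", "example-model", 12_000, 3_000),
    TableUsage("orders", "example-model", 20_000, 8_000),
]

for usage in usages:
    print(f"{usage.table}: ${table_cost(usage):.4f}")
print(f"total: ${sum(table_cost(u) for u in usages):.4f}")
```

Input and output tokens are priced separately because providers typically charge different rates for each, which is why a report needs both counts per table.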