Using Anthropic Models with SYDA

Source code: examples/model_selection/test_claude_models.py

This example demonstrates how to use Anthropic's Claude models with SYDA for synthetic data generation. Anthropic offers several Claude models with different capabilities, token limits, and price points.

Prerequisites

Before running this example, you need to:

Install SYDA and its dependencies
Set up your Anthropic API key in your environment

You can set the API key in your .env file:

ANTHROPIC_API_KEY=your_api_key_here

Or set it as an environment variable before running your script:

export ANTHROPIC_API_KEY=your_api_key_here
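
A missing key usually surfaces only when the first API request fails, so it can help to check for it up front. The sketch below uses only the standard library. `require_api_key` is a hypothetical helper written for this page, not a SYDA function.

```python
import os


def require_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Illustrative helper only; it is not part of SYDA.
    """
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it "
            "before running the script."
        )
    return key
```

If you use a .env file, call `require_api_key()` after `load_dotenv()` so the file's values are already in the environment.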

Example Code

The following example demonstrates how to configure and use different Claude models for synthetic data generation:

from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Define schema for a single table
schemas = {
    'Patient': {
        'patient_id': {'type': 'number', 'description': 'Unique identifier for the patient'},
        'diagnosis_code': {'type': 'text', 'description': 'ICD-10 diagnosis code'},
        'email': {'type': 'email', 'description': 'Patient email address used for communication'},
        'visit_date': {'type': 'date', 'description': 'Date when the patient visited the clinic'},
        'notes': {'type': 'text', 'description': 'Clinical notes for the patient visit'}
    },
    'Claim': {
        'claim_id': {'type': 'number', 'description': 'Unique identifier for the claim'},
        'patient_id': {'type': 'foreign_key', 'description': 'Reference to the patient who made the claim', 'references': {'schema': 'Patient', 'field': 'patient_id'}},
        'diagnosis_code': {'type': 'text', 'description': 'ICD-10 diagnosis code'},
        'email': {'type': 'email', 'description': 'Patient email address used for communication'},
        'visit_date': {'type': 'date', 'description': 'Date when the patient visited the clinic'},
        'notes': {'type': 'text', 'description': 'Clinical notes for the patient visit'}
    }
}

prompts = {
    'Patient': 'Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes.',
    'Claim': 'Generate realistic synthetic claim records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes.'
}
sample_sizes = {'Patient': 15, 'Claim': 15}

print("--------------Testing Claude Haiku----------------")
model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-3-5-haiku-20241022",
    temperature=0.7,
    max_tokens=8192  # Maximum output tokens for Claude 3.5 Haiku
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_claude_models", 
        "haiku-3-5"
)
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")


print("--------------Testing Claude Sonnet----------------")
model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-sonnet-4-20250514",
    temperature=0.7,
    max_tokens=64000  # Maximum output tokens for Claude Sonnet 4
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_claude_models", 
        "sonnet-4"
)
# Larger sample sizes; the Opus run below reuses these values
sample_sizes = {'Patient': 100, 'Claim': 200}
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")

print("--------------Testing Claude Opus----------------")
model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-opus-4-20250514",
    temperature=0.7,
    max_tokens=32000  # Maximum output tokens for Claude Opus 4
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_claude_models", 
        "opus-4"
)
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")
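
The three runs above differ only in model name, token limit, output folder, and sample sizes, so the same script can be written as a loop over a small table. This is a sketch rather than the code in the repository: `MODEL_RUNS`, `output_dir_for`, and `run_all` are hypothetical names, and the SYDA calls are the same ones used in the example. The SYDA imports sit inside `run_all` so the path helper can be used on its own.

```python
import os

# One entry per run: (folder label, model name, max_tokens, sample sizes).
# Values are copied from the example above.
MODEL_RUNS = [
    ("haiku-3-5", "claude-3-5-haiku-20241022", 8192, {"Patient": 15, "Claim": 15}),
    ("sonnet-4", "claude-sonnet-4-20250514", 64000, {"Patient": 100, "Claim": 200}),
    ("opus-4", "claude-opus-4-20250514", 32000, {"Patient": 100, "Claim": 200}),
]


def output_dir_for(base_dir, label):
    """Build <base_dir>/output/test_claude_models/<label>."""
    return os.path.join(base_dir, "output", "test_claude_models", label)


def run_all(schemas, prompts, base_dir):
    """Run every configured model. Requires SYDA and a valid API key."""
    # Imported here so the helpers above work without SYDA installed.
    from syda.generate import SyntheticDataGenerator
    from syda.schemas import ModelConfig

    for label, model_name, max_tokens, sample_sizes in MODEL_RUNS:
        config = ModelConfig(
            provider="anthropic",
            model_name=model_name,
            temperature=0.7,
            max_tokens=max_tokens,
        )
        generator = SyntheticDataGenerator(model_config=config)
        out_dir = output_dir_for(base_dir, label)
        generator.generate_for_schemas(
            schemas=schemas,
            prompts=prompts,
            sample_sizes=sample_sizes,
            output_dir=out_dir,
        )
        print(f"Data saved to {out_dir}")
```

Adding or removing a model then means editing one row of `MODEL_RUNS` instead of copying a whole block.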

Sample Outputs

You can view sample outputs generated by these Claude models here: examples/model_selection/output/test_claude_models

Claude Model Options

SYDA supports several Anthropic Claude models with different capabilities and token limits:

claude-3-5-haiku-20241022: Fast and efficient for quick data generation (up to 8,192 output tokens)
claude-sonnet-4-20250514: Balance of quality and performance (up to 64,000 output tokens)
claude-opus-4-20250514: Highest capability for complex data (up to 32,000 output tokens)
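
One way to keep these limits in a single place is a lookup table with a small helper that caps a requested value. The sketch below is illustrative: `MAX_OUTPUT_TOKENS` and `clamp_max_tokens` are hypothetical names, not part of SYDA, and the numbers are the ones listed above.

```python
# Maximum output tokens per model, taken from the list above.
MAX_OUTPUT_TOKENS = {
    "claude-3-5-haiku-20241022": 8192,
    "claude-sonnet-4-20250514": 64000,
    "claude-opus-4-20250514": 32000,
}


def clamp_max_tokens(model_name, requested):
    """Cap `requested` at the model's limit; raise for unknown models."""
    if model_name not in MAX_OUTPUT_TOKENS:
        raise KeyError(f"No token limit recorded for {model_name!r}")
    if requested <= 0:
        raise ValueError("requested must be positive")
    return min(requested, MAX_OUTPUT_TOKENS[model_name])
```

For example, `clamp_max_tokens("claude-opus-4-20250514", 64000)` returns 32000, which avoids passing Sonnet's limit to Opus by mistake.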

Key Concepts

Model Configuration

The ModelConfig class specifies which model SYDA calls and how it generates output:

model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-3-5-haiku-20241022",
    temperature=0.7,
    max_tokens=8192
)

provider: Set to "anthropic" to use Claude models
model_name: Specify which Claude model to use
temperature: Controls randomness in generation (0.0-1.0)
max_tokens: Maximum number of tokens in the response
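
Checking these values before building the configuration can give clearer errors than a failed API call. The following is a minimal sketch: `validate_model_settings` is a hypothetical helper, not part of SYDA, and it assumes the 0.0-1.0 temperature range given above. Whether `ModelConfig` performs similar validation itself is not covered here.

```python
def validate_model_settings(model_name, temperature, max_tokens, provider="anthropic"):
    """Return keyword arguments for ModelConfig after basic sanity checks."""
    if not isinstance(model_name, str) or not model_name:
        raise ValueError("model_name must be a non-empty string")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    # Keys match the ModelConfig keyword arguments used in this example.
    return {
        "provider": provider,
        "model_name": model_name,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be unpacked directly, for instance `ModelConfig(**validate_model_settings("claude-3-5-haiku-20241022", 0.7, 8192))`.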

Scaling to Larger Datasets

When generating larger datasets, consider using more capable models:

sample_sizes = {'Patient': 100, 'Claim': 200}

Sonnet and Opus accept much higher max_tokens values than Haiku (64,000 and 32,000 versus 8,192 in this example), so they can return more rows in a single response. Generating a larger dataset in one request is more efficient than making multiple smaller requests.
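
A rough estimate shows how the token limit bounds the number of rows per response. The sketch below is back-of-the-envelope arithmetic only: `rows_per_response` and `requests_needed` are hypothetical helpers, the tokens-per-row figure is an assumption you would measure from your own output, and how SYDA itself splits work across requests is not described here.

```python
import math


def rows_per_response(max_tokens, tokens_per_row, overhead_tokens=0):
    """Estimate how many rows fit in one response of `max_tokens`."""
    if tokens_per_row <= 0:
        raise ValueError("tokens_per_row must be positive")
    usable = max_tokens - overhead_tokens
    return max(usable // tokens_per_row, 0)


def requests_needed(total_rows, max_tokens, tokens_per_row, overhead_tokens=0):
    """Estimate how many responses are needed to produce `total_rows`."""
    per_response = rows_per_response(max_tokens, tokens_per_row, overhead_tokens)
    if per_response == 0:
        raise ValueError("max_tokens is too small for even one row")
    return math.ceil(total_rows / per_response)
```

Assuming about 120 tokens per Claim row, 200 rows would take an estimated three responses at 8,192 tokens (68 rows each) but only one at 64,000 tokens.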

Output Directory Structure

The example code creates an organized directory structure for output files:

output/
├── test_claude_models/
│   ├── haiku-3-5/
│   │   ├── Patient.csv
│   │   └── Claim.csv
│   ├── sonnet-4/
│   │   ├── Patient.csv
│   │   └── Claim.csv
│   └── opus-4/
│       ├── Patient.csv
│       └── Claim.csv
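
Because Claim.patient_id is declared as a foreign key to Patient.patient_id, a quick check on the two CSV files can confirm that every claim points at an existing patient. This sketch uses only the standard library and assumes the CSV headers match the schema field names. `read_column` and `orphaned_claims` are hypothetical helpers, not SYDA functions.

```python
import csv
import os


def read_column(csv_path, column):
    """Return all values of one column from a CSV file with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]


def orphaned_claims(model_dir):
    """Return Claim.patient_id values that have no matching Patient row."""
    patient_ids = set(read_column(os.path.join(model_dir, "Patient.csv"), "patient_id"))
    claim_refs = read_column(os.path.join(model_dir, "Claim.csv"), "patient_id")
    # Values are compared as strings, exactly as they appear in the files.
    return sorted({ref for ref in claim_refs if ref not in patient_ids})
```

Calling `orphaned_claims` on one of the model folders, such as `output/test_claude_models/sonnet-4`, returns an empty list when every reference resolves. Since the comparison is textual, an ID written as `1` in one file and `1.0` in the other would be reported as a mismatch.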

Best Practices

Choose the right model for your task:
Use Haiku for small datasets and simple schemas
Use Sonnet for medium-sized datasets with moderate complexity
Use Opus for complex data structures and relationships
Set appropriate token limits: Each model has its own output-token limit (see Claude Model Options above). Set max_tokens at or below the limit of the model you choose.
Use detailed prompts: Claude models respond well to specific guidance in prompts. Include details about the type of data you want to generate.
Monitor API usage: Keep track of your API usage to manage costs, especially when working with larger datasets.