Using Google Gemini Models with SYDA¶

Source code: examples/model_selection/test_gemini_models.py

This example demonstrates how to use Google's Gemini models with SYDA for synthetic data generation. Google Gemini offers several models with different capabilities, token limits, and price points.

Prerequisites¶

Before running this example, you need to:

Install SYDA and its dependencies
Set up your Gemini API key in your environment

You can set the API key in your .env file:

GEMINI_API_KEY=your_api_key_here

Or set it as an environment variable before running your script:

export GEMINI_API_KEY=your_api_key_here
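
It can be easier to diagnose a missing key before any generation starts than from a failed API call. The following is a minimal sketch of such a check; `require_api_key` is a hypothetical helper for illustration, not part of SYDA, and it assumes the key lives in the `GEMINI_API_KEY` variable shown above.

```python
import os


def require_api_key(name="GEMINI_API_KEY", env=None):
    """Return the API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value
```

If you load the key from a .env file, call the helper after `load_dotenv()` so that the file has already been read.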

Example Code¶

The following example demonstrates how to configure and use different Gemini models for synthetic data generation:

from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Define schema for a single table
schemas = {
    'Patient': {
        'patient_id': {'type': 'number', 'description': 'Unique identifier for the patient'},
        'diagnosis_code': {'type': 'text', 'description': 'ICD-10 diagnosis code'},
        'email': {'type': 'email', 'description': 'Patient email address used for communication'},
        'visit_date': {'type': 'date', 'description': 'Date when the patient visited the clinic'},
        'notes': {'type': 'text', 'description': 'Clinical notes for the patient visit'}
    },
    'Claim': {
        'claim_id': {'type': 'number', 'description': 'Unique identifier for the claim'},
        'patient_id': {'type': 'foreign_key', 'description': 'Reference to the patient who made the claim', 'references': {'schema': 'Patient', 'field': 'patient_id'}},
        'diagnosis_code': {'type': 'text', 'description': 'ICD-10 diagnosis code'},
        'email':    {'type': 'email', 'description': 'Patient email address used for communication'},
        'visit_date': {'type': 'date', 'description': 'Date when the patient visited the clinic'},
        'notes': {'type': 'text', 'description': 'Clinical notes for the patient visit'}
    }
}

prompts = {
    'Patient': 'Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes.',
    'Claim': 'Generate realistic synthetic claim records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes.'
}
sample_sizes = {'Patient': 15, 'Claim': 15}

print("--------------Testing Gemini 2.5 Flash----------------")
model_config = ModelConfig(
    provider="gemini",
    model_name="gemini-2.5-flash",
    temperature=0.7,
    max_tokens=8192  # Larger value for more complete responses
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_gemini_models", 
        "flash-2-5"
)
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")


print("--------------Testing Gemini 2.0 Flash----------------")
model_config = ModelConfig(
    provider="gemini",
    model_name="gemini-2.0-flash",
    temperature=0.7,
    max_tokens=8192  # Larger value for more complete responses
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_gemini_models", 
        "flash-2-0"
)
sample_sizes = {'Patient': 50, 'Claim': 75}
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")


print("--------------Testing Gemini 2.5 Pro----------------")
model_config = ModelConfig(
    provider="gemini",
    model_name="gemini-2.5-pro",
    temperature=0.7,
    max_tokens=64000  # Larger value for more complete responses
)

generator = SyntheticDataGenerator(model_config=model_config)
# Define output directory
output_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 
        "output", 
        "test_gemini_models", 
        "pro-2-5"
)
sample_sizes = {'Patient': 100, 'Claim': 150}  # Pro can handle larger datasets
# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts=prompts,
    sample_sizes=sample_sizes,
    output_dir=output_dir
)
print(f"Data saved to {output_dir}")
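
The three runs above differ only in model name, token limit, sample sizes, and output folder, so they could also be driven from a small table. The sketch below shows one way to do that; `RUNS` and `build_runs` are hypothetical names introduced for illustration, and the SYDA calls are left as comments because they need an installed package and a valid API key.

```python
import os

# Run settings copied from the example above:
# (output folder, model name, max_tokens, sample sizes).
RUNS = [
    ("flash-2-5", "gemini-2.5-flash", 8192, {"Patient": 15, "Claim": 15}),
    ("flash-2-0", "gemini-2.0-flash", 8192, {"Patient": 50, "Claim": 75}),
    ("pro-2-5", "gemini-2.5-pro", 64000, {"Patient": 100, "Claim": 150}),
]


def build_runs(base_dir, temperature=0.7):
    """Expand RUNS into one settings dict per model."""
    plans = []
    for label, model_name, max_tokens, sample_sizes in RUNS:
        plans.append({
            "config_kwargs": {
                "provider": "gemini",
                "model_name": model_name,
                "temperature": temperature,
                "max_tokens": max_tokens,
            },
            "sample_sizes": dict(sample_sizes),
            "output_dir": os.path.join(
                base_dir, "output", "test_gemini_models", label
            ),
        })
    return plans


# Usage with the names from the example above (requires SYDA and an API key):
# for plan in build_runs(os.path.dirname(os.path.abspath(__file__))):
#     model_config = ModelConfig(**plan["config_kwargs"])
#     generator = SyntheticDataGenerator(model_config=model_config)
#     generator.generate_for_schemas(
#         schemas=schemas,
#         prompts=prompts,
#         sample_sizes=plan["sample_sizes"],
#         output_dir=plan["output_dir"],
#     )
```

Keeping the settings in one list makes it easier to add or remove a model without copying the whole block.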

Sample Outputs¶

You can view sample outputs generated by these Gemini models here: examples/model_selection/output/test_gemini_models

Gemini Model Options¶

SYDA supports several Google Gemini models:

gemini-2.5-pro: Latest high-performance model for complex data generation (input: 1,048,576 tokens; output: 65,536 tokens)
gemini-2.5-flash: Latest fast model for efficient data generation (input: 1,048,576 tokens; output: 65,536 tokens)
gemini-2.0-flash: Previous version flash model for quick data generation (input: 1,048,576 tokens; output: 8,192 tokens)
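
Because the output limits differ between models, it may help to cap `max_tokens` before building a configuration. The following sketch hard-codes the output limits listed above; `GEMINI_OUTPUT_LIMITS` and `clamp_max_tokens` are hypothetical names, not part of SYDA, and the values should be re-checked against Google's current documentation.

```python
# Output token limits as listed in this document (an assumption that may go stale).
GEMINI_OUTPUT_LIMITS = {
    "gemini-2.5-pro": 65_536,
    "gemini-2.5-flash": 65_536,
    "gemini-2.0-flash": 8_192,
}


def clamp_max_tokens(model_name, requested):
    """Return `requested`, capped at the model's listed output limit."""
    limit = GEMINI_OUTPUT_LIMITS.get(model_name)
    if limit is None:
        raise ValueError(f"No output limit recorded for model {model_name!r}")
    return min(requested, limit)
```

For example, `clamp_max_tokens("gemini-2.0-flash", 64000)` yields 8192, which can then be passed as `max_tokens` to `ModelConfig`.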

Key Concepts¶

Model Configuration¶

The ModelConfig class is used to specify which model to use:

model_config = ModelConfig(
    provider="gemini",
    model_name="gemini-2.5-flash",
    temperature=0.7,
    max_tokens=8192
)

provider: Set to "gemini" to use Gemini models
model_name: Specify which Gemini model to use
temperature: Controls randomness in generation (0.0-1.0)
max_tokens: Maximum number of tokens in the response

Scaling to Larger Datasets¶

The Gemini 2.0 Flash run in this example raises the sample sizes to a moderate level:

sample_sizes = {'Patient': 50, 'Claim': 75}

Both Flash models handle moderate-sized datasets like this efficiently. For very large datasets, consider breaking the generation into smaller batches.
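
One way to break a large request into batches is sketched below. `split_into_batches` is a hypothetical helper, not part of SYDA: it spreads each schema's row count evenly over the smallest number of batches that keeps every schema at or below a chosen cap, so each batch still contains rows for every schema.

```python
import math


def split_into_batches(sample_sizes, max_rows_per_batch):
    """Split per-schema row counts into evenly sized batches.

    No schema receives more than `max_rows_per_batch` rows in any batch.
    """
    if max_rows_per_batch <= 0:
        raise ValueError("max_rows_per_batch must be positive")
    largest = max(sample_sizes.values(), default=0)
    n_batches = math.ceil(largest / max_rows_per_batch)
    batches = []
    for index in range(n_batches):
        batch = {}
        for name, total in sample_sizes.items():
            base, extra = divmod(total, n_batches)
            # The first `extra` batches take one additional row each.
            batch[name] = base + (1 if index < extra else 0)
        batches.append(batch)
    return batches
```

Each resulting dictionary could be passed as `sample_sizes` to a separate `generate_for_schemas` call with its own `output_dir`. Whether identifiers stay unique and foreign keys stay valid across separate calls is not covered by this example, so check the merged results before relying on them.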

Output Directory Structure¶

The example code creates an organized directory structure for output files:

output/
├── test_gemini_models/
│   ├── pro-2-5/
│   │   ├── Patient.csv
│   │   └── Claim.csv
│   ├── flash-2-5/
│   │   ├── Patient.csv
│   │   └── Claim.csv
│   └── flash-2-0/
│       ├── Patient.csv
│       └── Claim.csv
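
After a run, the CSV files in one of these folders can be checked with the standard library alone. The sketch below looks for claims whose `patient_id` has no matching row in Patient.csv. It assumes each file has a header row whose column names match the schema field names, and `find_orphan_claims` is a hypothetical helper, not part of SYDA.

```python
import csv
import os


def find_orphan_claims(output_dir):
    """Return patient_id values in Claim.csv that are absent from Patient.csv."""

    def read_column(filename, column):
        path = os.path.join(output_dir, filename)
        with open(path, newline="", encoding="utf-8") as handle:
            return [row[column] for row in csv.DictReader(handle)]

    patient_ids = set(read_column("Patient.csv", "patient_id"))
    claim_refs = read_column("Claim.csv", "patient_id")
    # Values are compared as strings, exactly as they appear in the files.
    return sorted({ref for ref in claim_refs if ref not in patient_ids})
```

An empty list means every claim points at an existing patient. Because the comparison is textual, values such as `1` and `1.0` would be treated as different identifiers.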

Best Practices¶

Choose the right model for your task:
Use gemini-2.5-pro for complex schemas and larger datasets
Use gemini-2.5-flash for fast, efficient generation with the latest Flash model
Use gemini-2.0-flash if you need consistency with previous results
Set appropriate token limits: gemini-2.0-flash supports up to 8,192 output tokens, while gemini-2.5-flash and gemini-2.5-pro support up to 65,536. Set the max_tokens parameter within the limit of the model you choose.
Use detailed prompts: Gemini models respond well to specific guidance in prompts. Include details about the type of data you want to generate.
Monitor API usage: Keep track of your API usage to manage costs, especially when working with larger datasets.