Output Options¶

SYDA offers flexible options for handling the output of generated data, allowing you to save results in various formats and locations.

Return Types¶

By default, SYDA returns generated data as pandas DataFrames:

from syda import SyntheticDataGenerator, ModelConfig

config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)

# Generate data
results = generator.generate_for_schemas(
    schemas={...},
    sample_sizes={"Customer": 10}
)

# Results is a dictionary of DataFrames
customer_df = results["Customer"]

# Work with the DataFrame
print(f"Generated {len(customer_df)} customer records")
print(customer_df.head())

The returned results dictionary maps table names to pandas DataFrames, making it easy to analyze, transform, or further process the generated data.

Saving to Files¶

You can save generated data to files by specifying an output directory:

results = generator.generate_for_schemas(
    schemas={...},
    sample_sizes={"Customer": 10, "Order": 25},
    output_dir="output/crm_data"
)

When you provide an output_dir:

SYDA creates the directory if it doesn't exist
Each table's data is saved as a CSV file (e.g., Customer.csv, Order.csv)
The results dictionary still contains the DataFrames for immediate use

Output Formats¶

By default, SYDA saves data in CSV format, but you can specify other formats using the output_formats parameter:

results = generator.generate_for_schemas(
    schemas={...},
    sample_sizes={"Customer": 10, "Order": 25},
    output_dir="output/crm_data",
    output_formats=["csv", "json"]
)

Supported output formats include:

csv: Standard comma-separated values format
json: JSON format with records orientation

Document Output¶

When generating unstructured documents alongside structured data, SYDA saves the documents in their specified formats:

schemas = {
    'Report': {
        '__template__': 'true',
        '__template_source__': 'templates/report.html',
        '__input_file_type__': 'html',
        '__output_file_type__': 'pdf',
        # ...other fields
    }
}

results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={"Report": 5},
    output_dir="output/reports"
)

This creates:

A Report subdirectory with the generated documents (e.g., Report_1.pdf, Report_2.pdf, etc.)

Output Directory Structure¶

When using both structured data and document generation, SYDA creates an organized directory structure:

output/
├── Customer.csv
├── Order.csv
├── OrderItem.csv
├── Invoice/
│   ├── Invoice_1.pdf
│   ├── Invoice_2.pdf
│   └── ...
└── Report/
    ├── Report_1.pdf
    ├── Report_2.pdf
    └── ...

This structure makes it easy to locate and manage both structured data and generated documents.

Working with Output Programmatically¶

After generation, you can further process or transform the output data:

# Generate data
results = generator.generate_for_schemas(
    schemas={...},
    sample_sizes={"Customer": 10, "Order": 25}
)

# Process Customer data
customers = results["Customer"]
vip_customers = customers[customers["annual_revenue"] > 1000000]

# Process Order data
orders = results["Order"]
recent_orders = orders[orders["order_date"] > "2023-01-01"]

# Join data for analysis
merged = orders.merge(customers, left_on="customer_id", right_on="id")

Best Practices¶

Use Descriptive Output Directories: Create meaningful directory names for your output
Choose Appropriate Formats: Select output formats based on your downstream needs
Process DataFrames Before Saving: Apply transformations before writing to disk when needed
Check Output Size: Be mindful of output size for large generations
Backup Results: Keep the returned DataFrames for immediate use even when saving to disk

Examples¶

Explore SQLAlchemy Example and Yaml Example