Output Options¶
SYDA offers flexible options for handling the output of generated data, allowing you to save results in various formats and locations.
Return Types¶
By default, SYDA returns generated data as pandas DataFrames:
from syda import SyntheticDataGenerator, ModelConfig
config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)
# Generate data
results = generator.generate_for_schemas(
schemas={...},
sample_sizes={"Customer": 10}
)
# Results is a dictionary of DataFrames
customer_df = results["Customer"]
# Work with the DataFrame
print(f"Generated {len(customer_df)} customer records")
print(customer_df.head())
The returned results dictionary maps table names to pandas DataFrames, making it easy to analyze, transform, or further process the generated data.
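Because the return value is an ordinary dictionary, every generated table can be inspected in a single loop. The sketch below uses a hand-built stand-in for the returned dictionary so that it runs without an API call; `summarize_results` and the sample columns are hypothetical and not part of SYDA.

```python
import pandas as pd


def summarize_results(results):
    """Return {table_name: (row_count, column_names)} for a dict of DataFrames."""
    return {
        name: (len(df), list(df.columns))
        for name, df in results.items()
    }


# Stand-in for the dictionary returned by generate_for_schemas (hypothetical data).
results = {
    "Customer": pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]}),
    "Order": pd.DataFrame({"id": [10], "customer_id": [1]}),
}

summary = summarize_results(results)
for table, (rows, columns) in summary.items():
    print(f"{table}: {rows} rows, columns={columns}")
```

A quick summary like this is a cheap way to confirm that each table has the expected number of rows before doing further processing.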
Saving to Files¶
You can save generated data to files by specifying an output directory:
results = generator.generate_for_schemas(
schemas={...},
sample_sizes={"Customer": 10, "Order": 25},
output_dir="output/crm_data"
)
When you provide an output_dir:
- SYDA creates the directory if it doesn't exist
- Each table's data is saved as a CSV file (e.g., Customer.csv, Order.csv)
- The results dictionary still contains the DataFrames for immediate use
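Saved CSV files can be reloaded in a later session without regenerating anything. The following is a minimal sketch that assumes the one-file-per-table naming described above; `load_csv_tables` is a hypothetical helper, and a temporary directory stands in for `output/crm_data`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_tables(output_dir):
    """Read every top-level *.csv file in output_dir into a DataFrame keyed by file stem."""
    return {
        path.stem: pd.read_csv(path)
        for path in sorted(Path(output_dir).glob("*.csv"))
    }


# A temporary directory with hand-written files stands in for a real output directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2]}).to_csv(Path(tmp) / "Customer.csv", index=False)
    pd.DataFrame({"id": [10, 11, 12]}).to_csv(Path(tmp) / "Order.csv", index=False)
    tables = load_csv_tables(tmp)

print({name: len(df) for name, df in tables.items()})
```

The reloaded dictionary has the same shape as the one returned by the generator, so downstream code can accept either.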
Output Formats¶
By default, SYDA saves data in CSV format, but you can specify other formats using the output_formats parameter:
results = generator.generate_for_schemas(
schemas={...},
sample_sizes={"Customer": 10, "Order": 25},
output_dir="output/crm_data",
output_formats=["csv", "json"]
)
Supported output formats include:
- csv: Standard comma-separated values format
- json: JSON format with records orientation
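To show what "records orientation" means, the sketch below serializes a small DataFrame with pandas' own `to_json(orient="records")`. It illustrates the layout only; whether SYDA's writer uses exactly this call is an assumption, and the sample data is hypothetical.

```python
import json

import pandas as pd

# Hypothetical table standing in for generated data.
df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})

# orient="records" serializes the frame as a JSON array with one object per row.
records_json = df.to_json(orient="records")
records = json.loads(records_json)
print(records)

# The same list of row objects loads straight back into a DataFrame.
restored = pd.DataFrame.from_records(records)
```

This layout is convenient for consumers that expect a list of row objects, such as REST fixtures or document stores.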
Document Output¶
When generating unstructured documents alongside structured data, SYDA saves the documents in their specified formats:
schemas = {
'Report': {
'__template__': 'true',
'__template_source__': 'templates/report.html',
'__input_file_type__': 'html',
'__output_file_type__': 'pdf',
# ...other fields
}
}
results = generator.generate_for_schemas(
schemas=schemas,
sample_sizes={"Report": 5},
output_dir="output/reports"
)
This creates:
- A Report subdirectory with the generated documents (e.g., Report_1.pdf, Report_2.pdf, etc.)
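Generated documents can be collected from that subdirectory with the standard library. This sketch assumes the `<Schema>_<n>.<ext>` naming shown above and orders files by their numeric suffix, so that `Report_10.pdf` follows `Report_2.pdf`; `list_documents` is a hypothetical helper, not a SYDA function.

```python
import re
import tempfile
from pathlib import Path


def list_documents(output_dir, schema_name, extension="pdf"):
    """Return document paths under <output_dir>/<schema_name>, ordered by numeric suffix."""
    folder = Path(output_dir) / schema_name
    pattern = re.compile(
        rf"^{re.escape(schema_name)}_(\d+)\.{re.escape(extension)}$"
    )
    numbered = []
    for path in folder.glob(f"*.{extension}"):
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


# Empty placeholder files stand in for real generated PDFs.
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "Report"
    report_dir.mkdir()
    for n in (1, 2, 10):
        (report_dir / f"Report_{n}.pdf").touch()
    names = [p.name for p in list_documents(tmp, "Report")]

print(names)
```

Sorting on the parsed number rather than the file name avoids the lexicographic ordering in which `Report_10.pdf` would sort before `Report_2.pdf`.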
Output Directory Structure¶
When using both structured data and document generation, SYDA creates an organized directory structure:
output/
├── Customer.csv
├── Order.csv
├── OrderItem.csv
├── Invoice/
│   ├── Invoice_1.pdf
│   ├── Invoice_2.pdf
│   └── ...
└── Report/
    ├── Report_1.pdf
    ├── Report_2.pdf
    └── ...
This structure makes it easy to locate and manage both structured data and generated documents.
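A short inventory script can verify that a run produced the expected layout. The sketch below treats top-level files as structured tables and subdirectories as document collections, which is an assumption based on the tree above; `inventory` is a hypothetical helper, and the demo builds a small stand-in directory.

```python
import tempfile
from pathlib import Path


def inventory(output_dir):
    """Summarize an output directory: top-level files plus a file count per subdirectory."""
    root = Path(output_dir)
    tables = sorted(p.name for p in root.iterdir() if p.is_file())
    documents = {
        p.name: sum(1 for f in p.iterdir() if f.is_file())
        for p in sorted(root.iterdir())
        if p.is_dir()
    }
    return {"tables": tables, "documents": documents}


# Build a small stand-in directory that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Customer.csv").touch()
    (root / "Order.csv").touch()
    for schema, count in (("Invoice", 2), ("Report", 1)):
        (root / schema).mkdir()
        for n in range(1, count + 1):
            (root / schema / f"{schema}_{n}.pdf").touch()
    report = inventory(root)

print(report)
```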
Working with Output Programmatically¶
After generation, you can further process or transform the output data:
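import pandas as pd
# Generate data
results = generator.generate_for_schemas(
schemas={...},
sample_sizes={"Customer": 10, "Order": 25}
)
# Column names below (annual_revenue, order_date, customer_id, id) depend on your schemas
# Process Customer data
customers = results["Customer"]
vip_customers = customers[customers["annual_revenue"] > 1000000]
# Process Order data (parse dates first so the comparison is chronological)
orders = results["Order"]
order_dates = pd.to_datetime(orders["order_date"])
recent_orders = orders[order_dates > pd.Timestamp("2023-01-01")]
# Join data for analysis
merged = orders.merge(customers, left_on="customer_id", right_on="id")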
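When joining generated tables, it can be worth checking that every foreign key actually resolves. The sketch below uses pandas' `merge` with `how="left"` and `indicator=True` to flag orders whose `customer_id` has no matching customer; the tables and column names are hypothetical stand-ins for generated data.

```python
import pandas as pd

# Hypothetical stand-ins for results["Customer"] and results["Order"].
customers = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 99]})

# A left join keeps every order; the _merge column records whether a customer matched.
checked = orders.merge(
    customers,
    how="left",
    left_on="customer_id",
    right_on="id",
    indicator=True,
)
orphans = checked[checked["_merge"] == "left_only"]
print(f"{len(orphans)} order(s) reference a missing customer")
```

An inner join would silently drop such rows, so the left join with an indicator makes referential gaps visible before analysis.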
Best Practices¶
- Use Descriptive Output Directories: Create meaningful directory names for your output
- Choose Appropriate Formats: Select output formats based on your downstream needs
- Process DataFrames Before Saving: Apply transformations before writing to disk when needed
- Check Output Size: Large sample sizes produce large files, so check the size of the output directory after big generation runs
- Backup Results: Keep the returned DataFrames for immediate use even when saving to disk
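Two of these practices, processing before saving and checking size, can be combined in a small helper. This is a sketch under stated assumptions: `save_if_small` and its `max_bytes` threshold are hypothetical, and the check uses the DataFrame's in-memory size as a rough proxy, which will differ from the size of the file on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_if_small(df, path, max_bytes=50_000_000):
    """Write df to CSV only if its in-memory size is at most max_bytes; return that size."""
    size = int(df.memory_usage(deep=True).sum())
    if size > max_bytes:
        raise ValueError(f"DataFrame is {size} bytes, above the {max_bytes}-byte limit")
    df.to_csv(path, index=False)
    return size


# Stand-in for a generated table (hypothetical data with one duplicate row).
customers = pd.DataFrame({"id": [1, 2, 2], "name": ["Ada", "Grace", "Grace"]})

# Transform first, then write the cleaned result.
cleaned = customers.drop_duplicates().reset_index(drop=True)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Customer_clean.csv"
    size = save_if_small(cleaned, target)
    written_rows = len(pd.read_csv(target))

print(f"Wrote {written_rows} rows ({size} bytes in memory)")
```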
Examples¶
Explore the SQLAlchemy Example and the YAML Example.