Custom Generators
Custom generators in SYDA provide a powerful way to control and customize the data generation process. They allow you to define specific logic for generating or transforming field values beyond what the LLM can do alone.
When to Use Custom Generators
Custom generators are particularly useful when you need to:
- Use existing data to generate values
- Ensure consistency across related records for complex relationships
- Compute values based on other fields
- Implement complex business rules
- Access data from related tables
- Post-process LLM-generated content
Creating Custom Generators
A custom generator is simply a Python function that follows this signature:
def my_custom_generator(row, col_name=None, parent_dfs=None):
    """
    Generate or transform a value for a specific field.

    Args:
        row: The current row being processed (as a pandas Series or dict-like object)
        col_name: The name of the column being generated
        parent_dfs: Dictionary of previously generated dataframes (schema name as key)

    Returns:
        The value to use for this field
    """
    # Your custom logic here
    return generated_value
Simple Example: Calculating Tax Amount
Here's a simple example that calculates tax amount based on subtotal and tax rate:
def calculate_tax(row, col_name=None):
    """Calculate tax amount based on subtotal and tax rate."""
    return round(row["subtotal"] * row["tax_rate"] / 100, 2)

def calculate_total(row, col_name=None):
    """Calculate total from subtotal, tax, and discount."""
    return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)
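Because both functions only index into `row`, they can be tried on a plain dict before being wired into SYDA. The sketch below repeats the two functions so that it runs on its own; the sample values are invented for illustration.

```python
def calculate_tax(row, col_name=None):
    """Calculate tax amount based on subtotal and tax rate (a percentage)."""
    return round(row["subtotal"] * row["tax_rate"] / 100, 2)


def calculate_total(row, col_name=None):
    """Calculate total from subtotal, tax, and discount."""
    return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)


# Illustrative values only; a real row comes from the generation pipeline.
row = {"subtotal": 100.0, "tax_rate": 8.25, "discount_amount": 5.0}

# tax_amount must exist before calculate_total reads it.
row["tax_amount"] = calculate_tax(row, col_name="tax_amount")
row["total"] = calculate_total(row, col_name="total")

print(row["tax_amount"], row["total"])  # 8.25 103.25
```

Note that `calculate_total` reads `tax_amount`, so it only works once that field already has a value in the row.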
Complex Example: Generating Receipt Items
Here's a more complex example from the retail example project that generates receipt items by accessing previously generated data from other tables:
def generate_receipt_items(row, col_name=None, parent_dfs=None):
    """Custom generator for the items field in receipts.

    Uses the Product and Transaction data from parent_dfs to populate receipt items.

    Args:
        row: The current row being processed
        col_name: The name of the column being generated
        parent_dfs: Dictionary of previously generated dataframes (schema name as key)
    """
    items = []
    subtotal = 0.0

    try:
        # Get customer ID
        customer_id = row.get('customer_id', None)

        # Use the parent_dfs parameter which contains the generated data
        if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:
            # Get Product and Transaction data from parent_dfs
            products_df = parent_dfs['Product']
            transactions_df = parent_dfs['Transaction']

            # Filter transactions for this customer if possible
            if customer_id and 'customer_id' in transactions_df.columns:
                customer_transactions = transactions_df[transactions_df['customer_id'] == customer_id]
                if len(customer_transactions) == 0:
                    # If no matching transactions, just take random ones
                    customer_transactions = transactions_df.sample(min(5, len(transactions_df)))
            else:
                # If no customer ID or column, just take random transactions
                customer_transactions = transactions_df.sample(min(5, len(transactions_df)))

            # Process each transaction to create a receipt item
            for _, tx in customer_transactions.iterrows():
                # Get product information
                if 'product_id' in tx and tx['product_id'] is not None:
                    product_matches = products_df[products_df['id'] == tx['product_id']]
                    if len(product_matches) > 0:
                        product = product_matches.iloc[0]

                        # Extract product details
                        product_name = product['name']
                        sku = product['sku']
                        unit_price = float(product['price'])
                        quantity = int(tx['quantity'])

                        # Calculate item total
                        item_total = round(quantity * unit_price, 2)

                        # Add to items list
                        items.append({
                            "product_name": product_name,
                            "sku": sku,
                            "quantity": quantity,
                            "unit_price": unit_price,
                            "item_total": item_total
                        })

                        # Update subtotal
                        subtotal += item_total

            # Update the row's subtotal
            row["subtotal"] = round(subtotal, 2)
    except Exception as e:
        # Log any errors
        print(f"Error generating receipt items: {str(e)}")

    return items
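The per-transaction product lookup above can also be expressed as a single join. The following is a minimal sketch of that alternative, not the retail project's code: `build_items` is a hypothetical helper, the sample frames are invented, and the column names (`id`, `product_id`, `name`, `sku`, `price`, `quantity`) are taken from the example above.

```python
import pandas as pd


def build_items(transactions_df, products_df):
    """Hypothetical helper: join transactions to products and build item dicts."""
    # An inner join drops transactions whose product_id has no matching product.
    merged = transactions_df.merge(
        products_df, left_on="product_id", right_on="id", how="inner"
    )
    items = []
    for _, r in merged.iterrows():
        quantity = int(r["quantity"])
        unit_price = float(r["price"])
        items.append({
            "product_name": r["name"],
            "sku": r["sku"],
            "quantity": quantity,
            "unit_price": unit_price,
            "item_total": round(quantity * unit_price, 2),
        })
    return items


# Invented sample data shaped like parent_dfs (schema name as key).
parent_dfs = {
    "Product": pd.DataFrame([
        {"id": 1, "name": "Coffee", "sku": "COF-1", "price": 4.50},
        {"id": 2, "name": "Bagel", "sku": "BAG-1", "price": 2.25},
    ]),
    "Transaction": pd.DataFrame([
        {"customer_id": 10, "product_id": 1, "quantity": 2},
        {"customer_id": 10, "product_id": 2, "quantity": 1},
        {"customer_id": 11, "product_id": 99, "quantity": 3},  # no such product
    ]),
}

items = build_items(parent_dfs["Transaction"], parent_dfs["Product"])
subtotal = round(sum(item["item_total"] for item in items), 2)
print(len(items), subtotal)  # 2 11.25
```

The join trades the explicit fallback logic (random sampling when no customer matches) for brevity, so it suits cases where unmatched rows should simply be skipped.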
Registering Custom Generators
Custom generators are registered by passing them to the generate_for_schemas or generate_for_sqlalchemy_models method via the custom_generators parameter:
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig

# Define custom generators for field calculations
def calculate_tax(row, col_name=None):
    """Calculate tax amount based on subtotal and tax rate."""
    return round(row["subtotal"] * row["tax_rate"] / 100, 2)

def calculate_total(row, col_name=None):
    """Calculate total from subtotal, tax, and discount."""
    return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)

def generate_receipt_items(row, col_name=None, parent_dfs=None):
    """Generate items for a receipt based on products and transactions."""
    # Implementation shown in earlier example
    return items

# Register custom generators - schema name as key, then field name as sub-key
custom_generators = {
    "Receipt": {
        "items": generate_receipt_items,
        "tax_amount": calculate_tax,
        "total": calculate_total
    }
}

# Initialize the generator
config = ModelConfig(
    provider="anthropic",
    model="claude-3-haiku-20240307",
    max_tokens=8192
)
generator = SyntheticDataGenerator(model_config=config)

# Use custom generators during data generation
# (schemas, sample_sizes, prompts, and OUTPUT_DIR are assumed to be defined elsewhere in your script)
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes=sample_sizes,
    prompts=prompts,
    custom_generators=custom_generators,
    output_dir=OUTPUT_DIR
)
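To make the nested layout (schema name, then field name) concrete, here is a small stand-alone sketch that walks such a mapping over one row. It is not SYDA's implementation: `apply_generators`, `call_generator`, and `fake_items` are hypothetical names used only for local experimentation, and whether SYDA evaluates fields in dictionary order is an assumption rather than something this page states.

```python
import inspect


def call_generator(func, row, col_name, parent_dfs):
    """Hypothetical: pass parent_dfs only to generators that accept it."""
    if "parent_dfs" in inspect.signature(func).parameters:
        return func(row, col_name=col_name, parent_dfs=parent_dfs)
    return func(row, col_name=col_name)


def apply_generators(schema_name, row, custom_generators, parent_dfs=None):
    """Hypothetical: run every generator registered for one schema, in dict order."""
    for field, func in custom_generators.get(schema_name, {}).items():
        row[field] = call_generator(func, row, field, parent_dfs)
    return row


def fake_items(row, col_name=None, parent_dfs=None):
    """Stand-in for generate_receipt_items; sets subtotal as a side effect."""
    row["subtotal"] = 40.0
    return [{"sku": "X-1", "quantity": 2, "unit_price": 20.0, "item_total": 40.0}]


def calculate_tax(row, col_name=None):
    return round(row["subtotal"] * row["tax_rate"] / 100, 2)


def calculate_total(row, col_name=None):
    return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)


custom_generators = {
    "Receipt": {
        "items": fake_items,          # sets subtotal
        "tax_amount": calculate_tax,  # needs subtotal
        "total": calculate_total,     # needs tax_amount
    }
}

row = apply_generators(
    "Receipt", {"tax_rate": 10.0, "discount_amount": 4.0}, custom_generators
)
print(row["subtotal"], row["tax_amount"], row["total"])  # 40.0 4.0 40.0
```

In this sketch the order of the keys matters: `total` would raise a `KeyError` if it ran before `tax_amount`, which is one reason to keep dependent fields grouped and ordered deliberately.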
Best Practices
- Keep Generators Simple: Each generator should have a single responsibility
- Handle Missing Data: Always check if required fields exist before using them
- Use Type Checking: Verify data types before performing operations on values
- Add Error Handling: Catch exceptions to prevent generator failures
- Document Your Generators: Include clear docstrings that explain functionality
- Test in Isolation: Test generators independently with sample data
- Avoid Side Effects: Unless needed (like the receipt example), generators shouldn't modify unrelated state
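Several of these practices can be combined in one small generator. The sketch below is a defensive variant of the tax calculation, assuming the same `subtotal` and `tax_rate` field names as the earlier example; `safe_calculate_tax` is a hypothetical name, and returning `None` for unusable input is only one possible fallback.

```python
import math
from numbers import Real


def safe_calculate_tax(row, col_name=None):
    """Return the tax amount, or None when inputs are missing or not numeric."""
    # .get works for both plain dicts and pandas Series.
    subtotal = row.get("subtotal")
    tax_rate = row.get("tax_rate")

    for value in (subtotal, tax_rate):
        # bool counts as a number in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            return None
        # NaN often marks a missing value in tabular data.
        if math.isnan(value):
            return None

    return round(subtotal * tax_rate / 100, 2)


# Testing in isolation with hand-written sample rows.
print(safe_calculate_tax({"subtotal": 200.0, "tax_rate": 7.5}))    # 15.0
print(safe_calculate_tax({"subtotal": 200.0}))                     # None
print(safe_calculate_tax({"subtotal": "200", "tax_rate": 7.5}))    # None
```

Depending on the schema, raising an error or substituting a default value may be a better response to bad input than returning `None`.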
Examples
To see custom generators in action, explore the SQLAlchemy Example and the YAML Example.