YAML Schemas with Mixed Content
Source code: examples/structured_and_unstructured/retail_yml/example_retail_schemas.py
This example demonstrates how to use YAML schemas to generate both structured data and unstructured document content for a retail application.
Overview
SYDA can generate both structured tabular data and unstructured document content like PDFs in a single workflow. This approach is useful when you need to maintain consistency between your structured database records and free-form documents that reference them.
In this retail example, we generate:
- Structured data: Product categories, products, customers, and transactions
- Unstructured content: Receipt documents (PDFs) that include items from transactions
Schema Definition
The retail example uses YAML schemas to define both structured data models and document templates.
Structured Data Schemas
Here's an example of the structured YAML schemas:
# category.yml
__table_description__: Product categories in the retail system
id:
  type: number
  description: Unique identifier for the category
  primary_key: true
name:
  type: text
  description: Category name
  nullable: false
parent_id:
  type: number
  description: Reference to parent category (if any, for hierarchical categories)
  nullable: true
description:
  type: text
  description: Category description
# product.yml
__table_description__: Products available for purchase in the retail system
id:
  type: number
  description: Unique identifier for the product
  primary_key: true
name:
  type: text
  description: Product name
  nullable: false
sku:
  type: text
  description: Stock keeping unit (unique product code)
  nullable: false
category_id:
  type: number
  description: Reference to product category
  nullable: false
  references: Category.id
price:
  type: number
  description: Product price in USD
  nullable: false
description:
  type: text
  description: Detailed product description
Document Template Schema
For generating unstructured documents (like receipts), we define a special schema with template processing capabilities:
# receipt.yml
__table_description__: Retail receipts with store details and itemized purchases
__template__: true
__template_source__: templates/receipt.html
__input_file_type__: html
__output_file_type__: pdf
__depends_on__:
  - Customer
  - Transaction
  - Product
id:
  type: number
  description: Unique identifier for the receipt
  primary_key: true
store_name:
  type: text
  description: Name of the retail store
  nullable: false
store_address:
  type: text
  description: Full address of the retail store
  nullable: false
transaction_date:
  type: date
  description: Date when the transaction occurred
  nullable: false
customer_id:
  type: number
  description: Reference to customer who made the purchase
  references: Customer.id
  nullable: true
cashier:
  type: text
  description: Name of the cashier who processed the transaction
  nullable: false
items:
  type: array
  description: List of items purchased in this transaction
  item_fields:
    product_name:
      type: text
      description: Name of the product purchased
    sku:
      type: text
      description: Product SKU/code
    quantity:
      type: number
      description: Quantity purchased
    unit_price:
      type: number
      description: Price per unit
    item_total:
      type: number
      description: Total price for this item (quantity * unit_price)
subtotal:
  type: number
  description: Sum of all item totals before tax and discounts
  nullable: false
tax_rate:
  type: number
  description: Tax rate applied as percentage (e.g. 7.5 for 7.5%)
  nullable: false
tax_amount:
  type: number
  description: Calculated tax amount
  nullable: false
discount_amount:
  type: number
  description: Any discounts applied to the transaction
  nullable: false
  default: 0.0
total:
  type: number
  description: Final total including tax and discounts
  nullable: false
payment_method:
  type: text
  description: Method of payment (Credit, Cash, etc)
  nullable: false
notes:
  type: text
  description: Any additional notes for the receipt
  nullable: true
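As a rough illustration of what the nullable and default keys imply for a generated receipt record, the following sketch checks a record against a hand-copied subset of the schema. The RECEIPT_FIELDS dictionary and the apply_defaults_and_check helper are hypothetical and not part of SYDA; how SYDA itself enforces these keys is not described on this page.

```python
# Hypothetical subset of receipt.yml, copied by hand into a Python dict.
RECEIPT_FIELDS = {
    "store_name": {"type": "text", "nullable": False},
    "customer_id": {"type": "number", "nullable": True},
    "discount_amount": {"type": "number", "nullable": False, "default": 0.0},
    "notes": {"type": "text", "nullable": True},
}


def apply_defaults_and_check(record, fields):
    """Fill declared defaults, then list non-nullable fields that are still empty."""
    filled = dict(record)
    missing = []
    for name, spec in fields.items():
        if filled.get(name) is None and "default" in spec:
            filled[name] = spec["default"]
        # Assumption: a field without an explicit nullable key may be empty.
        if filled.get(name) is None and not spec.get("nullable", True):
            missing.append(name)
    return filled, missing


record = {"customer_id": None, "notes": "Gift receipt"}
filled, missing = apply_defaults_and_check(record, RECEIPT_FIELDS)
print(filled["discount_amount"])  # value taken from the schema default
print(missing)                    # required fields that have no value
```

Here discount_amount falls back to its declared default, while store_name is reported because it is marked nullable: false and the record does not supply it.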
Foreign Key Handling
YAML schemas handle foreign keys through explicit references:
category_id:
  type: number
  description: Reference to product category
  nullable: false
  references: Category.id
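One way to read a references value such as Category.id is as a schema name and a column name separated by a dot. The sketch below uses plain lists of dictionaries rather than SYDA's own data structures: it splits such a value and reports child rows whose foreign key has no matching parent row. Both helper names are hypothetical.

```python
def parse_reference(ref):
    """Split 'Category.id' into ('Category', 'id')."""
    schema, sep, column = ref.partition(".")
    if not sep or not schema or not column:
        raise ValueError(f"expected 'Schema.column', got {ref!r}")
    return schema, column


def orphaned_rows(child_rows, fk_column, ref, tables):
    """Return child rows whose foreign key is set but matches no parent row."""
    schema, column = parse_reference(ref)
    valid_keys = {row[column] for row in tables[schema]}
    return [
        row for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in valid_keys
    ]


tables = {"Category": [{"id": 1, "name": "Toys"}, {"id": 2, "name": "Books"}]}
products = [
    {"id": 10, "sku": "TOY-001", "category_id": 1},
    {"id": 11, "sku": "BK-404", "category_id": 9},  # no such category
]
print(orphaned_rows(products, "category_id", "Category.id", tables))
```

A check like this can be run on generated output to confirm that every foreign key points at an existing parent record.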
Additionally, for template generation, you can specify schema dependencies using the __depends_on__ property. In receipt.yml, it lists Customer, Transaction, and Product.
This ensures that all required data is generated before processing templates.
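To make the ordering idea concrete, here is a minimal sketch that derives a valid generation order from a dependency map using the standard library's graphlib module. It is not SYDA's implementation. The Receipt entry mirrors __depends_on__ in receipt.yml and the Product entry mirrors references: Category.id in product.yml; the Transaction entry is an assumption, since transaction.yml is not shown on this page.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each schema maps to the schemas that must be generated before it.
# The Transaction entry is assumed; the others follow the schemas shown above.
DEPENDENCIES = {
    "Category": [],
    "Customer": [],
    "Product": ["Category"],
    "Transaction": ["Customer", "Product"],
    "Receipt": ["Customer", "Transaction", "Product"],
}


def generation_order(dependencies):
    """Return schema names so that every schema follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = generation_order(DEPENDENCIES)
print(order)
```

Several orders can satisfy the same constraints, so only the relative positions matter: Category before Product, and Receipt last. A circular dependency would make static_order raise graphlib.CycleError.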
Code Example
Here's how to use YAML schemas to generate mixed structured and unstructured data:
import os
import pandas as pd
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Define output directory
OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")
os.makedirs(OUTPUT_DIR, exist_ok=True)


def main():
    """Run the retail sector example."""
    # Initialize the generator with Claude
    config = ModelConfig(provider="anthropic", model="claude-3-haiku-20240307")
    generator = SyntheticDataGenerator(model_config=config)

    # Define schema directory paths
    schema_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "schemas")

    # Define paths to schema files
    schemas = {
        "Category": os.path.join(schema_dir, "category.yml"),
        "Customer": os.path.join(schema_dir, "customer.yml"),
        "Product": os.path.join(schema_dir, "product.yml"),
        "Transaction": os.path.join(schema_dir, "transaction.yml"),
        "Receipt": os.path.join(schema_dir, "receipt.yml")
    }

    # Define custom prompts for each schema (optional)
    prompts = {
        "Category": "Generate retail product categories with hierarchical structure.",
        "Product": "Generate retail products with names, SKUs, prices, and descriptions. Ensure a good variety of prices and categories.",
        "Customer": "Generate customer records for a retail business with realistic names, addresses, and email patterns.",
        "Transaction": "Generate retail transactions with realistic purchase patterns, payment methods, and item counts.",
        "Receipt": "Generate data for retail receipts including store details, transaction information, and itemized purchases."
    }

    # Define sample sizes for each schema
    sample_sizes = {
        "Category": 5,
        "Product": 25,
        "Customer": 10,
        "Transaction": 20,
        "Receipt": 5
    }

    # Helper functions for calculated fields
    def calculate_tax(row, col_name=None):
        """Calculate tax amount based on subtotal and tax rate."""
        return round(row["subtotal"] * row["tax_rate"] / 100, 2)

    def calculate_total(row, col_name=None):
        """Calculate total from subtotal, tax, and discount."""
        return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)

    def generate_receipt_items(row, col_name=None, parent_dfs=None):
        """Custom generator for the items field in receipts.

        Uses the Product and Transaction data from parent_dfs to populate receipt items.

        Args:
            row: The current row being processed
            col_name: The name of the column being generated
            parent_dfs: Dictionary of previously generated dataframes (schema name as key)
        """
        items = []
        subtotal = 0.0
        try:
            # Get customer ID
            customer_id = row.get('customer_id', None)

            # Use the parent_dfs parameter which contains the generated data
            if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:
                # Get Product data from parent_dfs
                products_df = parent_dfs['Product']
                transactions_df = parent_dfs['Transaction']
            else:
                print(" - Required parent dataframes not available")
                return []

            # Filter transactions for this customer if possible
            if customer_id and 'customer_id' in transactions_df.columns:
                customer_transactions = transactions_df[transactions_df['customer_id'] == customer_id]
                if len(customer_transactions) == 0:
                    print(f" - No transactions found for customer ID {customer_id}, using random transactions")
                    # If no matching transactions, just take random ones
                    customer_transactions = transactions_df.sample(min(5, len(transactions_df)))
                else:
                    print(f" - Found {len(customer_transactions)} transactions for customer ID {customer_id}")
            else:
                print(" - No customer ID or customer_id column, using random transactions")
                # If no customer ID or column, just take random transactions
                customer_transactions = transactions_df.sample(min(5, len(transactions_df)))

            # Process each transaction to create a receipt item
            for _, tx in customer_transactions.iterrows():
                # Get product information
                if 'product_id' in tx and tx['product_id'] is not None:
                    product_matches = products_df[products_df['id'] == tx['product_id']]
                    if len(product_matches) > 0:
                        product = product_matches.iloc[0]
                        # Extract product details
                        product_name = product['name']
                        sku = product['sku']
                        unit_price = float(product['price'])
                        quantity = int(tx['quantity'])
                        # Calculate item total
                        item_total = round(quantity * unit_price, 2)
                        # Add to items list
                        items.append({
                            "product_name": product_name,
                            "sku": sku,
                            "quantity": quantity,
                            "unit_price": unit_price,
                            "item_total": item_total
                        })
                        # Update subtotal
                        subtotal += item_total

            # Update the row's subtotal
            row["subtotal"] = round(subtotal, 2)
            # If no items, set subtotal to 0
            if not items:
                row["subtotal"] = 0.0
            return items
        except Exception as e:
            print(f"Error generating receipt items: {str(e)}")
            return []

    # Custom generators dictionary
    custom_generators = {
        "Receipt": {
            "items": generate_receipt_items,
            "tax_amount": calculate_tax,
            "total": calculate_total
        }
    }

    # Generate data for all schemas in a single step
    # The __depends_on__ mechanism ensures the correct generation order
    results = generator.generate_for_schemas(
        schemas=schemas,
        prompts=prompts,
        sample_sizes=sample_sizes,
        output_dir=OUTPUT_DIR,
        custom_generators=custom_generators
    )


# Run the example when the file is executed as a script
if __name__ == "__main__":
    main()
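The tax_amount and total generators read values that other generators write: generate_receipt_items stores subtotal on the row, calculate_tax needs it, and calculate_total needs tax_amount. The sketch below applies the two arithmetic helpers to a plain dictionary in that order to show the data flow. The apply_generators helper is hypothetical, and whether SYDA invokes custom generators in dictionary order is an assumption, not something this page states.

```python
def calculate_tax(row, col_name=None):
    """Tax amount from subtotal and a percentage tax rate."""
    return round(row["subtotal"] * row["tax_rate"] / 100, 2)


def calculate_total(row, col_name=None):
    """Total from subtotal, tax, and discount."""
    return round(row["subtotal"] + row["tax_amount"] - row["discount_amount"], 2)


def apply_generators(row, generators):
    """Hypothetical stand-in: fill each column in dict order, one after another."""
    for col_name, func in generators.items():
        row[col_name] = func(row, col_name)
    return row


# In the full example, subtotal is set by generate_receipt_items as a side effect.
row = {"subtotal": 100.0, "tax_rate": 7.5, "discount_amount": 5.0}
apply_generators(row, {"tax_amount": calculate_tax, "total": calculate_total})
print(row["tax_amount"], row["total"])
```

In this sketch, listing total before tax_amount would raise a KeyError, because calculate_total reads a tax_amount that has not been written yet.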
Key Features
- Mixed Content Generation: Generate both structured data and documents in one workflow
- Template Processing: Convert structured data into formatted documents (HTML → PDF)
- Cross-Reference Consistency: Ensure generated documents reference valid structured data
- Custom Generators: Define functions that calculate values or extract data from other schemas
- Dependency Management: Define which schemas must be generated before others
Document Template Files
The __template_source__ property in the Receipt schema references an HTML template file (templates/receipt.html):
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>{{store_name}} - Receipt</title>
<style>
body {
font-family: 'Arial', sans-serif;
line-height: 1.5;
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
.header {
text-align: center;
margin-bottom: 20px;
border-bottom: 2px solid #333;
padding-bottom: 10px;
}
.receipt-details {
margin-bottom: 20px;
}
.items-table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
margin-bottom: 20px;
}
.items-table th, .items-table td {
border-bottom: 1px solid #ddd;
padding: 10px;
text-align: left;
}
.items-table th {
background-color: #f5f5f5;
}
.amount-section {
width: 50%;
float: right;
margin-top: 10px;
}
.amount-row {
display: flex;
justify-content: space-between;
margin-bottom: 5px;
}
.final-total {
font-weight: bold;
font-size: 1.2em;
margin-top: 10px;
padding-top: 10px;
border-top: 2px solid #333;
}
.footer {
margin-top: 30px;
text-align: center;
font-size: 0.9em;
color: #666;
}
</style>
</head>
<body>
<div class="header">
<h1>{{store_name}}</h1>
<p>{{store_address}}</p>
</div>
<div class="receipt-details">
<p><strong>Receipt #:</strong> {{id}}</p>
<p><strong>Date:</strong> {{transaction_date}}</p>
<p><strong>Cashier:</strong> {{cashier}}</p>
{% if customer_id %}
<p><strong>Customer ID:</strong> {{customer_id}}</p>
{% endif %}
</div>
<table class="items-table">
<thead>
<tr>
<th>Item</th>
<th>SKU</th>
<th>Quantity</th>
<th>Unit Price</th>
<th>Total</th>
</tr>
</thead>
<tbody>
{% for item in items %}
<tr>
<td>{{item.product_name}}</td>
<td>{{item.sku}}</td>
<td>{{item.quantity}}</td>
<td>${{item.unit_price}}</td>
<td>${{item.item_total}}</td>
</tr>
{% endfor %}
</tbody>
</table>
<div class="amount-section">
<div class="amount-row">
<span>Subtotal:</span>
<span>${{subtotal}}</span>
</div>
<div class="amount-row">
<span>Tax ({{tax_rate}}%):</span>
<span>${{tax_amount}}</span>
</div>
{% if discount_amount > 0 %}
<div class="amount-row">
<span>Discount:</span>
<span>-${{discount_amount}}</span>
</div>
{% endif %}
<div class="amount-row final-total">
<span>Total:</span>
<span>${{total}}</span>
</div>
</div>
<div style="clear: both;"></div>
<div class="payment-section">
<p><strong>Payment Method:</strong> {{payment_method}}</p>
{% if notes %}
<p><strong>Notes:</strong> {{notes}}</p>
{% endif %}
</div>
<div class="footer">
<p>Thank you for shopping at {{store_name}}!</p>
<p>Please keep this receipt for your records.</p>
</div>
</body>
</html>
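Each placeholder in the template is meant to correspond to a field in receipt.yml, so a misspelled name would not pick up any generated value. As a rough cross-check, the sketch below pulls the top-level names out of {{ ... }}, {% if ... %}, and {% for ... in ... %} tags with regular expressions and compares them with a list of schema fields. It is a simplified scan written for the syntax used in this template, not a real template parser, and both function names are hypothetical.

```python
import re

# Patterns for the three tag shapes that appear in the receipt template.
VAR_RE = re.compile(r"\{\{\s*([A-Za-z_]\w*)(?:\.\w+)*\s*\}\}")
FOR_RE = re.compile(r"\{%\s*for\s+(\w+)\s+in\s+(\w+)\s*%\}")
IF_RE = re.compile(r"\{%\s*if\s+(\w+)")


def template_fields(template):
    """Collect top-level names the template reads, excluding loop variables."""
    loop_vars = {m.group(1) for m in FOR_RE.finditer(template)}
    names = {m.group(1) for m in VAR_RE.finditer(template)}
    names |= {m.group(2) for m in FOR_RE.finditer(template)}
    names |= {m.group(1) for m in IF_RE.finditer(template)}
    return names - loop_vars


def missing_fields(template, schema_fields):
    """Names used by the template that the schema does not define."""
    return template_fields(template) - set(schema_fields)


sample = """
<h1>{{store_name}}</h1>
{% if customer_id %}<p>{{customer_id}}</p>{% endif %}
{% for item in items %}<td>{{item.sku}}</td>{% endfor %}
<span>${{total}}</span>
"""
print(sorted(template_fields(sample)))
print(sorted(missing_fields(sample, ["store_name", "customer_id", "items"])))
```

The second call reports total, since it is used in the sample template but left out of the field list passed in.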
Best Practices
- Define Dependencies: Use __depends_on__ to specify which schemas must be generated first
- Use Custom Generators: Create custom generators for complex logic like populating document items
- Access Parent Data: Use the parent_dfs parameter in custom generators to access previously generated data
- Calculate Derived Values: Use simple functions for calculated fields like totals and taxes
- Design Clean Templates: Create well-structured templates with appropriate styling and conditional sections
Sample Outputs
You can view sample outputs generated using these YAML schemas and templates here: