Deployment Process
YeboLearn's deployment process ensures zero-downtime releases, rapid rollback capability, and comprehensive monitoring. We deploy multiple times daily to dev, weekly to staging, and bi-weekly to production.
Deployment Architecture
Infrastructure Overview
```
Google Cloud Platform
├── Cloud Run (Container Platform)
│   ├── Production Service (api.yebolearn.app)
│   ├── Staging Service (staging.yebolearn.app)
│   └── Dev Service (dev-api.yebolearn.app)
├── Cloud SQL (PostgreSQL 15)
│   ├── Production Database
│   ├── Staging Database
│   └── Dev Database
├── Artifact Registry (Docker Images)
├── Cloud Storage (Static Assets, Backups)
├── Cloud Load Balancer
└── Cloud Logging & Monitoring
```
Container Strategy
Docker Multi-Stage Build:
```dockerfile
# Build stage: install all dependencies (dev included) so the build can run
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage: only runtime dependencies ship in the final image
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
ENV NODE_ENV=production
EXPOSE 8080
CMD ["node", "dist/server.js"]
```
Image Optimization:
- Multi-stage builds (reduces size by 60%)
- Alpine base image (smaller footprint)
- Layer caching (faster builds)
- Security scanning before deployment
Environment Strategy
Environment Configuration
| Environment | URL | Database | Purpose | Deploy Trigger |
|---|---|---|---|---|
| Development | dev-api.yebolearn.app | Dev DB (small instance) | Active development, testing | Auto on merge to dev |
| Staging | staging.yebolearn.app | Staging DB (production replica) | Pre-prod validation, QA | Weekly from dev |
| Production | api.yebolearn.app | Production DB (high availability) | Live users | Bi-weekly release |
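For code that needs to branch on environment, the table can be mirrored as a typed registry (a sketch; the `Env` type and `apiBaseUrl` helper are assumptions, the URLs come from the table above):

```typescript
// Typed registry mirroring the environment table above.
type Env = 'development' | 'staging' | 'production';

interface EnvConfig {
  url: string;
  deployTrigger: string;
}

const environments: Record<Env, EnvConfig> = {
  development: { url: 'dev-api.yebolearn.app', deployTrigger: 'auto on merge to dev' },
  staging:     { url: 'staging.yebolearn.app', deployTrigger: 'weekly from dev' },
  production:  { url: 'api.yebolearn.app',     deployTrigger: 'bi-weekly release' },
};

// Resolve the API base URL for a given environment.
function apiBaseUrl(env: Env): string {
  return `https://${environments[env].url}`;
}
```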
Environment Variables
Managed via Google Secret Manager:
```bash
# Development
NODE_ENV=development
DATABASE_URL=postgresql://dev_db_connection
GEMINI_API_KEY=dev_key_with_limits
MPESA_CONSUMER_KEY=test_key
LOG_LEVEL=debug

# Staging
NODE_ENV=staging
DATABASE_URL=postgresql://staging_db_connection
GEMINI_API_KEY=staging_key_production_like
MPESA_CONSUMER_KEY=sandbox_key
LOG_LEVEL=info

# Production
NODE_ENV=production
DATABASE_URL=postgresql://prod_db_connection
GEMINI_API_KEY=production_key
MPESA_CONSUMER_KEY=production_key
LOG_LEVEL=warn
```
Environment Isolation
Development:
- Relaxed rate limits
- Debug logging enabled
- Test payment credentials
- Mock external services (when needed)
- Sample data in database
Staging:
- Production-like configuration
- Real integrations in test mode
- Anonymized production data copy
- Performance monitoring
- QA and stakeholder access
Production:
- Optimized for performance
- Strict rate limits
- Minimal logging (errors/warnings)
- Real payment processing
- High availability configuration
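The isolation rules above can be centralized in one settings object keyed by `NODE_ENV` (a sketch; the concrete limits are illustrative assumptions, not actual YeboLearn values):

```typescript
// Per-environment runtime settings reflecting the isolation rules above.
interface RuntimeSettings {
  rateLimitPerMinute: number;
  logLevel: 'debug' | 'info' | 'warn';
  mockExternalServices: boolean;
}

const settings: Record<string, RuntimeSettings> = {
  development: { rateLimitPerMinute: 1000, logLevel: 'debug', mockExternalServices: true },
  staging:     { rateLimitPerMinute: 300,  logLevel: 'info',  mockExternalServices: false },
  production:  { rateLimitPerMinute: 100,  logLevel: 'warn',  mockExternalServices: false },
};

function settingsFor(nodeEnv: string): RuntimeSettings {
  // Fall back to the strictest (production) settings for unknown values.
  return settings[nodeEnv] ?? settings.production;
}
```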
Deployment Workflows
Development Deployment
Trigger: Merge to dev branch
Process:
```yaml
# .github/workflows/deploy-dev.yml
name: Deploy to Development

on:
  push:
    branches: [dev]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Tests
        run: npm test
      - name: Build Docker Image
        run: |
          docker build -t gcr.io/yebolearn/api:dev-${{ github.sha }} .
          docker tag gcr.io/yebolearn/api:dev-${{ github.sha }} gcr.io/yebolearn/api:dev-latest
      - name: Push to Artifact Registry
        run: |
          docker push gcr.io/yebolearn/api:dev-${{ github.sha }}
          docker push gcr.io/yebolearn/api:dev-latest
      - name: Deploy to Cloud Run
        run: |
          gcloud run deploy yebolearn-dev \
            --image gcr.io/yebolearn/api:dev-${{ github.sha }} \
            --platform managed \
            --region africa-south1 \
            --allow-unauthenticated
      - name: Run Smoke Tests
        run: npm run test:smoke -- --env=dev
      - name: Notify Team
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text": "Dev deployment successful: ${{ github.sha }}"}'
```
Timeline:
- Tests: 3 minutes
- Build: 2 minutes
- Deploy: 1 minute
- Smoke tests: 30 seconds
- Total: ~7 minutes
Staging Deployment
Trigger: Manual or weekly schedule
Process:
```bash
# Manual staging deployment
git checkout staging
git merge dev
git push origin staging

# CI/CD takes over:
# 1. Run full test suite
# 2. Build and tag image
# 3. Deploy to staging environment
# 4. Run integration tests
# 5. Notify QA team
```
Pre-Deployment Checklist:
- [ ] All dev tests passing
- [ ] Features validated in dev
- [ ] Database migrations prepared
- [ ] QA team notified
- [ ] Stakeholder demo scheduled
Timeline:
- Tests: 5 minutes
- Build: 2 minutes
- Database migration: 1-5 minutes
- Deploy: 2 minutes
- Integration tests: 3 minutes
- Total: ~15 minutes
Production Deployment
Trigger: Bi-weekly release (Thursday 10 AM)
Process:
1. Pre-Deployment (Tuesday-Wednesday)
```bash
# Create release branch
git checkout -b release/v2.5.0 staging

# Final testing
npm run test:all
npm run test:e2e

# Generate changelog
npm run changelog

# Update version
npm version minor -m "Release v2.5.0: AI Essay Grading"
```
2. Deployment Day (Thursday 10 AM)
```bash
# Backup production database
gcloud sql backups create \
  --instance=yebolearn-prod-db \
  --description="Pre-deployment backup v2.5.0"

# Tag release
git tag -a v2.5.0 -m "Release v2.5.0"
git push origin v2.5.0

# Merge to main
git checkout main
git merge release/v2.5.0
git push origin main

# GitHub Actions triggered automatically
```
3. Blue-Green Deployment
```yaml
# Automatic via CI/CD
steps:
  - name: Deploy New Revision (Green)
    run: |
      gcloud run deploy yebolearn-api \
        --image gcr.io/yebolearn/api:v2.5.0 \
        --no-traffic \
        --tag green
  - name: Health Check Green
    run: |
      curl -f https://green.yebolearn.app/health
      npm run test:smoke -- --env=green
  - name: Run Database Migrations
    run: npm run migrate:prod
  - name: Switch Traffic to Green
    run: |
      gcloud run services update-traffic yebolearn-api \
        --to-tags=green=100
  - name: Monitor for 10 Minutes
    run: |
      sleep 600
      # Check error rates, response times, etc.
  - name: Decommission Blue (if successful)
    run: |
      # Delete the previous (blue) revision once green is stable
      gcloud run revisions delete yebolearn-api-blue --quiet
```
4. Post-Deployment
```bash
# Monitor critical metrics:
# - Error rate
# - Response time
# - Database performance
# - Payment success rate

# Verify key user flows
npm run test:smoke:critical

# Update status page
# Notify team of successful deployment
```
Timeline:
- Backup: 5 minutes
- Build & test: 8 minutes
- Deploy green: 3 minutes
- Migrations: 2-10 minutes
- Traffic switch: 1 minute
- Monitoring period: 10 minutes
- Total: ~30 minutes
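As a sanity check, the stage estimates above can be summed; with migrations ranging 2-10 minutes, the total falls between 29 and 37 minutes, consistent with the ~30-minute figure:

```typescript
// Production deployment stages with [min, max] duration estimates in minutes,
// taken from the timeline above.
const stages: Array<[string, number, number]> = [
  ['backup', 5, 5],
  ['build & test', 8, 8],
  ['deploy green', 3, 3],
  ['migrations', 2, 10],
  ['traffic switch', 1, 1],
  ['monitoring period', 10, 10],
];

const minTotal = stages.reduce((sum, [, lo]) => sum + lo, 0);
const maxTotal = stages.reduce((sum, [, , hi]) => sum + hi, 0);
```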
Database Migrations
Migration Strategy
Development:
```bash
# Create migration
npm run migrate:create add_ai_essay_grading

# Apply migration
npm run migrate:dev

# Test rollback
npm run migrate:rollback:dev
```
Production:
```bash
# Migrations run automatically during deployment,
# but are tested thoroughly in staging first.

# Zero-downtime pattern:
# 1. Add new column (nullable)
# 2. Deploy code that writes to both old and new
# 3. Backfill data
# 4. Deploy code that reads from new
# 5. Remove old column (next release)
```
Migration Best Practices:
- Always reversible (down migration)
- Test on staging first
- Backup before running
- Monitor performance impact
- Use indexes for large tables
- Avoid blocking operations in production
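Step 2 of the zero-downtime pattern ("write to both old and new") can be sketched at the application layer (the column names and row shape here are hypothetical, for illustration only):

```typescript
// Dual-write during the expand/contract migration: either code version
// can read consistent data while the rollout is in flight.
interface EssayRow {
  grade_text?: string; // old column (removed next release, step 5)
  ai_feedback?: object; // new jsonb column (read path switches here in step 4)
}

function buildUpdate(feedback: { summary: string }): EssayRow {
  return {
    grade_text: feedback.summary, // keep legacy readers working
    ai_feedback: feedback,        // new readers use this after step 4
  };
}
```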
Example Migration
```typescript
// migrations/20251122_add_essay_grading.ts
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.createTable('essay_submissions', (table) => {
    table.uuid('id').primary().defaultTo(knex.raw('gen_random_uuid()'));
    table.uuid('student_id').notNullable().references('id').inTable('students');
    table.text('content').notNullable();
    table.jsonb('ai_feedback').nullable();
    table.integer('score').nullable();
    table.timestamps(true, true);
    table.index('student_id');
    table.index('created_at');
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('essay_submissions');
}
```
Rollback Procedures
Automatic Rollback
Triggers:
- Error rate >5% for 2 minutes
- Response time >2s (p95) for 5 minutes
- Health check failures
- Critical API endpoints down
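These threshold-over-duration conditions reduce to a windowed check (a sketch; the sampling cadence and metric shape are assumptions):

```typescript
// Returns true when every sample in the trailing window breaches the
// threshold, e.g. error rate > 5% across a 2-minute window of 30-second
// samples (windowSize = 4). A single good sample resets the condition.
function shouldRollback(samples: number[], threshold: number, windowSize: number): boolean {
  if (samples.length < windowSize) return false; // not enough data yet
  const window = samples.slice(-windowSize);
  return window.every((v) => v > threshold);
}
```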
Process:
```yaml
# Automatic rollback in CI/CD
- name: Monitor Deployment
  run: |
    # Check error rate every 30 seconds for 10 minutes
    for i in {1..20}; do
      error_rate=$(curl -s https://api.yebolearn.app/metrics/errors)
      if [ "$error_rate" -gt 5 ]; then
        echo "Error rate too high, rolling back"
        gcloud run services update-traffic yebolearn-api \
          --to-revisions=yebolearn-api-blue=100
        exit 1
      fi
      sleep 30
    done
```
Manual Rollback
Quick Rollback (Revert Traffic):
```bash
# List recent revisions
gcloud run revisions list --service=yebolearn-api

# Switch traffic back to previous version
gcloud run services update-traffic yebolearn-api \
  --to-revisions=yebolearn-api-v2.4.9=100

# Verify rollback
curl https://api.yebolearn.app/health
npm run test:smoke:critical
```
Timeline: 2-3 minutes
Database Rollback (If Needed):
```bash
# Only if the migration is problematic.
# Use with extreme caution.

# Restore from backup
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db

# Or run down migration
npm run migrate:rollback:prod

# Then redeploy the previous version
```
Timeline: 10-30 minutes
Rollback Decision Tree
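The tree below reduces to a small helper (a sketch; the 30-minute threshold comes from this section, the type and function names are assumptions):

```typescript
type Impact = 'critical' | 'partial' | 'minor';
type Action = 'rollback' | 'hotfix' | 'next-release';

// Mirrors the rollback decision tree: critical issues roll back
// immediately; partial outages weigh estimated fix time against the
// 30-minute window; minor issues wait for the next release.
function decide(impact: Impact, estimatedFixMinutes: number): Action {
  if (impact === 'critical') return 'rollback';
  if (impact === 'partial') {
    return estimatedFixMinutes < 30 ? 'hotfix' : 'rollback';
  }
  return 'next-release';
}
```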
```
Is production broken?
├─ Yes: Critical issue (payments, data loss, security)
│   └─> Immediate rollback (2 minutes)
├─ Partial: Some users affected, workaround exists
│   └─> Evaluate fix time vs rollback
│       ├─ Fix <30 min → Hotfix
│       └─ Fix >30 min → Rollback
└─ No: Minor issue, low impact
    └─> Schedule fix for next release
```
Monitoring and Alerts
Health Checks
Endpoint: /health
```typescript
export async function healthCheck(): Promise<HealthStatus> {
  const checks = await Promise.all([
    checkDatabase(),
    checkRedisCache(),
    checkGeminiAPI(),
    checkPaymentGateway(),
  ]);

  const healthy = checks.every(c => c.status === 'healthy');

  return {
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date(),
    checks: {
      database: checks[0],
      cache: checks[1],
      gemini: checks[2],
      payments: checks[3],
    },
    version: process.env.APP_VERSION,
  };
}
```
Response:
```json
{
  "status": "healthy",
  "timestamp": "2025-11-22T10:30:00Z",
  "checks": {
    "database": { "status": "healthy", "latency": "12ms" },
    "cache": { "status": "healthy", "latency": "2ms" },
    "gemini": { "status": "healthy", "latency": "145ms" },
    "payments": { "status": "healthy", "latency": "234ms" }
  },
  "version": "v2.5.0"
}
```
Metrics Tracking
Key Metrics:
Application Performance:
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate (%)
- Active users (concurrent)

Business Metrics:
- Quiz completions/hour
- AI features usage
- Payment success rate
- Course enrollments

Infrastructure:
- CPU utilization (%)
- Memory usage (%)
- Database connections
- Container restarts

Monitoring Stack:
- Metrics Collection: Prometheus
- Visualization: Grafana
- Logging: Google Cloud Logging
- Tracing: Google Cloud Trace
- Error Tracking: Sentry
- Uptime Monitoring: UptimeRobot
- Alerting: PagerDuty

Alert Configuration
Critical Alerts (Page On-Call):
```yaml
- name: API Down
  condition: uptime < 99% for 2 minutes
  severity: critical
  notify: pagerduty

- name: High Error Rate
  condition: error_rate > 5% for 3 minutes
  severity: critical
  notify: pagerduty

- name: Payment Failures
  condition: payment_failure_rate > 10% for 5 minutes
  severity: critical
  notify: pagerduty

- name: Database Connection Pool Exhausted
  condition: db_connections > 90% for 2 minutes
  severity: critical
  notify: pagerduty
```
Warning Alerts (Slack):
```yaml
- name: Elevated Response Time
  condition: p95_response_time > 1s for 10 minutes
  severity: warning
  notify: slack

- name: Increased Error Rate
  condition: error_rate > 2% for 10 minutes
  severity: warning
  notify: slack

- name: High Memory Usage
  condition: memory_usage > 80% for 15 minutes
  severity: warning
  notify: slack
```
Logging Strategy
Log Levels:
```typescript
// Production: WARN and ERROR only
logger.error('Payment processing failed', {
  userId,
  transactionId,
  error: err.message,
});

logger.warn('Gemini API rate limit approaching', {
  currentUsage: 850,
  limit: 1000,
});

// Development/Staging: include INFO and DEBUG
logger.info('Quiz generated successfully', {
  quizId,
  questionCount,
  generationTime,
});

logger.debug('Database query executed', {
  query,
  duration,
  rowCount,
});
```
Structured Logging:
```typescript
import { logger } from './logger';

// Good: structured with context
logger.error('Payment failed', {
  event: 'payment_failure',
  userId: 'user-123',
  amount: 500,
  provider: 'mpesa',
  errorCode: 'TIMEOUT',
  transactionId: 'txn-456',
  timestamp: new Date(),
});

// Bad: unstructured string
logger.error('Payment failed for user-123 amount 500');
```
Performance Optimization
CDN and Caching
Static Assets:
- Served from Google Cloud CDN
- Cache-Control headers configured
- Versioned filenames for cache busting
- Compressed (gzip/brotli)
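Versioned filenames for cache busting can be derived from a content hash, so a changed asset gets a new URL while unchanged assets stay cached (a sketch using Node's `crypto`; the naming scheme is an assumption):

```typescript
import { createHash } from 'node:crypto';

// Derive a versioned filename from the asset's content hash:
// app.js -> app.<8-hex-chars>.js. Unchanged content keeps the same
// name, so existing CDN cache entries remain valid.
function versionedName(filename: string, content: string): string {
  const hash = createHash('sha256').update(content).digest('hex').slice(0, 8);
  const dot = filename.lastIndexOf('.');
  return `${filename.slice(0, dot)}.${hash}${filename.slice(dot)}`;
}
```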
API Caching:
```typescript
// Redis for frequently accessed data
import { redis } from './cache';

export async function getQuiz(quizId: string) {
  // Check cache first
  const cached = await redis.get(`quiz:${quizId}`);
  if (cached) return JSON.parse(cached);

  // Fetch from database
  const quiz = await db.quiz.findUnique({ where: { id: quizId } });

  // Cache for 1 hour
  await redis.set(`quiz:${quizId}`, JSON.stringify(quiz), 'EX', 3600);
  return quiz;
}
```
Database Optimization
Connection Pooling:
```typescript
// Prisma configuration
// Note: Prisma's pool size is set on the connection string, not as a
// client option, e.g. DATABASE_URL=postgresql://...?connection_limit=10
// (conservative for serverless platforms like Cloud Run).
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
});
```
Query Optimization:
- Indexes on frequently queried columns
- Avoid N+1 queries (use includes/joins)
- Pagination for large result sets
- Database query monitoring
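Pagination for large result sets is cheapest as keyset (cursor) pagination rather than OFFSET, since deep pages then cost the same as page one on an indexed column. An in-memory sketch of the idea (the SQL equivalent is `WHERE id > cursor ORDER BY id LIMIT n`; names are assumptions):

```typescript
interface Page<T> {
  items: T[];
  nextCursor: string | null;
}

// Keyset pagination over rows sorted ascending by id.
function paginate<T extends { id: string }>(rows: T[], cursor: string | null, limit: number): Page<T> {
  const after = cursor;
  // Skip everything up to and including the cursor (SQL: WHERE id > cursor).
  const start = after === null ? 0 : rows.findIndex((r) => r.id > after);
  const items = start < 0 ? [] : rows.slice(start, start + limit);
  // A full page means there may be more; hand back the last id as the cursor.
  const nextCursor = items.length === limit && items.length > 0 ? items[items.length - 1].id : null;
  return { items, nextCursor };
}
```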
Container Optimization
Resource Limits:
```yaml
# Cloud Run configuration
resources:
  limits:
    cpu: "2"
    memory: "1Gi"
  requests:
    cpu: "1"
    memory: "512Mi"

autoscaling:
  minInstances: 1    # Always one warm instance
  maxInstances: 100  # Scale up to handle load
  targetCPU: 70      # Scale when CPU > 70%
  targetMemory: 80   # Scale when memory > 80%
```
Deployment Checklist
Pre-Deployment
- [ ] All tests passing (unit, integration, E2E)
- [ ] Code reviewed and approved
- [ ] Database migrations tested in staging
- [ ] Feature flags configured
- [ ] Monitoring dashboards prepared
- [ ] Rollback plan documented
- [ ] On-call engineer identified
- [ ] Stakeholders notified
During Deployment
- [ ] Backup database
- [ ] Deploy to green environment
- [ ] Run health checks
- [ ] Execute migrations
- [ ] Switch traffic gradually
- [ ] Monitor error rates
- [ ] Verify critical flows
- [ ] Check business metrics
Post-Deployment
- [ ] Monitor for 30 minutes
- [ ] Run smoke tests
- [ ] Check logs for errors
- [ ] Verify integrations working
- [ ] Update status page
- [ ] Document any issues
- [ ] Notify team of completion
- [ ] Schedule retrospective (if issues)
Disaster Recovery
Backup Strategy
Database Backups:
- Automated daily backups (retained 30 days)
- Pre-deployment backups (retained 7 days)
- Weekly full backups (retained 90 days)
- Point-in-time recovery (7 days)
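The retention windows above can be encoded as a pruning predicate (a sketch; the `BackupKind` names are assumptions):

```typescript
type BackupKind = 'daily' | 'pre-deployment' | 'weekly-full';

// Retention in days, per the backup strategy above.
const retentionDays: Record<BackupKind, number> = {
  daily: 30,
  'pre-deployment': 7,
  'weekly-full': 90,
};

// True when a backup has outlived its retention window and may be pruned.
function isExpired(kind: BackupKind, createdAt: Date, now: Date): boolean {
  const ageDays = (now.getTime() - createdAt.getTime()) / 86_400_000;
  return ageDays > retentionDays[kind];
}
```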
Restore Process:
```bash
# List available backups
gcloud sql backups list --instance=yebolearn-prod-db

# Restore from backup
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db \
  --backup-project=yebolearn-prod

# Verify data integrity
npm run db:verify
```
Application State:
- Docker images retained indefinitely
- Git tags for all releases
- Configuration in version control
- Secrets in Secret Manager (versioned)
Incident Response
Severity Levels:
P0 (Critical): Complete service outage, data loss risk
- Response time: Immediate
- Escalation: Page on-call + management
- Communication: Status page + email users
P1 (High): Major feature broken, payment issues
- Response time: 15 minutes
- Escalation: On-call engineer
- Communication: Status page update
P2 (Medium): Minor feature degraded
- Response time: 2 hours
- Escalation: Team Slack
- Communication: Internal only
P3 (Low): Cosmetic issues, minor bugs
- Response time: Next business day
- Escalation: Linear ticket
- Communication: None required
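The severity matrix can be encoded once and shared by alert routing and runbooks (a sketch; the field and function names are assumptions, the values come from the list above):

```typescript
type SeverityLevel = 'P0' | 'P1' | 'P2' | 'P3';

interface ResponsePolicy {
  responseTime: string;
  escalation: string;
  communication: string;
}

// Encodes the severity levels above for use by alert routing.
const policies: Record<SeverityLevel, ResponsePolicy> = {
  P0: { responseTime: 'immediate',         escalation: 'page on-call + management', communication: 'status page + email users' },
  P1: { responseTime: '15 minutes',        escalation: 'on-call engineer',          communication: 'status page update' },
  P2: { responseTime: '2 hours',           escalation: 'team Slack',                communication: 'internal only' },
  P3: { responseTime: 'next business day', escalation: 'Linear ticket',             communication: 'none required' },
};

function policyFor(level: SeverityLevel): ResponsePolicy {
  return policies[level];
}
```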
Related Documentation
- Development Workflow - Overall workflow
- Git Conventions - Release tagging
- Monitoring - Detailed monitoring setup
- Infrastructure - Architecture details