Monitoring and Observability

Comprehensive monitoring enables proactive incident response and data-driven optimization. YeboLearn maintains 99.9% uptime through robust observability practices.

Monitoring Philosophy

Three Pillars of Observability

1. Logs (What happened)

  • Structured event records
  • Debugging and auditing
  • Historical analysis

2. Metrics (How much/many)

  • Time-series numerical data
  • Performance trends
  • Alerting thresholds

3. Traces (Request flow)

  • End-to-end request tracking
  • Latency breakdown
  • Dependency mapping

Monitoring Goals

Proactive Over Reactive:

  • Detect issues before users report them
  • Alert on trends, not just failures
  • Prevent incidents through early warnings

Actionable Over Comprehensive:

  • Monitor what matters
  • Every alert must be actionable
  • Reduce noise, increase signal

Fast Mean Time to Detection (MTTD):

  • Target: <2 minutes
  • Current: 2 minutes
  • Real-time monitoring and alerting

Fast Mean Time to Resolution (MTTR):

  • Target: <1 hour
  • Current: 25 minutes
  • Quick access to relevant data

Monitoring Stack

Infrastructure Monitoring

Google Cloud Monitoring:

yaml
Platform Metrics:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O
- Network traffic

Cloud Run:
- Container instances
- Request count
- Request latency
- Cold starts
- Error rate

Cloud SQL:
- CPU/Memory usage
- Connections (active/max)
- Query performance
- Replication lag (HA mode)
- Storage usage

Dashboard Example:

Infrastructure Health Dashboard
├─ Cloud Run
│   ├─ Active instances: 2 (avg), 8 (max)
│   ├─ CPU: 35% (avg), 78% (peak)
│   ├─ Memory: 68% (avg), 85% (peak)
│   └─ Cold starts: <1% of requests
├─ Cloud SQL
│   ├─ CPU: 45% (avg), 82% (peak)
│   ├─ Memory: 62%
│   ├─ Connections: 12/25
│   ├─ Query time: 35ms (avg)
│   └─ Storage: 38GB/100GB
└─ Network
    ├─ Ingress: 45 MB/s (avg)
    ├─ Egress: 32 MB/s (avg)
    └─ Latency: 12ms (avg)

Application Monitoring

Custom Metrics (Prometheus):

typescript
// Metrics instrumentation
import { Counter, Histogram, Gauge } from 'prom-client';

// Request counter
export const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// Request duration
export const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5],
});

// Active users
export const activeUsers = new Gauge({
  name: 'active_users_total',
  help: 'Currently active users',
});

// Usage in middleware
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequests.inc({
      method: req.method,
      route: req.route?.path || 'unknown',
      status: res.statusCode,
    });

    httpDuration.observe(
      {
        method: req.method,
        route: req.route?.path || 'unknown',
        status: res.statusCode,
      },
      duration
    );
  });

  next();
});
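
These counters and histograms are only useful once something scrapes them. Below is a minimal sketch of exposing a scrape endpoint from prom-client's default registry, assuming the same Express app as above; the /metrics path and the collectDefaultMetrics() call are assumptions, not part of the instrumentation shown here.

typescript
// Expose a /metrics endpoint for Prometheus or the Cloud Monitoring Prometheus scraper
import { register, collectDefaultMetrics } from 'prom-client';

// Also collect default Node.js process metrics (heap usage, event loop lag, GC)
collectDefaultMetrics();

app.get('/metrics', async (_req, res) => {
  // Prometheus exposition format with the correct content type
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});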

Business Metrics:

typescript
// Quiz completion tracking
export const quizCompletions = new Counter({
  name: 'quiz_completions_total',
  help: 'Total quiz completions',
  labelNames: ['subject', 'difficulty'],
});

// AI feature usage
export const aiFeatureUsage = new Counter({
  name: 'ai_feature_usage_total',
  help: 'AI feature usage count',
  labelNames: ['feature'], // quiz_gen, essay_grade, etc
});

// Payment transactions
export const paymentTransactions = new Counter({
  name: 'payment_transactions_total',
  help: 'Payment transactions',
  labelNames: ['provider', 'status'], // mpesa/stripe, success/failed
});

// Usage
quizCompletions.inc({ subject: 'mathematics', difficulty: 'medium' });
aiFeatureUsage.inc({ feature: 'quiz_generation' });
paymentTransactions.inc({ provider: 'mpesa', status: 'success' });

Error Tracking

Sentry Integration:

typescript
// Initialize Sentry
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // 10% of requests
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Express({ app }),
  ],
});

// Capture errors
app.use(Sentry.Handlers.requestHandler());
app.use(Sentry.Handlers.tracingHandler());

// Error handler
app.use((err, req, res, next) => {
  // Log to Sentry
  Sentry.captureException(err, {
    tags: {
      route: req.route?.path,
      method: req.method,
    },
    user: {
      id: req.user?.id,
      email: req.user?.email,
    },
    extra: {
      body: req.body,
      params: req.params,
    },
  });

  // Send response
  res.status(500).json({ error: 'Internal server error' });
});

Error Categories:

Sentry Dashboard Organization:
├─ By Environment
│   ├─ Production (high priority)
│   ├─ Staging (medium priority)
│   └─ Development (low priority)
├─ By Severity
│   ├─ Fatal (immediate attention)
│   ├─ Error (high priority)
│   ├─ Warning (monitor)
│   └─ Info (log only)
└─ By Component
    ├─ API errors
    ├─ Database errors
    ├─ AI integration errors
    ├─ Payment errors
    └─ Frontend errors
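
The by-severity and by-component views above only work if errors carry consistent tags when captured. A hedged sketch using Sentry's scope API; the 'component' and 'provider' tag names are illustrative assumptions, not an established convention.

typescript
// Tag a handled error so it lands in the right Sentry views (tag names are assumptions)
import * as Sentry from '@sentry/node';

export function reportPaymentError(err: Error, provider: string) {
  Sentry.withScope((scope) => {
    scope.setLevel('error');               // feeds the "By Severity" grouping
    scope.setTag('component', 'payments'); // feeds the "By Component" grouping
    scope.setTag('provider', provider);
    Sentry.captureException(err);
  });
}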

Performance Monitoring

Real User Monitoring (RUM):

typescript
// Frontend performance tracking
export function trackPagePerformance() {
  if (typeof window === 'undefined') return;

  window.addEventListener('load', () => {
    const perfData = window.performance.timing;
    const pageLoadTime = perfData.loadEventEnd - perfData.navigationStart;
    const domReadyTime = perfData.domContentLoadedEventEnd - perfData.navigationStart;
    const ttfb = perfData.responseStart - perfData.requestStart;

    // Send to analytics
    analytics.track('page_performance', {
      page: window.location.pathname,
      loadTime: pageLoadTime,
      domReady: domReadyTime,
      ttfb,
      connection: navigator.connection?.effectiveType,
      deviceMemory: navigator.deviceMemory,
    });

    // Alert if slow
    if (pageLoadTime > 3000) {
      console.warn('Slow page load:', pageLoadTime);
    }
  });
}

// Core Web Vitals
import { onCLS, onFID, onLCP } from 'web-vitals'; // v3+ API; metric.rating is available from v3

function sendToAnalytics(metric) {
  analytics.track('web_vital', {
    name: metric.name,
    value: metric.value,
    rating: metric.rating,
    page: window.location.pathname,
  });
}

onCLS(sendToAnalytics);
onFID(sendToAnalytics);
onLCP(sendToAnalytics);

API Performance Monitoring:

typescript
// Track slow database queries
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient({
  log: [
    {
      emit: 'event',
      level: 'query',
    },
  ],
});

prisma.$on('query', (e) => {
  if (e.duration > 100) {
    // Log slow queries (>100ms)
    logger.warn('Slow query detected', {
      query: e.query,
      duration: e.duration,
      params: e.params,
    });

    // Track metric
    slowQueries.inc({
      model: extractModel(e.query),
    });
  }
});

// Track AI API latency
export async function callGeminiAPI(prompt: string) {
  const start = Date.now();

  try {
    const response = await geminiClient.generateContent(prompt);
    const duration = Date.now() - start;

    // Track metric
    aiApiDuration.observe({ status: 'success' }, duration / 1000);

    return response;
  } catch (error) {
    const duration = Date.now() - start;
    aiApiDuration.observe({ status: 'error' }, duration / 1000);
    throw error;
  }
}
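
The snippets above reference slowQueries and aiApiDuration without defining them. A sketch of how they could be declared with prom-client, consistent with the instrumentation earlier on this page; the metric names and buckets are assumptions.

typescript
// Assumed definitions for the metrics referenced above (names and buckets are illustrative)
import { Counter, Histogram } from 'prom-client';

export const slowQueries = new Counter({
  name: 'slow_queries_total',
  help: 'Database queries slower than 100ms',
  labelNames: ['model'],
});

export const aiApiDuration = new Histogram({
  name: 'ai_api_duration_seconds',
  help: 'Gemini API call duration',
  labelNames: ['status'],
  buckets: [1, 2, 5, 10, 30, 60], // AI calls are slow; wider buckets than HTTP requests
});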

Log Aggregation

Structured Logging:

typescript
// Winston logger configuration
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'yebolearn-api',
    environment: process.env.NODE_ENV,
  },
  transports: [
    process.env.NODE_ENV === 'production'
      ? // JSON to stdout in production; Cloud Run forwards it to Cloud Logging
        new winston.transports.Console({ format: winston.format.json() })
      : // Colorized, human-readable output for local development
        new winston.transports.Console({
          format: winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          ),
        }),
  ],
});

// Usage with context
logger.info('Quiz completed', {
  userId: 'user-123',
  quizId: 'quiz-456',
  score: 85,
  duration: 1200,
});

logger.error('Payment failed', {
  userId: 'user-123',
  amount: 500,
  provider: 'mpesa',
  error: 'Insufficient funds',
  transactionId: 'txn-789',
});

logger.warn('High API latency', {
  endpoint: '/api/student/dashboard',
  duration: 850,
  threshold: 500,
});

Log Levels:

Production (LOG_LEVEL=warn):
ERROR: Critical errors, failures
WARN: Potential issues, degraded performance
(INFO and DEBUG disabled in production)

Staging (LOG_LEVEL=info):
ERROR: Critical errors
WARN: Warnings
INFO: Important events (user actions, API calls)
(DEBUG disabled)

Development (LOG_LEVEL=debug):
ERROR: All errors
WARN: All warnings
INFO: All significant events
DEBUG: Detailed debugging information
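
One way to keep these levels consistent without setting LOG_LEVEL by hand everywhere is to derive a default from NODE_ENV. A small sketch; the mapping simply mirrors the table above and is not mandated elsewhere on this page.

typescript
// Derive a default log level from the environment (sketch; falls back to 'info')
const defaultLevels: Record<string, string> = {
  production: 'warn',
  staging: 'info',
  development: 'debug',
};

export const logLevel =
  process.env.LOG_LEVEL ||
  defaultLevels[process.env.NODE_ENV ?? ''] ||
  'info';

// Pass into the Winston config shown earlier: createLogger({ level: logLevel, ... })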

Distributed Tracing

Google Cloud Trace:

typescript
// Trace API requests
import * as traceAgent from '@google-cloud/trace-agent';

// Initialize (must run before other modules load for automatic HTTP tracing)
const tracer = traceAgent.start({
  projectId: 'yebolearn-prod',
  samplingRate: 10, // traces per second, not a percentage of requests
});

// Automatic tracing for HTTP requests
// Manual spans for specific operations
export async function generateQuiz(topic: string) {
  const span = tracer.createChildSpan({ name: 'generateQuiz' });

  try {
    // Call Gemini API
    const quizSpan = tracer.createChildSpan({ name: 'gemini-api-call' });
    const quiz = await geminiClient.generateContent(`Generate a quiz about: ${topic}`);
    quizSpan.endSpan();

    // Save to database
    const dbSpan = tracer.createChildSpan({ name: 'save-quiz' });
    await db.quiz.create({ data: quiz });
    dbSpan.endSpan();

    return quiz;
  } finally {
    span.endSpan();
  }
}

Trace Analysis:

Request Trace Example:
GET /api/student/dashboard
Total: 450ms
├─ Authentication middleware: 25ms
├─ Fetch student data: 180ms
│   ├─ Database query: 145ms
│   └─ Cache lookup: 35ms
├─ Fetch enrollments: 120ms
│   └─ Database query: 115ms
├─ Calculate progress: 85ms
│   └─ Aggregation logic: 80ms
└─ Serialize response: 40ms

Insights:
- Database queries taking 260ms (58%)
- Opportunity: Add caching for enrollments
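
A hedged sketch of the caching opportunity noted above, assuming an ioredis client and a hypothetical getEnrollmentsFromDb helper; the key format and 5-minute TTL are illustrative.

typescript
// Cache-aside lookup for enrollments (redis client and DB helper are assumptions)
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

export async function getEnrollments(studentId: string) {
  const cacheKey = `enrollments:${studentId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const enrollments = await getEnrollmentsFromDb(studentId); // hypothetical DB helper
  await redis.set(cacheKey, JSON.stringify(enrollments), 'EX', 300); // 5-minute TTL
  return enrollments;
}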

Dashboards

Executive Dashboard

High-Level Metrics (Grafana):

YeboLearn Platform Health
┌─────────────────────────────────────────┐
│ System Status                            │
│ ✓ All Systems Operational               │
│ Uptime: 99.97% (30 days)                │
│ Active Users: 2,340                     │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│ Performance                              │
│ API Response Time: 145ms (p50)          │
│ Page Load Time: 1.9s (avg)              │
│ Error Rate: 0.3%                        │
└─────────────────────────────────────────┘

┌─────────────────────────────────────────┐
│ Business Metrics (Today)                │
│ Quiz Completions: 1,240                 │
│ AI Features Used: 340                   │
│ New Signups: 28                         │
│ Revenue: $420                           │
└─────────────────────────────────────────┘

Engineering Dashboard

Detailed Technical Metrics:

API Performance
├─ Request Rate: 45 req/s (avg), 120 req/s (peak)
├─ Response Time: p50=145ms, p95=380ms, p99=820ms
├─ Error Rate: 0.3% (target: <1%)
└─ Top Endpoints by Latency:
    1. /api/student/progress - 210ms
    2. /api/analytics/dashboard - 280ms
    3. /api/ai/generate-quiz - 8,500ms (AI feature)

Database Performance
├─ Query Time: 35ms (avg), 180ms (p95)
├─ Connections: 12/25 active
├─ Slow Queries (>100ms): 15/hour
├─ Cache Hit Rate: 94%
└─ Index Hit Rate: 98.5%

Infrastructure
├─ Cloud Run Instances: 2 (avg), 8 (max)
├─ CPU Usage: 35% (avg), 78% (peak)
├─ Memory Usage: 68% (avg), 85% (peak)
├─ Database CPU: 45% (avg), 82% (peak)
└─ Storage: 38GB/100GB (38%)

Error Tracking (Last 24 Hours)
├─ Total Errors: 45
├─ New Errors: 3
├─ Resolved Errors: 12
└─ Top Errors:
    1. Database timeout - 12 occurrences
    2. Gemini API rate limit - 8 occurrences
    3. Invalid quiz submission - 6 occurrences

AI Features Dashboard

AI-Specific Metrics:

AI Feature Performance
├─ Quiz Generation
│   ├─ Requests: 180/day
│   ├─ Avg latency: 8s
│   ├─ Success rate: 99.2%
│   ├─ Cost: $0.15/request
│   └─ Quality score: 9.1/10
├─ Essay Grading
│   ├─ Requests: 45/day
│   ├─ Avg latency: 45s
│   ├─ Success rate: 98.5%
│   ├─ Cost: $0.12/request
│   └─ Teacher approval: 87%
└─ Content Recommendations
    ├─ Requests: 2,400/day
    ├─ Avg latency: 2s
    ├─ Click-through: 34%
    └─ Cost: $0.02/request

Gemini API Usage
├─ Requests: 12,000/day
├─ Tokens In: 4.2M/day
├─ Tokens Out: 1.6M/day
├─ Total Cost: $6.50/day
├─ Rate Limits: 25/60 req/min (42%)
└─ Error Rate: 1.2%

Alerting

Alert Configuration

Critical Alerts (PagerDuty):

yaml
# API Down
- name: api_down
  condition: uptime < 99% for 2 minutes
  severity: critical
  notify: pagerduty
  escalation: immediate

# High Error Rate
- name: high_error_rate
  condition: error_rate > 5% for 3 minutes
  severity: critical
  notify: pagerduty
  escalation: after 5 minutes

# Database Down
- name: database_down
  condition: db_connections = 0 for 1 minute
  severity: critical
  notify: pagerduty + cto
  escalation: immediate

# Payment Processing Failed
- name: payment_failures
  condition: payment_failure_rate > 10% for 2 minutes
  severity: critical
  notify: pagerduty + finance
  escalation: after 10 minutes

Warning Alerts (Slack #engineering):

yaml
# Slow API Response
- name: slow_api
  condition: p95_latency > 1s for 10 minutes
  severity: warning
  notify: slack
  message: "API response time elevated: {{value}}ms"

# High Memory Usage
- name: high_memory
  condition: memory_usage > 80% for 15 minutes
  severity: warning
  notify: slack
  message: "Memory usage: {{value}}%"

# Increased Error Rate
- name: elevated_errors
  condition: error_rate > 2% for 10 minutes
  severity: warning
  notify: slack
  message: "Error rate elevated: {{value}}%"

# AI API Rate Limit Approaching
- name: gemini_rate_limit
  condition: gemini_requests > 50/min for 5 minutes
  severity: warning
  notify: slack
  message: "Approaching Gemini API rate limit: {{value}} req/min"

Alert Best Practices:

Effective Alerts:
✓ Actionable (team can fix)
✓ Specific (clear what's wrong)
✓ Timely (detect before users)
✓ Relevant (not noise)

Alert Fatigue Prevention:
✓ Group related alerts
✓ Deduplicate similar alerts
✓ Adjust thresholds based on patterns
✓ Auto-resolve when issue clears
✓ Review and tune monthly

On-Call Rotation

Schedule:

Weekly rotation:
- Week 1: Sarah
- Week 2: John
- Week 3: Lisa
- Week 4: Mark

On-call responsibilities:
- Respond to PagerDuty alerts (24/7)
- Triage and resolve P0/P1 incidents
- Escalate if needed
- Document incident in postmortem
- Handoff status to next on-call

On-call capacity:
- Protected from sprint commitments
- Focus on monitoring and incidents
- Handle urgent bugs and hotfixes

Uptime Tracking

External Monitoring

UptimeRobot Configuration:

Monitored Endpoints:
├─ https://api.yebolearn.app/health
│   ├─ Check interval: 1 minute
│   ├─ Timeout: 30 seconds
│   └─ Expected: 200 OK + "healthy" in response
├─ https://yebolearn.app
│   ├─ Check interval: 5 minutes
│   ├─ Timeout: 30 seconds
│   └─ Expected: 200 OK
└─ https://api.yebolearn.app/api/v1/status
    ├─ Check interval: 5 minutes
    ├─ Timeout: 10 seconds
    └─ Expected: 200 OK + valid JSON

Notifications:
- Alert on: Down for 2 minutes
- Notify: PagerDuty + Slack
- Escalation: Email team lead after 10 minutes

Health Check Endpoint

typescript
// Comprehensive health check
export async function healthCheck(): Promise<HealthStatus> {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkGeminiAPI(),
    checkEmailService(),
    checkPaymentGateway(),
  ]);

  const results = {
    database: getCheckResult(checks[0]),
    redis: getCheckResult(checks[1]),
    gemini: getCheckResult(checks[2]),
    email: getCheckResult(checks[3]),
    payment: getCheckResult(checks[4]),
  };

  const allHealthy = Object.values(results).every(
    r => r.status === 'healthy'
  );

  return {
    status: allHealthy ? 'healthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION,
    uptime: process.uptime(),
    checks: results,
  };
}

async function checkDatabase(): Promise<CheckResult> {
  try {
    const start = Date.now();
    await db.$queryRaw`SELECT 1`;
    const latency = Date.now() - start;

    return {
      status: 'healthy',
      latency: `${latency}ms`,
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      error: error.message,
    };
  }
}
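
A short sketch of wiring healthCheck() into the endpoint UptimeRobot probes, assuming the same Express app as earlier; returning 503 when degraded is an assumption, chosen so the 200-OK expectation above actually reflects health.

typescript
// Wire the health check into the monitored endpoint
app.get('/health', async (_req, res) => {
  const health = await healthCheck();
  // Return 503 when degraded so uptime checks expecting 200 OK fail fast
  res.status(health.status === 'healthy' ? 200 : 503).json(health);
});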

Uptime Targets

SLA: 99.9% uptime (three nines)
Allowed downtime: 43 minutes/month

Current performance:
- Last 30 days: 99.97% (13 min downtime)
- Last 90 days: 99.95% (65 min downtime)
- Last 12 months: 99.93% (6.1 hours downtime)

Status: ✓ Exceeding SLA
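
The allowed-downtime figures follow directly from the SLA: allowed downtime = (1 − SLA) × period. A tiny sketch of that arithmetic.

typescript
// Allowed downtime for a given SLA over a period, in minutes
function allowedDowntimeMinutes(sla: number, periodDays: number): number {
  return (1 - sla) * periodDays * 24 * 60;
}

allowedDowntimeMinutes(0.999, 30);  // ≈ 43.2 minutes per 30-day month
allowedDowntimeMinutes(0.999, 365); // ≈ 525.6 minutes per year (~8.8 hours)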

Incident Management

Incident Response Process

1. Detection (Target: <2 min)

  • Alert fires
  • On-call engineer paged
  • Initial triage begins

2. Assessment (Target: <5 min)

  • Determine severity (P0-P3)
  • Identify affected systems
  • Estimate user impact

3. Response (Target: <1 hour)

  • Mitigate immediate impact
  • Apply fix or rollback
  • Communicate status

4. Resolution

  • Verify fix deployed
  • Monitor for recurrence
  • Update status page

5. Postmortem (Within 48 hours)

  • Document incident timeline
  • Root cause analysis
  • Action items to prevent recurrence

Incident Severity Levels

P0 - Critical:
- Complete service outage
- Data loss risk
- Security breach
Response: Immediate, all hands on deck
Example: API completely down, database corruption

P1 - High:
- Major feature broken
- Payment processing down
- Significant user impact
Response: Within 15 minutes
Example: Quiz submissions failing, M-Pesa integration down

P2 - Medium:
- Minor feature degraded
- Performance issues
- Moderate user impact
Response: Within 2 hours
Example: Slow dashboard, AI features timing out

P3 - Low:
- Cosmetic issues
- Minor bugs
- Minimal user impact
Response: Next business day
Example: Typo in UI, broken link in email
