Monitoring & Observability

Proper monitoring is essential for maintaining a healthy Vemetric deployment. This guide covers logging, metrics, queue monitoring, and troubleshooting.

Logging

Application Logs

Vemetric uses Pino for structured JSON logging across all services.

App Service

# View logs (Docker)
docker logs -f vemetric-app

# View logs (local)
bun dev
# Logs appear in terminal

Log levels:

trace: Detailed debug info
debug: Debug information
info: General information
warn: Warning messages
error: Error messages
fatal: Fatal errors

Hub Service

# View Hub logs
docker logs -f vemetric-hub

Key log entries:

Event ingestion requests
Bot detection
Project validation
Geolocation lookups
Queue job creation

Worker Service

# View Worker logs
docker logs -f vemetric-worker

Key log entries:

Job processing start/completion
Job failures with stack traces
Worker initialization
Database write operations

Log Format

Pino outputs structured JSON logs:

{
  "level": 30,
  "time": 1709582400000,
  "pid": 12345,
  "hostname": "app-server",
  "msg": "Event received",
  "projectId": "abc123",
  "eventName": "page_view",
  "userId": "user-123"
}
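The numeric `level` field follows Pino's default scale (10 = trace up to 60 = fatal). As a minimal sketch for scanning raw logs, the snippet below maps those numbers back to labels and prints a one-line summary. The `formatLogLine` helper and its output layout are illustrative assumptions, not part of Vemetric.

```typescript
// Pino's default numeric levels and their labels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace",
  20: "debug",
  30: "info",
  40: "warn",
  50: "error",
  60: "fatal",
};

interface PinoLine {
  level: number;
  time: number; // epoch milliseconds
  msg: string;
  [key: string]: unknown;
}

// Hypothetical helper: turn one raw JSON log line into a short summary.
function formatLogLine(raw: string): string {
  const entry = JSON.parse(raw) as PinoLine;
  const label = PINO_LEVELS[entry.level] ?? `level-${entry.level}`;
  const timestamp = new Date(entry.time).toISOString();
  return `${timestamp} ${label.toUpperCase()} ${entry.msg}`;
}

// Sample line shortened from the example above.
const sample =
  '{"level":30,"time":1709582400000,"pid":12345,"hostname":"app-server","msg":"Event received","projectId":"abc123"}';
console.log(formatLogLine(sample));
```

The timestamp is printed in UTC, so it can differ from the local time shown by pino-pretty.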

Pretty Printing (Development)

In development, logs are formatted with pino-pretty:

[2024-03-04 12:00:00.000] INFO: Event received
  projectId: "abc123"
  eventName: "page_view"
  userId: "user-123"

Log Aggregation (Production)

For production, ship logs to a centralized service:

Axiom

Vemetric includes optional Axiom integration. Set the dataset and token:

.env

AXIOM_DATASET=vemetric-logs
AXIOM_TOKEN=your-axiom-token

The @axiomhq/pino transport is already included in the dependencies, so no separate install is needed.

Elasticsearch

Use Filebeat or Fluentd to ship logs to Elasticsearch. With Fluentd, point the Docker logging driver at the Fluentd listener:

docker-compose.yml

app:
  logging:
    driver: "fluentd"
    options:
      fluentd-address: localhost:24224
      tag: vemetric.app

CloudWatch

For AWS deployments:

docker-compose.yml

app:
  logging:
    driver: "awslogs"
    options:
      awslogs-region: us-east-1
      awslogs-group: vemetric
      awslogs-stream: app

Queue Monitoring

BullBoard UI

Vemetric includes Bull Board for real-time queue monitoring.

Access BullBoard

Navigate to:

http://localhost:4100

Username: BULLBOARD_USERNAME (default: bullboard)
Password: BULLBOARD_PASSWORD (default: password)

Monitor Queues

BullBoard shows all queues:

event-queue: Event processing
session-queue: Session aggregation
user-queue: User updates
device-queue: Device tracking
email-queue: Email delivery
first-event-queue: First event handling
enrich-user-queue: User enrichment
merge-user-queue: User merging

View Job Details

For each queue, you can:

View active, waiting, completed, and failed jobs
Inspect job data and results
View error stack traces for failed jobs
Retry or delete individual jobs
Pause/resume queues

Secure BullBoard with strong credentials and restrict network access in production. It provides full access to job data and queue controls.
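One way to enforce this is a startup check that refuses the documented default credentials. The sketch below is a hypothetical helper, not something Vemetric ships; the 12-character minimum is an arbitrary assumption.

```typescript
// Defaults documented above; shipping with them leaves Bull Board effectively open.
const DEFAULT_USERNAME = "bullboard";
const DEFAULT_PASSWORD = "password";

type Env = Record<string, string | undefined>;

// Hypothetical startup check: returns a list of problems, empty when the
// credentials look acceptable. The 12-character minimum is an arbitrary choice.
function checkBullBoardCredentials(env: Env): string[] {
  const problems: string[] = [];
  const username = env.BULLBOARD_USERNAME ?? DEFAULT_USERNAME;
  const password = env.BULLBOARD_PASSWORD ?? DEFAULT_PASSWORD;

  if (username === DEFAULT_USERNAME) {
    problems.push("BULLBOARD_USERNAME is unset or still the default");
  }
  if (password === DEFAULT_PASSWORD) {
    problems.push("BULLBOARD_PASSWORD is unset or still the default");
  } else if (password.length < 12) {
    problems.push("BULLBOARD_PASSWORD is shorter than 12 characters");
  }
  return problems;
}

// Example: a custom username with a password that is too short.
console.log(checkBullBoardCredentials({ BULLBOARD_USERNAME: "ops", BULLBOARD_PASSWORD: "short" }));
```

In a real service you would pass the process environment and exit when the returned list is not empty.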

Queue Metrics

Monitor queue health with Redis CLI:

# Connect to Redis
docker exec -it vemetric-redis redis-cli

# Count jobs in event queue
> LLEN bull:event-queue:wait

# View queue stats
> HGETALL bull:event-queue:meta

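# List all queue keys (KEYS blocks Redis while it scans the keyspace)
> KEYS bull:*:wait

# On busy instances prefer SCAN; repeat with the returned cursor until it is 0
> SCAN 0 MATCH bull:*:wait COUNT 100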

Failed Jobs

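Failed jobs are automatically stored in PostgreSQL. PostgreSQL folds unquoted identifiers to lowercase, so if the columns were created with camelCase names, wrap them in double quotes (for example "createdAt"):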

SELECT 
  id,
  queueName,
  createdAt,
  error,
  data
FROM failed_queue_job
ORDER BY createdAt DESC
LIMIT 10;

This helps debug persistent failures.
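To spot which queue and error dominate, the rows returned by the query above can be grouped in application code. This is a sketch under assumptions: the `FailedJobRow` field types are guesses based on the selected column names, and the sample rows are made up.

```typescript
// Row shape mirrors the columns selected in the query above (types assumed).
interface FailedJobRow {
  id: string;
  queueName: string;
  createdAt: string; // ISO timestamp
  error: string;
}

interface FailureSummary {
  queueName: string;
  error: string;
  count: number;
}

// Count failures per (queue, error) pair, most frequent first.
function summarizeFailures(rows: FailedJobRow[]): FailureSummary[] {
  const counts = new Map<string, FailureSummary>();
  for (const row of rows) {
    const key = `${row.queueName}\u0000${row.error}`;
    const entry = counts.get(key) ?? { queueName: row.queueName, error: row.error, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  }
  return Array.from(counts.values()).sort((a, b) => b.count - a.count);
}

// Made-up rows for illustration only.
const rows: FailedJobRow[] = [
  { id: "1", queueName: "event-queue", createdAt: "2024-03-04T12:00:00.000Z", error: "timeout" },
  { id: "2", queueName: "email-queue", createdAt: "2024-03-04T12:01:00.000Z", error: "SMTP refused" },
  { id: "3", queueName: "event-queue", createdAt: "2024-03-04T12:02:00.000Z", error: "timeout" },
];
console.log(summarizeFailures(rows));
```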

Database Monitoring

PostgreSQL

Connection Count

SELECT 
  count(*) as connections,
  state
FROM pg_stat_activity
WHERE datname = 'vemetric'
GROUP BY state;

Database Size

SELECT 
  pg_size_pretty(pg_database_size('vemetric')) as size;

Table Sizes

SELECT 
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Slow Queries

Enable query logging in postgresql.conf:

log_min_duration_statement = 1000  # Log queries > 1s

View the slowest statements (requires the pg_stat_statements extension to be loaded and created in the database):

SELECT 
  query,
  calls,
  total_exec_time,
  mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

ClickHouse

Table Statistics

SELECT
  table,
  formatReadableSize(sum(bytes)) AS size,
  formatReadableQuantity(sum(rows)) AS rows,
  count() AS parts
FROM system.parts
WHERE database = 'vemetric' AND active
GROUP BY table
ORDER BY sum(bytes) DESC;

Query Performance

SELECT
  query,
  formatReadableSize(memory_usage) AS memory,
  query_duration_ms AS duration_ms,
  read_rows,
  formatReadableSize(read_bytes) AS read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
ORDER BY query_duration_ms DESC
LIMIT 10;

Merge Performance

SELECT
  table,
  elapsed,
  progress,
  formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
WHERE database = 'vemetric';

Disk Usage

SELECT
  name,
  path,
  formatReadableSize(free_space) AS free,
  formatReadableSize(total_space) AS total
FROM system.disks;

Redis

# Connect to Redis
docker exec -it vemetric-redis redis-cli

# Memory stats
> INFO memory

# Keyspace stats
> INFO keyspace

# Client connections
> CLIENT LIST

# Slow log
> SLOWLOG GET 10

Health Checks

Service Health Endpoints

App Service

curl http://localhost:4000/api/health

Response:

{
  "status": "ok",
  "timestamp": "2024-03-04T12:00:00.000Z"
}

Hub Service

curl http://localhost:4004/health

Response:

{
  "status": "ok"
}
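The same checks can be scripted. The sketch below polls both endpoints and treats a service as healthy only when it answers with a 2xx status and `"status": "ok"`. The `Fetcher` type, the 5-second timeout, and the helper names are assumptions made for illustration; the fetch function is injected so the logic can be exercised without a running stack.

```typescript
// Endpoints and ports as documented above.
const HEALTH_ENDPOINTS: Record<string, string> = {
  app: "http://localhost:4000/api/health",
  hub: "http://localhost:4004/health",
};

// Minimal subset of the fetch API that this sketch relies on.
type Fetcher = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// A service counts as healthy when it answers 2xx with {"status": "ok"} in time.
async function isHealthy(url: string, fetcher: Fetcher, timeoutMs = 5000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetcher(url, { signal: controller.signal });
    if (!response.ok) return false;
    const body = (await response.json()) as { status?: string };
    return body.status === "ok";
  } catch {
    return false; // connection refused, timeout, or invalid JSON
  } finally {
    clearTimeout(timer);
  }
}

// Check every documented endpoint and report one boolean per service.
async function checkAll(fetcher: Fetcher): Promise<Record<string, boolean>> {
  const results: Record<string, boolean> = {};
  for (const [name, url] of Object.entries(HEALTH_ENDPOINTS)) {
    results[name] = await isHealthy(url, fetcher);
  }
  return results;
}

// Against a running stack, pass the global fetch: checkAll(fetch).then(console.log)
```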

Docker Health Checks

Add health checks to your Docker Compose:

docker-compose.yml

services:
  app:
    image: vemetric-app
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  
  hub:
    image: vemetric-hub
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4004/health"]
      interval: 30s
      timeout: 10s
      retries: 3
  
  postgres:
    image: postgres:17-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
  
  clickhouse:
    image: clickhouse/clickhouse-server:23.10-alpine
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8123/ping"]
      interval: 30s
      timeout: 10s
      retries: 3
  
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

Check health status:

docker-compose ps

Metrics & Dashboards

ClickHouse Metrics

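ClickHouse can expose Prometheus metrics once the prometheus endpoint is enabled in its server configuration. The examples here assume it listens on port 9363: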

curl http://localhost:9363/metrics
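The endpoint returns the Prometheus text exposition format: comment lines starting with `#`, then one `name value` pair per line. As a rough sketch for ad-hoc checks, the parser below reads unlabelled series into a map. The metric names in the sample are illustrative and may not match what your ClickHouse version emits.

```typescript
// Parse Prometheus text exposition lines that carry no labels:
// "metric_name value [timestamp]". Comment lines start with "#".
function parseMetrics(text: string): Map<string, number> {
  const metrics = new Map<string, number>();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const [name, value] = line.split(/\s+/);
    if (name.includes("{")) continue; // labelled series are out of scope for this sketch
    const parsed = Number(value);
    if (!Number.isNaN(parsed)) metrics.set(name, parsed);
  }
  return metrics;
}

// Illustrative scrape output; names and values are assumptions.
const sampleScrape = [
  "# HELP ClickHouseMetrics_Query Number of executing queries",
  "# TYPE ClickHouseMetrics_Query gauge",
  "ClickHouseMetrics_Query 3",
  "ClickHouseMetrics_Merge 1",
].join("\n");
console.log(parseMetrics(sampleScrape).get("ClickHouseMetrics_Query"));
```

For dashboards and alerting, let Prometheus scrape the endpoint directly; a parser like this is only useful for quick scripts.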

Prometheus + Grafana Setup

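Add Prometheus and Grafana to the services in your docker-compose.yml. The snippet expects a prometheus.yml next to the compose file and the named volumes prometheus_data and grafana_data declared under the top-level volumes key: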

prometheus:
  image: prom/prometheus:latest
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - prometheus_data:/prometheus
  ports:
    - "9090:9090"

grafana:
  image: grafana/grafana:latest
  volumes:
    - grafana_data:/var/lib/grafana
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin  # placeholder; set a strong password before exposing Grafana

Key Metrics to Monitor

Event Ingestion Rate

Track events/second through Hub service. Set up alerts for drops or spikes.

Queue Depth

Monitor BullMQ queue sizes. High depth indicates worker saturation.

Database Query Time

Track P95/P99 query latency for PostgreSQL and ClickHouse.

Error Rate

Monitor 4xx/5xx error rates in App and Hub services.

Memory Usage

Track Redis memory and ClickHouse memory usage.

Disk Space

Monitor disk usage for PostgreSQL, ClickHouse, and Redis volumes.
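For the query-time metric above, P95 and P99 can be computed from raw latency samples. The sketch below uses the nearest-rank method, which is one common definition among several. The sample latencies are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up query latencies in milliseconds, including two slow outliers.
const latenciesMs = [12, 15, 11, 240, 14, 13, 18, 16, 950, 17];
console.log({
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
});
```

With only ten samples the high percentiles collapse onto the slowest query, which is why tail latency is best tracked over larger windows.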

Error Tracking

Sentry Integration

Vemetric includes optional Sentry integration for error tracking:

.env

SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id

Sentry is already integrated via @sentry/bun in:

App service
Hub service
Worker service

Errors and exceptions are automatically reported to Sentry.

Alerting

Set up alerts for critical conditions:

Queue Depth Alerts

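Alert when queue depth exceeds a threshold. The bull_queue_waiting_jobs metric assumes a BullMQ exporter is being scraped by Prometheus; adjust the name to match the exporter you run: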

prometheus-alerts.yml

groups:
  - name: vemetric
    rules:
      - alert: HighQueueDepth
        expr: bull_queue_waiting_jobs > 1000
        for: 5m
        annotations:
          summary: "High queue depth detected"

Database Disk Space

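Alert when disk usage exceeds 80%, that is, when less than 20% of the filesystem is available. This rule relies on node_exporter metrics and belongs under the same rules list: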

- alert: LowDiskSpace
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.2
  for: 10m
  annotations:
    summary: "Low disk space on database server"

Service Down

Alert when Prometheus can no longer scrape the service (assumes a scrape job named vemetric-app):

- alert: ServiceDown
  expr: up{job="vemetric-app"} == 0
  for: 2m
  annotations:
    summary: "Vemetric App service is down"
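The `for` field in these rules means the condition must hold continuously for that long before the alert fires; a single dip below the threshold restarts the timer. The sketch below is a simplified model of that behaviour, not Prometheus's implementation, and it assumes one sample per minute.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns the timestamp at which the alert would fire, or null if the
// condition never holds for the full `forMs` window.
function firingAt(samples: Sample[], threshold: number, forMs: number): number | null {
  let pendingSince: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (pendingSince === null) pendingSince = sample.timestampMs;
      if (sample.timestampMs - pendingSince >= forMs) return sample.timestampMs;
    } else {
      pendingSince = null; // condition cleared, the timer restarts
    }
  }
  return null;
}

// Made-up queue depths, one sample per minute. The dip at minute 2 resets the timer.
const MINUTE = 60_000;
const depths = [1200, 1500, 800, 1100, 1300, 1400, 1600, 1700, 1800];
const series: Sample[] = depths.map((value, i) => ({ timestampMs: i * MINUTE, value }));
console.log(firingAt(series, 1000, 5 * MINUTE));
```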

Troubleshooting

Common Issues

High queue depth

Symptoms: BullBoard shows thousands of waiting jobs

Causes:

Worker service not running
Worker overwhelmed by job volume
Database connection issues

Solutions:

Check worker logs: docker logs vemetric-worker
Verify database connectivity
Scale worker horizontally (run multiple instances)
Increase worker concurrency in worker configuration
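Before scaling workers, it helps to know which queues are actually backed up. The sketch below checks the `bull:<queue>:wait` list length for every queue named in the Bull Board section. The `llen` function is injected so any Redis client can be plugged in; the helper names and the threshold are assumptions.

```typescript
// Queue names as listed in the Bull Board section.
const QUEUES = [
  "event-queue", "session-queue", "user-queue", "device-queue",
  "email-queue", "first-event-queue", "enrich-user-queue", "merge-user-queue",
];

// Redis key of the waiting list, matching the LLEN example in Queue Metrics.
function waitKey(queue: string): string {
  return `bull:${queue}:wait`;
}

// `llen` is injected so the check works with any Redis client
// (or with a stub, as in the usage example below).
async function findBackloggedQueues(
  llen: (key: string) => Promise<number>,
  threshold: number,
): Promise<Record<string, number>> {
  const backlogged: Record<string, number> = {};
  for (const queue of QUEUES) {
    const depth = await llen(waitKey(queue));
    if (depth > threshold) backlogged[queue] = depth;
  }
  return backlogged;
}

// Usage with a stubbed client: only event-queue exceeds the threshold.
const fakeDepths: Record<string, number> = { "bull:event-queue:wait": 2500 };
findBackloggedQueues(async (key) => fakeDepths[key] ?? 0, 1000).then(console.log);
```

The waiting list does not include delayed or active jobs, so treat the result as a lower bound on outstanding work.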

Events not appearing in dashboard

Symptoms: Events sent but not visible in analytics

Debugging:

Check Hub logs for event receipt: docker logs vemetric-hub
Verify project token is correct
Check BullBoard for job processing

Query ClickHouse directly:

SELECT * FROM event ORDER BY createdAt DESC LIMIT 10;

Check Worker logs for errors

Slow dashboard queries

Symptoms: Dashboard takes >5 seconds to load

Solutions:

Check ClickHouse query performance:

SELECT query, query_duration_ms FROM system.query_log
WHERE type = 'QueryFinish' ORDER BY query_duration_ms DESC LIMIT 5;

Optimize slow queries with materialized views
Reduce date range for large datasets
Add indexes if needed
Scale ClickHouse vertically (more CPU/RAM)

Database connection pool exhausted

Symptoms: “too many clients” or “connection pool timeout” errors

Solutions:

Add PgBouncer for PostgreSQL connection pooling

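Increase PostgreSQL max_connections. This setting only takes effect after a server restart; pg_reload_conf() is not enough:

ALTER SYSTEM SET max_connections = 200;
-- Restart the PostgreSQL server to apply the new value.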

Review app connection pool settings
Check for connection leaks in application code

ClickHouse out of memory

Symptoms: ClickHouse queries fail with memory errors

Solutions:

Increase the ClickHouse memory limit (max_server_memory_usage in config.xml, or the per-query max_memory_usage setting)
Optimize queries to process less data
Add LIMIT clauses to queries
Use sampling for large datasets (the table must be created with a SAMPLE BY clause):
```
SELECT ... FROM event SAMPLE 0.1
```
Scale ClickHouse vertically

Debug Mode

Enable verbose logging:

.env

LOG_LEVEL=debug

This increases log verbosity across all services.

Performance Tuning

Worker Concurrency

Increase worker job concurrency:

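import { Worker } from 'bullmq';
// `processor` is your async job handler: (job) => Promise<result>
const worker = new Worker('queue-name', processor, {
  connection: { host: 'localhost', port: 6379 }, // assumed Redis address
  concurrency: 10, // process up to 10 jobs concurrently
});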

Redis Maxmemory

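Configure Redis memory limits. Because this Redis instance holds the BullMQ queues, use the noeviction policy; an evicting policy such as allkeys-lru can silently drop queue keys and lose jobs:

maxmemory 2gb
maxmemory-policy noeviction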
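To see how close Redis is to the configured limit, the reply of `INFO memory` can be parsed for `used_memory` and `maxmemory`. The sketch below works on the raw reply text, so it does not depend on a particular client library. The helper names are hypothetical and the sample numbers are made up.

```typescript
// Parse the "key:value" lines of a Redis INFO reply into a map.
function parseRedisInfo(reply: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const line of reply.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue; // skip blanks and section headers
    const separator = line.indexOf(":");
    if (separator > 0) fields.set(line.slice(0, separator), line.slice(separator + 1));
  }
  return fields;
}

// Fraction of maxmemory in use, or null when no limit is configured (maxmemory = 0).
function memoryUsageRatio(reply: string): number | null {
  const info = parseRedisInfo(reply);
  const used = Number(info.get("used_memory"));
  const max = Number(info.get("maxmemory"));
  if (!Number.isFinite(used) || !Number.isFinite(max) || max === 0) return null;
  return used / max;
}

// Abbreviated sample reply; the numbers are made up for illustration.
const sampleReply =
  "# Memory\r\nused_memory:1073741824\r\nmaxmemory:2147483648\r\nmaxmemory_policy:noeviction\r\n";
console.log(memoryUsageRatio(sampleReply));
```

A ratio that keeps climbing toward 1 means Redis will soon reject writes or evict keys, depending on the configured policy.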

ClickHouse Compression

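ClickHouse compresses MergeTree data by default (LZ4 on self-managed servers); the storage_policy setting selects disks and volumes, not compression. To trade some CPU for a better compression ratio, set a column codec (replace <column> with a real column name):

ALTER TABLE event MODIFY COLUMN <column> CODEC(ZSTD(3));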

Database Indexes

Add indexes for frequent queries:

CREATE INDEX idx_project_created 
ON event(projectId, createdAt);
