
AI Education Platform

Government-funded platform for healthcare AI training in Korea, replacing expensive cloud services with on-premise GPU infrastructure. Built with NestJS and FastAPI to manage 40 concurrent students across 4 Tesla V100 GPUs partitioned via NVIDIA MIG. Features isolated Jupyter environments, unlimited GPU access for medical dataset training, custom Prometheus monitoring for GPU utilization, and role-based access to shared/private datasets. Solved the challenge of providing secure, cost-effective AI education at scale.

2024 · 5 min read
[Screenshot: platform interface showing the Jupyter notebook environment with GPU resource monitoring, medical dataset access, and student workspace management for 40 concurrent users]

The Challenge

A government-funded healthcare AI research initiative in Korea needed a platform for training medical professionals on AI/ML using specialized medical datasets. Commercial cloud solutions like Google Colab couldn't provide unlimited GPU access, couldn't host proprietary medical data securely, and lacked the isolated multi-user environment required for structured training programs.

This was part of a national research project focused on advancing healthcare AI capabilities. The platform needed to support hands-on training sessions where 40 students could simultaneously work with GPU-intensive deep learning models on prepared medical datasets, all while maintaining data isolation and security.

Key Constraints

  • Support exactly 40 concurrent students (limited by 4x Tesla V100 GPUs)
  • Provide isolated workspaces with shared read-only datasets and private user folders
  • Enable unlimited GPU usage during training (unlike commercial cloud limits)
  • Secure handling of proprietary medical training datasets
  • Real-time monitoring of GPU utilization across MIG partitions
  • Cost-effective alternative to commercial cloud services

Our Approach

Built a self-hosted Jupyter platform with dual-service architecture: NestJS GraphQL API for user management and session control, and FastAPI service for Docker container lifecycle and GPU allocation. Used NVIDIA MIG (Multi-Instance GPU) to partition 4 Tesla V100 GPUs into isolated instances for fair resource distribution.
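As an illustration of the fair-allocation idea, here is a minimal Python sketch of a partition pool. `MigAllocator` and its field names are hypothetical, not the platform's actual code; it assumes one MIG device UUID per student session.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MigAllocator:
    """Tracks which MIG partitions are free: one partition per student session."""

    free: set[str]  # unassigned MIG device UUIDs
    assigned: dict[str, str] = field(default_factory=dict)  # user_id -> MIG UUID
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def acquire(self, user_id: str) -> str:
        # The lock is load-bearing: concurrent session starts can otherwise
        # hand the same partition to two students (see "What We Learned").
        async with self._lock:
            if user_id in self.assigned:  # idempotent re-entry
                return self.assigned[user_id]
            if not self.free:
                raise RuntimeError("all 40 MIG instances are in use")
            uuid = self.free.pop()
            self.assigned[user_id] = uuid
            return uuid

    async def release(self, user_id: str) -> None:
        async with self._lock:
            uuid = self.assigned.pop(user_id, None)
            if uuid is not None:
                self.free.add(uuid)
```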

Key Technical Decisions

  • NestJS GraphQL + FastAPI dual architecture - separation of concerns between user management and container orchestration
  • NVIDIA MIG for GPU partitioning - fair resource allocation across 40 concurrent users from 4 physical GPUs
  • Docker for environment isolation - each student gets isolated Jupyter container with mounted shared/private volumes
  • Custom Prometheus GPU exporters - built custom monitoring solution since standard tools don't support MIG partitioning
  • MongoDB for flexibility - rapid iteration on user/session schemas during government project development
  • Socket.io for real-time updates - instant container status and resource usage feedback to students and instructors

Timeline: 6 months from initial planning to production deployment with full monitoring stack

Implementation

Architecture Design & Infrastructure Setup

Designed dual-service architecture, set up GPU servers with NVIDIA MIG, configured Docker networking, and established development environment. Planned resource allocation strategy for 40 concurrent users across 4 Tesla V100 GPUs.

4-6 weeks
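One way to enumerate the partitions after configuring MIG is to parse `nvidia-smi -L`, which lists each MIG device with a `MIG-` UUID. This sketch assumes that output format, which can vary across driver versions:

```python
import re
import subprocess


def list_mig_uuids() -> list[str]:
    """Return the MIG device UUIDs visible on this host."""
    # `nvidia-smi -L` prints one line per GPU and one indented line per
    # MIG device, each ending in `(UUID: MIG-...)`.
    out = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    return re.findall(r"\(UUID:\s*(MIG-[^)]+)\)", out)
```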

Core Platform Development

Built NestJS GraphQL API with role-based authentication, user management, and session control. Developed FastAPI service with Docker SDK for container lifecycle management and GPU allocation via NVIDIA runtime.

8-10 weeks
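A hedged sketch of what the FastAPI session endpoints could look like, assuming the `MigAllocator` pool and `list_mig_uuids` helper sketched above, plus two hypothetical container helpers that the next phase sketches. Routes, models, and names are illustrative, not the platform's actual API:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
# Seed the pool from the sketches above (hypothetical wiring).
allocator = MigAllocator(free=set(list_mig_uuids()))


class SessionRequest(BaseModel):
    user_id: str


@app.post("/sessions")
async def start_session(req: SessionRequest):
    """Allocate a MIG partition, then launch the student's Jupyter container."""
    try:
        mig_uuid = await allocator.acquire(req.user_id)
    except RuntimeError as exc:
        raise HTTPException(status_code=503, detail=str(exc))
    container = launch_jupyter_container(req.user_id, mig_uuid)  # sketched below
    return {"container_id": container.id, "mig_device": mig_uuid}


@app.delete("/sessions/{user_id}")
async def stop_session(user_id: str):
    """Tear down the container and return its MIG partition to the pool."""
    remove_jupyter_container(user_id)  # hypothetical cleanup helper, sketched below
    await allocator.release(user_id)
    return {"status": "stopped"}
```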

Container Orchestration System

The most critical phase: implemented a container management system covering volume mounting (shared read-only medical datasets plus private user folders), GPU resource assignment via MIG, and automatic cleanup. Required significant debugging and optimization.

6-8 weeks
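The heart of this phase, as a minimal docker-py sketch. The image name, host paths, and memory limit are assumptions; the `NVIDIA_VISIBLE_DEVICES` environment variable is how the NVIDIA container runtime pins a container to a single MIG partition:

```python
import docker

client = docker.from_env()


def launch_jupyter_container(user_id: str, mig_uuid: str):
    """Start an isolated Jupyter container pinned to one MIG partition."""
    return client.containers.run(
        image="jupyter/tensorflow-notebook",  # assumed base image
        name=f"jupyter-{user_id}",
        detach=True,
        runtime="nvidia",
        environment={
            # The NVIDIA runtime exposes only this MIG device inside the container.
            "NVIDIA_VISIBLE_DEVICES": mig_uuid,
        },
        volumes={
            # Shared medical datasets, read-only for every student.
            "/data/shared-datasets": {"bind": "/datasets", "mode": "ro"},
            # Private per-user workspace, read-write.
            f"/data/users/{user_id}": {"bind": "/home/jovyan/work", "mode": "rw"},
        },
        mem_limit="16g",  # per-student cap (assumed value)
        labels={"platform.user": user_id},  # lets cleanup jobs find strays
    )


def remove_jupyter_container(user_id: str) -> None:
    """Force-remove the user's container; the label filter catches strays."""
    for c in client.containers.list(
        all=True, filters={"label": f"platform.user={user_id}"}
    ):
        c.remove(force=True)
```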

Custom Monitoring Solution

Built custom Prometheus exporters for MIG-aware GPU monitoring (most standard tools don't support MIG). Integrated Grafana dashboards for real-time GPU utilization, memory usage, and per-user resource tracking. Essential for managing concurrent workloads.

3-4 weeks
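The production exporter queried nvidia-smi, per the description above; as a hedged alternative, the same per-partition data is reachable through NVML. This sketch uses `prometheus_client` with the `nvidia-ml-py` (pynvml) bindings and exports only memory usage, since per-MIG memory is straightforward to read via NVML while utilization requires different APIs:

```python
import time

import pynvml  # nvidia-ml-py bindings; an assumption, the real exporter used nvidia-smi
from prometheus_client import Gauge, start_http_server

MIG_MEM_USED = Gauge(
    "mig_memory_used_bytes", "Memory used per MIG partition", ["gpu", "mig_uuid"]
)


def collect() -> None:
    for gpu_idx in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
        for mig_idx in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, mig_idx)
            except pynvml.NVMLError:
                continue  # this MIG slot is not configured
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            uuid = pynvml.nvmlDeviceGetUUID(mig)
            uuid = uuid.decode() if isinstance(uuid, bytes) else uuid
            MIG_MEM_USED.labels(gpu=str(gpu_idx), mig_uuid=uuid).set(mem.used)


if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # Prometheus scrapes this port
    while True:
        collect()
        time.sleep(15)
```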

Frontend & Real-time Features

Developed Next.js frontend with Apollo Client for GraphQL, Socket.io for real-time container status updates, Jupyter notebook management interface, and instructor dashboard for monitoring all student sessions.

4-5 weeks
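On the server side, the status pushes could look like the following python-socketio sketch. This is an assumption for illustration (event names and transport setup included); it shows how a container lifecycle event reaches the browser without polling:

```python
import socketio

# ASGI Socket.IO server that can be mounted alongside the FastAPI app.
sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")
asgi_app = socketio.ASGIApp(sio)


async def notify_container_status(user_id: str, status: str) -> None:
    """Push lifecycle updates (e.g. 'pulling', 'starting', 'ready') so students
    see GPU allocation and startup progress instantly."""
    await sio.emit("container_status", {"user_id": user_id, "status": status})
```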

Testing & Production Deployment

Load testing with 40 concurrent users, stress testing GPU allocation under peak loads, security testing for container isolation, and production deployment with SSL/Nginx reverse proxy setup.

2-3 weeks
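A small load-test sketch in the same spirit, assuming the hypothetical `/sessions` endpoint from the earlier sketch. `httpx` fires all 40 session starts concurrently, and the overlapping allocations are exactly what exposed the GPU-assignment race conditions noted under "What We Learned":

```python
import asyncio

import httpx

BASE_URL = "http://localhost:8000"  # assumed FastAPI address


async def start_one(client: httpx.AsyncClient, user_id: str) -> int:
    resp = await client.post(f"{BASE_URL}/sessions", json={"user_id": user_id})
    return resp.status_code


async def main() -> None:
    async with httpx.AsyncClient(timeout=120) as client:
        codes = await asyncio.gather(
            *(start_one(client, f"student-{i:02d}") for i in range(40))
        )
    # Summarize status codes, e.g. {200: 40} when every allocation succeeds.
    print({code: codes.count(code) for code in set(codes)})


if __name__ == "__main__":
    asyncio.run(main())
```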

System Architecture

[Diagram: platform architecture with the NestJS GraphQL API, FastAPI Docker orchestration, NVIDIA MIG GPU partitioning across 4 Tesla V100 GPUs, and the Prometheus monitoring stack]

The platform uses NVIDIA MIG to partition each Tesla V100 into multiple GPU instances (10 instances per GPU, 40 in total for 40 users). The FastAPI service manages Docker containers with the NVIDIA runtime, mounting shared datasets (read-only) and private user volumes. The NestJS GraphQL API handles authentication, user CRUD, and session management. Custom Prometheus exporters query MIG metrics via nvidia-smi, exposing per-partition GPU utilization and memory usage. Grafana dashboards visualize real-time resource consumption across all student sessions. The Next.js frontend uses Apollo Client for GraphQL queries and mutations, and Socket.io for real-time container status updates.

Technology Stack

Next.js, TypeScript, Apollo Client, Socket.io, NestJS, FastAPI, GraphQL, Docker SDK, MongoDB, Docker, Docker Compose, Nginx, SSL, Prometheus, Grafana, Node Exporter, custom GPU exporters, Jupyter Notebook, PyTorch, TensorFlow, Scikit-learn

Results & Impact

40 Concurrent Users

Successfully supported 40 students simultaneously running GPU-intensive deep learning workloads

4x Tesla V100 GPUs

Partitioned via NVIDIA MIG to provide isolated GPU instances for each student

100% GPU Utilization

Tested peak concurrent load across all GPU partitions with real medical AI training workloads

6 Months of Development

From architecture design to production deployment with monitoring stack

  • Enabled unlimited GPU training for medical AI research (no cloud quota limits)
  • Provided secure environment for proprietary medical datasets (on-premise control)
  • Replaced need for expensive commercial cloud services (Google Colab Pro, AWS SageMaker)
  • Successfully conducted multiple training sessions for healthcare professionals
  • Achieved fair resource distribution across 40 users via MIG partitioning
  • Real-time monitoring enabled proactive resource management and troubleshooting

What We Learned

  • Container orchestration at scale is complex - extensive work needed on lifecycle management, volume mounting, and resource cleanup. Docker SDK requires careful error handling.
  • NVIDIA MIG monitoring gaps - most standard monitoring tools don't support MIG, requiring custom Prometheus exporters. This was unexpected and time-consuming.
  • Microservices separation paid off - splitting user management (NestJS) from container control (FastAPI) made development and debugging much easier despite initial overhead.
  • Real-time feedback is critical - Socket.io updates for container status significantly improved UX, students could see GPU allocation and startup progress instantly.
  • Load testing is essential - simulating 40 concurrent users revealed race conditions in GPU allocation that wouldn't appear with small user counts.
  • Government projects require flexibility - MongoDB's schema flexibility was valuable during requirement changes mid-development. Postgres would have slowed iterations.

