Self-Hosted AI Infrastructure: Why I Built an AI Workstation Instead of Renting GPUs

Artificial intelligence is rapidly transforming how we build products, design systems and solve business use cases. Unfortunately, building AI infrastructure is getting increasingly expensive due to GPU shortages and skyrocketing memory prices. Developing enterprise-grade AI products often requires powerful GPUs with large amounts of VRAM. While large organizations can justify these investments, small organizations struggle to secure budgets for even modest AI infrastructure.

Consider this: renting even a single GPU such as an A100 with 80 GB of VRAM from a cloud provider costs approximately USD 30,000 per year. Business leaders naturally ask for the return on investment before approving even modest budgets.

With such high costs for AI infrastructure, how will the larger community of AI developers, researchers and product managers build financially viable solutions?

The Case for Self-Hosted AI Infrastructure

A viable strategy for many organizations is to host GPU infrastructure locally and reserve cloud resources mainly for production environments. For example:

Experimentation, proofs of concept and development run on local on-prem infrastructure.
Quality assurance follows a hybrid model: the first stage of testing runs on local infrastructure, and the second stage runs on cloud infrastructure before the final rollout.

While this approach requires an upfront investment, it can significantly reduce the total cost of ownership of development work over a few years. A typical production-grade platform runs 24/7 and therefore achieves high utilization, but a development environment does not. Renting cloud infrastructure for development yields much lower utilization, which makes the recurring cost hard to justify.

To explore this deployment model, I built a high-performance AI workstation at home that can run open-source large language models without relying on expensive cloud GPUs.

System Specifications

Hardware:

CPU: AMD Ryzen 9 9950X
GPU: RTX Pro 6000 Blackwell Workstation Edition (96 GB VRAM)
RAM: 96 GB DDR5 (expandable up to 192 GB)
Storage: 4 TB NVMe

Software:

OS: Ubuntu 24.04
Inference: vLLM, Transformers
Containerization: Docker + NVIDIA Container Toolkit
CUDA: 13.0
ML Frameworks / SDKs: Python, PyTorch, NVIDIA TensorRT

Real World AI Development

This workstation can comfortably run both Mixture of Experts (MoE) and dense models with approximately 30 billion parameters without any quantization, while still leaving ample VRAM for inference. Powered by the latest NVIDIA drivers and CUDA tools, it can also be used to build custom transformers or newer architectures to meet business needs.

Essentially, I am trying to solve real business use cases and define a scalable AI strategy using open-source models and smaller transformer implementations that do not require expensive cloud infrastructure.

One example is the Qwen3-30B-A3B base model. Running this base model at full precision consumed approximately 62 GB of VRAM, leaving the remaining 34 GB for inference. Using the vLLM inference engine, this deployment generated 5000+ tokens/sec during synthetic stress tests involving simple prompts and a prefix cache ratio of 80%. Although this does not reflect a real-world production scenario, it does help in understanding the capabilities of the platform.
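The memory figures above can be sanity-checked with simple arithmetic. The sketch below assumes that "full precision" means the 16-bit weights the model is published in (2 bytes per parameter) and that the model has roughly 30.5 billion total parameters. Both values are assumptions rather than figures taken from the measurements above, and the function names are purely illustrative.

```python
# Back-of-the-envelope VRAM estimate for holding model weights on the GPU.
# Assumptions (not from the measurements above): ~30.5e9 total parameters
# and 2 bytes per parameter (16-bit weights).


def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(total_vram_gb: float, used_gb: float) -> float:
    """VRAM left over for the KV cache, activations and runtime overhead."""
    return total_vram_gb - used_gb


estimate = weights_gb(30.5e9, 2)
print(f"Estimated weights: {estimate:.0f} GB")
print(f"Estimated headroom on a 96 GB card: {headroom_gb(96, estimate):.0f} GB")
```

The estimate covers the weights only, so a measured figure that is slightly higher is plausible once runtime overhead such as the CUDA context is included.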

Why I Built a Home AI Lab Instead of Renting GPUs

The answer is simple – cost, flexibility and ownership.

1. Long Term Cost Efficiency

Cloud-based GPUs are extremely efficient, but renting them continuously gets expensive. Spot instances can significantly reduce costs, but they introduce operational risks:

Workloads can be interrupted with little or no warning, disrupting long-running experiments.
Capacity might not be available during peak demand cycles.

Hence, a long-term strategy cannot be built on spot instances.

2. Flexibility & Freedom to Experiment

Owning the platform removes the psychological pressure of constantly monitoring the GPU utilization. I can:

Experiment with new models
Evaluate new software stacks
Pause projects for weeks at a time
Run benchmarks

All of this is possible without worrying about the monthly bill. Building native GPU applications also offers ample learning opportunities.

Infrastructure Cost Comparison

The following table compares the cost of running an RTX Pro 6000 Blackwell plus CPU compute workstation on various cloud platforms with the self-hosted model:

Deployment	Cost Per Hour	Cost Per Month	Cost Per Year
AWS / GCP / Azure	$3.3 – $4.0	$2400 – $2900	$28K – $35K
Specialized GPU Infra Providers	$1.6 – $2.2	$1150 – $1600	$14K – $19K
Self-Hosted (amortized over 3 years)	Less than $0.5	$350	$4K

Even after accounting for electricity, maintenance and depreciation, self-hosted infrastructure can provide substantial savings year after year.
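The monthly and yearly columns in the table follow from the hourly rate if the machine is assumed to run 24/7. The sketch below shows that conversion and a rough break-even calculation. The total of $12,600 is not a stated purchase price: it is simply the table's $350 per month multiplied by 36 months, used here as an illustrative all-in cost, and the function names are hypothetical.

```python
# Convert an hourly GPU rate into monthly and yearly costs for 24/7 usage,
# and estimate how many rented hours would equal a given self-hosted cost.

HOURS_PER_YEAR = 24 * 365               # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730


def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH


def yearly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_YEAR


def amortized_hourly(total_cost: float, years: int) -> float:
    """Spread a one-off cost evenly over every hour of the period."""
    return total_cost / (years * HOURS_PER_YEAR)


def break_even_hours(total_cost: float, hourly_rate: float) -> float:
    """Rented hours at which cloud spend equals the self-hosted cost."""
    return total_cost / hourly_rate


# Assumption: $350/month x 36 months, derived from the table above.
ASSUMED_TOTAL_COST = 350 * 36

for rate in (3.3, 4.0, 1.6, 2.2):
    print(f"${rate}/h -> ${monthly_cost(rate):,.0f}/month, ${yearly_cost(rate):,.0f}/year")
print(f"Amortized self-hosted rate: ${amortized_hourly(ASSUMED_TOTAL_COST, 3):.2f}/h")
print(f"Break-even vs $3.3/h: {break_even_hours(ASSUMED_TOTAL_COST, 3.3):,.0f} rented hours")
```

Under these assumptions, the break-even point depends heavily on how many hours per month a rented GPU would actually be kept running, which is why utilization matters as much as the hourly rate.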

Lessons Learned Building a Blackwell AI Workstation

Building this workstation involved far more than buying the hardware. It was a project that required work across multiple layers:

Hardware selection
Ubuntu installation
NVIDIA drivers
Selecting the right CUDA version
Selecting a CUDA-enabled PyTorch build
Setting up the native vLLM inference engine
Blackwell GPU compatibility testing
Benchmarking, concurrent user testing
Testing the entire stack for compatibility

Some of the lessons that stand out:

1. GPU Selection is Easy. Software Compatibility is Not

Selecting the GPU was easy, but getting a working software stack was very difficult. Blackwell is a recent GPU generation with new drivers, and many libraries and frameworks have not yet published compatible wheels. The real challenges stemmed from:

Driver compatibility
CUDA compatibility
Inference framework support
Containerization
Partitioning the GPU

During the setup process, I encountered multiple segmentation faults, kernel panics and out of memory errors before arriving at a working software stack.

2. Plan Model Storage Strategy Early

Large models consume significant storage and generate heavy disk I/O.

During initial testing, downloading and managing large Hugging Face models resulted in I/O bottlenecks and deadlocks.

Some lessons learnt:

Use dedicated storage for model repositories. Establish a model registry to prevent duplicate copies of large language models.
Optimize Hugging Face caching and download settings to prevent aggressive disk activity.
Avoid placing operating system workloads and model storage on the same drive or mount point.
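One way to apply the storage lessons above is to point the Hugging Face cache at a dedicated location and verify that it does not share a filesystem with the operating system. The sketch below relies on the `HF_HOME` environment variable, which the Hugging Face libraries read to locate their cache. The mount point `/mnt/models` and both helper names are hypothetical and should be adapted to the actual system.

```python
import os
from pathlib import Path

# Hypothetical mount point for a dedicated model drive; adjust as needed.
MODEL_ROOT = Path("/mnt/models")


def on_separate_filesystem(model_dir, system_dir="/") -> bool:
    """True if model_dir is on a different filesystem (partition or drive)
    than system_dir. Both paths must already exist."""
    return os.stat(model_dir).st_dev != os.stat(system_dir).st_dev


def configure_hf_cache(model_root) -> Path:
    """Point the Hugging Face cache at model_root through HF_HOME.

    Call this before importing huggingface_hub or transformers, since the
    variable is read when those libraries are first imported.
    """
    hf_home = Path(model_root) / "huggingface"
    os.environ["HF_HOME"] = str(hf_home)
    return hf_home


if MODEL_ROOT.exists():
    if not on_separate_filesystem(MODEL_ROOT):
        print(f"Warning: {MODEL_ROOT} shares a filesystem with the OS")
    print(f"Hugging Face cache set to {configure_hf_cache(MODEL_ROOT)}")
else:
    print(f"{MODEL_ROOT} not found; leaving the Hugging Face cache unchanged")
```

Keeping a single cache location also helps avoid duplicate copies of the same model, since every tool that honors `HF_HOME` resolves to the same directory.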

3. Docker Should Be Part of the Design From Day One

One of the biggest challenges was CUDA compatibility across the entire software stack.

At the time of writing, the vLLM inference engine is validated against CUDA 12.9, while Blackwell GPUs are designed to take advantage of CUDA 13.2 features that are not available in 12.x.

Running this stack natively resulted in compatibility issues and segmentation faults. The solution was to use Docker with the NVIDIA Container Toolkit.

This approach:

Isolates the runtime environments: I can have CUDA 13.x installed on the workstation and CUDA 12.9 in the Docker container.
Makes deployments reproducible
Simplifies dependency management
Support multiple Hugging Face in case of benchmark testing

For modern AI infrastructure, containerization should be considered a foundational requirement rather than an optional enhancement.
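As an illustration of the containerized setup, the sketch below assembles a `docker run` command for a vLLM server with GPU access through the NVIDIA Container Toolkit. The image tag `vllm/vllm-openai:latest`, the model identifier and the cache path are assumptions for illustration and not necessarily the exact configuration used on this workstation. The script only builds and prints the command; it does not start a container.

```python
import shlex
from pathlib import Path


def vllm_docker_command(model, image="vllm/vllm-openai:latest",
                        port=8000, hf_cache=None):
    """Build (but do not run) a docker command that serves a model with vLLM."""
    hf_cache = hf_cache or Path.home() / ".cache" / "huggingface"
    return [
        "docker", "run", "--rm",
        "--gpus", "all",      # expose GPUs via the NVIDIA Container Toolkit
        "--ipc=host",         # share host IPC so the container has enough shared memory
        "-p", f"{port}:8000",  # map a host port to the server port in the container
        "-v", f"{hf_cache}:/root/.cache/huggingface",  # reuse the host model cache
        image,
        "--model", model,
    ]


# Assumed model identifier, used only as an example.
command = vllm_docker_command("Qwen/Qwen3-30B-A3B-Base")
print(shlex.join(command))
```

Because the CUDA runtime ships inside the image, the host only needs a driver recent enough to support it, which is what allows a CUDA 12.9 container to run on a host with CUDA 13.x installed.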

4. Mixed Vendor GPUs Can Create Surprising Issues

The Ryzen 9 9950X includes an integrated AMD GPU alongside the dedicated NVIDIA GPU.

Many AI frameworks automatically enumerate all available GPUs during initialization, since they are designed for multi-GPU environments.

In this workstation, the presence of GPUs from two different vendors resulted in runtime conflicts and library mismatches.

The solution was to disable or blacklist the integrated AMD GPU for all AI workloads, thereby ensuring that the NVIDIA device is the one that gets picked up.
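A complementary, process-level guard is to pin CUDA workloads to a specific NVIDIA device before any framework is imported. The sketch below uses the `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` environment variables. It is not the fix described above: it only restricts what the CUDA runtime sees and does not hide the integrated AMD GPU from other runtimes. The device index 0 and the helper name are assumptions.

```python
import os


def pin_to_cuda_device(index: int = 0) -> None:
    """Restrict CUDA in this process to a single device.

    Must be called before importing torch, vllm or any other CUDA library,
    because the variables are read when the CUDA runtime initializes.
    """
    # Order devices by PCI bus ID so the index is stable across reboots.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


# Assumption: the NVIDIA card is CUDA device 0 on this machine.
pin_to_cuda_device(0)
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```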

Final Thoughts

After completing this project, one conclusion stands out:

Building an AI workstation is not primarily a hardware exercise – it is a systems engineering project that happens to include a powerful GPU.

The GPU may be the most important component, but the real work is in defining a scalable, stable and reproducible software stack.

For developers, startups, consultants and product builders, a self-hosted AI workstation can provide exceptional value. It offers complete control over operational costs, freedom from token anxiety and the ability to build sophisticated AI products.