Train a 120 Billion Parameter Model on One GPU. It Actually Works.
The GPU memory wall has been the gatekeeper for serious LLM training from the start: if your model does not fit in VRAM, you need more GPUs, more money, and more infrastructure. MegaTrain just made that wall optional for a lot of teams.
A paper dropped on arXiv on April 6