Spaceballs the Datacenter 2: The Search for More Bandwidth

At 4,152 words, this post will take 17 minutes to read.

Yogurt, the ancient and wise mystic from Spaceballs, clutching his cane in a dark cave—reluctant guide to orbital economics.

A few weeks ago, xAI and SpaceX announced a merger, and Musk claimed the new business model would be AI datacenters in space. As a wise man once wrote, “This has made a lot of people very angry and been widely regarded as a bad move.”

While I was discussing the news with some space-inclined friends, they asked for someone to “do the math” and show that AI datacenters + space = bad move. I took up that charge.

In my first breakdown of the xAI–SpaceX datacenters, I concluded that building an AI-training datacenter in space is complete and utter bupkis. I used a lot of math and memes. I relentlessly made fun of Musk. We all had a very good time.

After posting my write-up to Substack and LinkedIn, I heard from literally tens of people who had valuable insights into the problem. Most notably, the CTO of Starcloud, the first (and, as of this writing, the only) company to have successfully launched and operated a GPU on a satellite. His team has built a very real, very operational BallSat called Starcloud-1.

I should probably listen to him, right? Yeah, I thought so too.

He graciously pointed out where my math was wrong, where it was right, and what I should investigate to build a more accurate model. He also shared the big hairy engineering problems they still need to solve to make this whole thing work. This is the sort of industry collaboration I appreciate. Real people sharing real information to do real things that, hopefully someday, make the world a little better place to live. You guys are cool.

This exchange and others like it got me thinking, “What would it take to build a REAL Spaceballs Datacenter with REAL BallSats that has very REAL business value?” Could I, a lowly cyber ops nerd who hasn’t touched thermodynamics since junior year, design such a thing?

Probably not. But I’m going to try anyway. Buckle up.

Part 1. Mea Culpa

First, let’s point and laugh at everything I did wrong the first go-round.

Wrong heat transfer equation. Spaceballs-1 treated space like a terrestrial cooling problem. The CTO of Starcloud shared that space thermal management doesn’t work this way. A radiator in space emits based on its own temperature and absorbs environmental heat loads from three sources (direct solar, Earth IR, and Earth albedo). Net heat rejection is the difference. Duh. Any semi-competent engineer would know that. Rule one about cyber: we don’t talk about cyber fight club. Rule two: we aren’t competent engineers. With the new inputs, my model ended up matching the CTO’s benchmark of >1,000 W/m².
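
If you want to poke at the corrected physics yourself, here is a minimal sketch of that radiator balance. Every input below is a placeholder I picked for illustration, not a Starcloud number; whether a real design clears the ~1,000 W/m² mark depends on the radiator temperature, its coatings, and how much sun it actually sees.

```python
# Sketch of the corrected radiator balance. All inputs are placeholders.
# Net rejection = what the radiator emits (Stefan-Boltzmann) minus what it
# absorbs from direct solar, Earth IR, and Earth albedo.

SIGMA = 5.670e-8  # Stefan-Boltzmann constant, W/m^2/K^4

def net_rejection_w_per_m2(temp_k, emissivity, absorptivity,
                           solar=1361.0, earth_ir=240.0, albedo=410.0):
    """Net heat rejected per square metre of radiator (W/m^2).
    solar, earth_ir, and albedo are incident fluxes in W/m^2 (ballpark LEO values)."""
    emitted = emissivity * SIGMA * temp_k**4
    absorbed = absorptivity * (solar + earth_ir + albedo)
    return emitted - absorbed

# Made-up example: hot radiator, good coating, pointed away from the sun (solar ~ 0)
print(net_rejection_w_per_m2(temp_k=380, emissivity=0.92, absorptivity=0.09, solar=0.0))
# -> roughly 1,030 W/m^2 with these particular inputs
```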

Today’s tech vs tomorrow’s inventions. I used legacy NASA data for most inputs. This made all my calculations pretty conservative and, much like the original Spaceballs, based on 1980s technology. Spaceballs-2 lets every parameter (all 80+ of them) switch between two scenarios. Conservative uses ISS-era hardware specs and standard LEO orbits. Optimistic uses next-gen solar panels, lightweight radiators, and a sun-synchronous terminator orbit that eliminates eclipse entirely. The delta between scenarios is enormous. A 64-GPU BallSat weighs 6,067 kg conservative vs. 1,301 kg optimistic. I think it’s fair to assume that, as long as a technology doesn’t violate the laws of physics, it will improve at a linear rate (and sometimes exponentially, like in the case of Moore’s Law).
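
To make the scenario switch concrete, here is roughly the shape of it. The parameter names and numbers below are illustrative stand-ins, not the model's actual 80+ inputs; only the two satellite masses quoted above come from the real model.

```python
# Illustrative sketch of the two-scenario parameter switch in Spaceballs-2.
# Names and values are stand-ins, not the model's actual inputs.

PARAMS = {
    "radiator_areal_density_kg_per_m2":    {"conservative": 8.0,  "optimistic": 1.5},
    "solar_array_specific_power_w_per_kg": {"conservative": 80.0, "optimistic": 200.0},
    "eclipse_fraction":                    {"conservative": 0.35, "optimistic": 0.0},  # terminator orbit
}

def inputs_for(scenario: str) -> dict:
    """Collapse the two-scenario table into one set of inputs for a model run."""
    return {name: values[scenario] for name, values in PARAMS.items()}

conservative = inputs_for("conservative")
optimistic = inputs_for("optimistic")
print(conservative["eclipse_fraction"], optimistic["eclipse_fraction"])  # 0.35 0.0
```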

Bus body as a physical object. Spaceballs-1 gave the BallSat’s computer, avionics, comms, and other components zero physical volume. Spaceballs-2 estimates BallSat volume from packing density, derives surface area, and computes the thermal contribution. Turns out the BallSat body is a net heat rejector. GPU waste heat warms it enough that it radiates more than it absorbs from the environment. Small effect, but it’s real and it’s free.

Compute performance. Spaceballs-1 never said how much compute the satellite delivers. Spaceballs-2 adds verified compute specs and calculates total satellite compute, memory bandwidth and capacity, and inference token throughput. I didn’t end up using this in my final analysis. But it did get me thinking about other possible business models for datacenters in space.

Business model comparison. Spaceballs-1 asked “how much does a BallSat cost?” Spaceballs-2 asks “is there a business model and architecture where it makes money?” The new model evaluates five workload types against terrestrial alternatives: AI training, AI inference, public cloud, sovereign cloud, and edge/CDN compute. For each modality it captures terrestrial operator costs, customer pricing, utilization rates, bandwidth requirements, latency tolerances, and deployment scale. It then compares against what the satellite can actually deliver.

Comms modeling. Spaceballs-1 had none. Spaceballs-2 models telemetry, optical inter-satellite links (ISLs), and ground contact windows, then evaluates bandwidth sufficiency for all business models. It calculates latency and identifies shortfalls. This happens to be something I know a lot about, and it turns out to be the single most important addition to the model. Hence the article title. Hey, wouldn’t you know it, there’s that pesky Chekhov’s Gun again…

Part 2. Debunking (most of) the Options

Remember Mr. Talkie from Part 1? The 6 kg afterthought that I slapped onto the BallSat like a sticker on a Trapper Keeper? In Part 1, I waved my hand and said “comms hardware has gotten remarkably compact” and moved on to more interesting problems.

Turns out Mr. Talkie runs this whole show.

See, I spent most of Spaceballs-1 obsessing over heat and mass, Chekhov’s thermal gun, the relationship between Mr. Radiator and Mr. Engine. All real problems. All worth modeling. But while I was busy computing radiator areas and reaction wheel torques, the communications problem sat quietly in the corner, unassuming, waiting for its moment. Like a junior analyst at McKinsey who just realized the entire strategy deck has a fatal flaw on slide one (spoiler alert: every McKinsey deck has a fatal flaw on slide one. It’s the part that says “McKinsey”).

I would now like to introduce the protagonist of Spaceballs-2. Mr. Talkie’s bigger, meaner, more consequential older brother: the inter-satellite link.

The Bandwidth Wall

Spaceballs-1 modeled a single-GPU satellite with a basic Ka-band transceiver, enough bandwidth to phone home and maybe push 100 Mbps of useful data. For a single lonely GPU doing inference in the void, that’s… technically fine. Sad, but fine.

Spaceballs-2 asks a harder question. What if these satellites need to talk to each other? What if you want your GPUs to cooperate? What if, God forbid, you want to train a model?

You need inter-satellite links. Laser-based optical terminals that beam data between satellites. Starlink uses them. And honestly, the engineering behind them genuinely impresses me. Here’s where the technology stands as of early 2026:

Production hardware: Each Starlink V2 satellite carries three optical ISL terminals running at roughly 100 Gbps per link. That’s real, deployed, operational hardware serving millions of customers. SpaceX solved this at scale. Credit where it’s due.

Near-term demonstrated: China’s Shanghai Institute of Optics and Fine Mechanics (SIOM) set a record of 400 Gbps for a single-channel free-space optical link in a lab demonstration (not space-flown hardware, but still a viable proof of concept).

The future roadmap: Coherent wavelength-division multiplexing could push individual links toward 1 Tbps. ESA’s HydRON program envisions terabit-class optical networking in orbit. These are 2030+ technologies.

Now for NVIDIA’s GPU needs.

NVLink 5.0, the interconnect inside an NVIDIA DGX node, gives every B200 GPU 1.8 terabytes per second of bidirectional bandwidth to its neighbors. Quick unit conversion for the uninitiated: a capital B (TB/s) means terabytes per second; a lowercase b (Tbps) means terabits per second. Eight bits per byte. So 1.8 TB/s = 14.4 Tbps of bandwidth per GPU.

Our future ISL bandwidth: 0.40 Tbps

NVLink requirement: 14.4 Tbps

The gap: 36×

We have a garden hose. We need a fire hydrant. And that 36× gap is the optimistic scenario using China’s lab record, which hasn’t been replicated in orbit. With production Starlink hardware, the gap balloons to 144×.
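
The arithmetic behind those gap numbers, if you want to check my work, is a one-liner (link rates are the ones cited above):

```python
# The bandwidth-gap arithmetic. One byte is eight bits, so NVLink 5.0's
# 1.8 TB/s per GPU is 14.4 Tbps.

nvlink_per_gpu_tbps = 1.8 * 8  # 14.4 Tbps

isl_tbps = {
    "Starlink V2 production ISL":  0.100,  # ~100 Gbps per optical link
    "SIOM lab record (not flown)": 0.400,  # 400 Gbps single channel
}

for name, tbps in isl_tbps.items():
    print(f"{name}: gap of {nvlink_per_gpu_tbps / tbps:.0f}x vs NVLink")
# -> 144x with production hardware, 36x with the lab record
```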

I’ll say it louder for the people in the cheap seats: even the best optical inter-satellite links on the planet (well, off the planet) deliver 36 to 144 times less bandwidth than what GPUs need to work together on a training job.

You cannot train a large AI model across multiple satellites. You cannot synchronize gradients. You cannot do tensor parallelism across an ISL. The physics say no. The photons say no. Chekhov’s Gun says BLAM.

The satellite IS the cluster. (Quick definition for the non-datacenter crowd: a cluster is a networked group of GPU-equipped computers that splits a single massive computation into smaller pieces, runs them in parallel, and reassembles the result.) If your GPUs can’t fit on one BallSat, they can’t cooperate. ISLs cannot provide enough throughput to bridge clusters on different satellites. Full stop.

So What Fits on a BallSat?

The realization that you cannot bridge between clusters on different BallSats fundamentally reframes the entire problem. In Part 1, I imagined vast constellations of BallSats working together in some kind of orbital GPU hivemind. Beautiful vision. Physically impossible because of comms. If you’re not convinced of this yet, I can’t help you.

Instead, we’re stuck with whatever GPUs we can cram onto a single BallSat. So let’s size the bird for each business model and see what the physics allows.

For AI Training, bigger is better, hence the term “hyperscaler.” We have yet to reach the upper limit for scale when it comes to AI training. We need to put as many GPUs as possible onto our BallSat. That number lands around 128 GPUs per bird. Sixteen NVIDIA DGX-class nodes, each with 8 GPUs connected by NVLink internally. Between nodes on the same satellite, they use standard networking (100–400 Gbps Ethernet or InfiniBand). Across satellites? Garden hose.

Can a 128-GPU satellite realistically train a 7-billion parameter model? Easy. Fits on 1–4 GPUs. The other 124 twiddle their thumbs.

How about a 70-billion parameter model? Doable. Sixteen to thirty-two GPUs. The satellite’s NVLink nodes handle the heavy lifting within each 8-GPU group, and on-board networking handles the pipeline boundaries.

A 175-billion parameter model, GPT-3 class? Nope. Needs 384+ GPUs. BLOOM used 384 A100s across 48 nodes. That’s three of our BallSats that can’t talk to each other fast enough to cooperate.

Frontier models? Forget it. Those jobs consume 10,000 to 100,000+ GPUs in tightly coupled clusters with sub-microsecond latency. Our constellation could have a million GPUs and they’d still be useless for this workload. So Musk’s grand vision of training frontier AI in space? Our BallSat can train… a model from 2022. Groundbreaking.

AI Inference flips the script. Independent inference requests can be load-balanced across completely disconnected GPUs with only standard networking, sidestepping the ISL bottleneck entirely. A single GPU handles 7B–13B model inference. Four to eight GPUs cover 70B models. Larger models (GPT-4 class) need 128+ GPUs but can be partitioned across nodes. Each satellite serves requests independently; adding more satellites increases throughput linearly. No tight coupling required. 64 GPUs per satellite, one full inference node per bird.

Public Cloud customers want three things satellites can’t deliver well: elastic scaling (spin up 8 GPUs now, 256 tomorrow), low latency, and hardware variety. The typical enterprise customer rents 1–8 GPUs for development, 16–64 for production, and 256+ only for major runs. We can’t offer that top tier (ISL constraint), and the duty-cycle-limited ground downlink means intermittent connectivity. You’re locked into one hardware config until the satellite deorbits. Still, 64 GPUs gives us a mid-tier offering to test against the economics.

Sovereign cloud and edge/CDN round out the lineup. Sovereign Cloud deployments focus on data sovereignty and security for classified or national-security workloads. The value proposition here is that data never touches foreign soil and legal jurisdiction follows flag-of-registration. A single satellite carries 64 GPUs, enough for the batch intelligence and translation pipelines that dominate this market. Edge/CDN sits at the opposite extreme. Cloudflare runs roughly 13 open-source AI models per GPU across 300+ cities. Akamai uses single RTX GPUs per edge location. These workloads are compute-local, bandwidth-light, and geographically distributed. A satellite constellation mirrors that architecture naturally. 8 GPUs per bird, maximum flexibility.

Use Case          GPUs/BallSat
AI Training       128
AI Inference      64
Public Cloud      64
Sovereign Cloud   64
Edge/CDN          8

Five Use Cases Enter. Most Don’t Survive.

OK. Training mega-models in space remains bupkis. We established that in Part 1. The comms analysis just slammed the coffin shut and welded it closed.

But what about other workloads? Cloud compute serves many masters. Maybe one of them doesn’t mind the constraints of orbital life.

Five contestants. One BallSat. Let’s see who gets voted off the island.

Episode 1: AI Training Gets the Boot

We already killed this one. Twice. The bandwidth gap alone (36–144×) makes cross-satellite gradient synchronization physically impossible. The latency gap (10,000× worse than InfiniBand) buries it further. And the economics scatter the ashes. Training drowned in the ocean before it reached the island. Goodnight, sweet prince.

Episode 2: AI Inference Takes a Swim

This one hurts, because inference actually has the right physics.

Unlike training, inference requests are independent. When you ask ChatGPT a question, your query goes to one GPU (or a small group of GPUs sharing a model via tensor parallelism within a single node). It doesn’t need to synchronize with thousands of other GPUs. It computes your answer. It sends it back. Done.

This means inference can scale horizontally across disconnected GPUs. Each BallSat serves requests independently. Add more BallSats, serve more users. No tight coupling required. The bandwidth between BallSats becomes irrelevant because each bird handles its own workload. According to research from Introl, load balancing inference across thousands of disconnected GPUs works well with basic networking infrastructure.

So inference passes the architecture test. Wonderful. Where does it die?

The coverage problem.

Your inference API needs to be… available. As in, “a user sends a request and gets a response.” For that to happen, the BallSat needs a communication path to the ground. This is called the ground contact duty cycle, or the fraction of time the satellite can communicate with a ground station. Our Spaceballs-1 model assumed 60 minutes of ground contact per day. That’s a 4.2% duty cycle. An inference API that works 4.2% of the time would make a dial-up modem blush.

“So build more ground stations!” Sure. Here are our options:

Option A: A handful of dedicated ground stations (3-5 stations). Gets you to roughly 40-60% duty cycle. Better. Still means your API drops out for hours at a time. Your SLA reads like a snow forecast. “Service will probably possibly work this afternoon, 60% of the time. No guarantees though!”

Option B: A global ground station network (15-20 stations). SpaceX-level infrastructure spread across multiple continents. Gets you to 85-95% duty cycle. Costs hundreds of millions of dollars in ground infrastructure. I thought the whole point of space-based compute was to avoid building and powering stuff on the ground?

Option C: ISL mesh relay. Route all traffic through a chain of satellites until you reach one that can see a ground station. Starlink does this for internet service and it works beautifully. For inference latency? Each hop adds 3-4 ms. Route through five to ten satellites and you’ve added 15-40 ms before the GPU even starts thinking. For real-time chatbot-style inference, users notice.

Option D: GEO bent-pipe relay. Park a relay satellite in geostationary orbit (35,786 km up) and bounce all traffic through it. Continuous coverage, problem solved. Except…

The signal path goes ground station → GEO relay satellite → LEO compute satellite → process → LEO satellite → GEO relay → ground station.

Leg 1, ground to GEO: 35,786 km ÷ 299,792 km/s = 119 ms. Leg 2, GEO down to LEO: 35,236 km ÷ 299,792 km/s = 118 ms. Leg 3, back up to GEO: 118 ms. Leg 4, GEO down to ground: 119 ms.

Total propagation delay: 474 ms. Round trip. Before the GPU computes a single floating-point operation. That signal path covers roughly 142,000 kilometers of vacuum at light speed, and no amount of clever engineering can compress the geometry when your relay station parks itself at geostationary altitude.

Half a second of latency just from the relay geometry. Add compute time and ground networking and you’re pushing 600-700 ms for first token. Your chatbot now responds like it’s contemplating the meaning of life before answering “What’s 2+2?”
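
Here is the bent-pipe propagation math as a sketch, so you can see there is no trick to it: just distance divided by the speed of light, four times.

```python
# Propagation delay for the GEO bent-pipe relay. Vacuum light speed only:
# no compute time, no ground networking, no queueing.

C_KM_PER_S = 299_792   # speed of light
GEO_ALT_KM = 35_786    # geostationary altitude
LEO_ALT_KM = 550       # compute satellite altitude

legs_km = [
    GEO_ALT_KM,               # ground station -> GEO relay
    GEO_ALT_KM - LEO_ALT_KM,  # GEO relay -> LEO compute satellite
    GEO_ALT_KM - LEO_ALT_KM,  # LEO -> GEO (response)
    GEO_ALT_KM,               # GEO -> ground station
]

round_trip_ms = sum(legs_km) / C_KM_PER_S * 1000
print(f"{sum(legs_km):,} km of path, {round_trip_ms:.0f} ms round trip")
# -> about 142,000 km and ~474 ms before the GPU does any work
```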

And then the economics. Our model calculates a break-even price of $6.48 per GPU-hour for space-based inference. The terrestrial market rate sits around $5 per GPU-hour. Annual ROI is negative 81%. For every dollar you spend, you get back 19 cents.

Inference works about as well as my New Year’s resolution to run a marathon. The coverage problem kills the user experience, and the economics bury the corpse.

Episode 3: Public Cloud, Mercifully Quick

This one I’ll keep short. It deserves a short death.

Public cloud customers expect three things: elastic scaling (spin up 8 GPUs now, 256 tomorrow), low latency (sub-50ms round trip), and rich instance variety (pick your GPU, your memory, your networking).

Our satellite offers a fixed 64-GPU configuration, latency dependent on orbital mechanics and relay architecture, and one hardware config until the satellite deorbits in five years. You cannot scale up. You cannot scale down. You cannot swap in a newer GPU when NVIDIA ships the B300x-e-L-s-w-nimbus-4000. Your cloud has the elasticity of a brick.

Bandwidth: FAIL. Cloud customers expect 10+ Gbps of network per GPU. Standard stuff. An AWS p4d instance gives you 400 Gbps of networking shared across 8 GPUs. Our satellite delivers 1.56 Gbps per GPU via ground link. That’s 0.16× the requirement. Six times too little.

Economics: FAIL. Same $6.48/GPU-hr break-even, competing against AWS, Azure, and GCP, three companies with decades of operational optimization, millions of customers amortizing infrastructure costs, and ground-based datacenters that cost a twentieth of what a satellite costs per GPU.

Public cloud in space. It’s like opening a Blockbuster in 2024. The product doesn’t match what customers want, the price doesn’t match what they’ll pay, and the competition has a thirty-year head start.

Episode 4: Edge/CDN, the Heartbreaker

Now this one. This one stings. Because edge computing in space actually makes technical sense.

Edge workloads are small. Cloudflare runs 13 open source AI models per GPU across their 300+ global edge locations. Akamai’s edge compute uses a single RTX GPU per location. The typical edge deployment is one to four GPUs doing inference on small models, transcoding video, caching content.

Our bandwidth requirement is 0.5 Gbps per GPU. Available bandwidth is 1.56 Gbps per GPU. PASS, with a comfortable 3.1× ratio. The first bandwidth PASS for any use case.

Latency? LEO at 550 km altitude means roughly 9 ms round-trip for direct ground contact. Edge workloads tolerate up to 20 ms. PASS.
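
For the spreadsheet-inclined, the edge requirement check is just two ratios, using the model numbers quoted above:

```python
# Edge/CDN requirement check, using the model's numbers.

available_gbps_per_gpu = 1.56   # ground-link bandwidth per GPU
required_gbps_per_gpu = 0.5     # typical edge workload
rtt_ms = 9                      # LEO at ~550 km, direct ground contact
rtt_budget_ms = 20              # what edge workloads tolerate

print(f"bandwidth: {available_gbps_per_gpu / required_gbps_per_gpu:.1f}x requirement")  # ~3.1x
print(f"latency:   {rtt_budget_ms / rtt_ms:.1f}x headroom")                             # ~2.2x
```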

The compute architecture fits perfectly, too. Edge workloads don’t need inter-GPU coupling. Each satellite processes independently. LEO satellites distributed across orbital planes create a natural global edge mesh, which is literally what edge computing wants to be. The geography baked into orbital mechanics matches the geographic distribution that edge compute needs.

So why is edge eliminated?

Money.

Cloudflare charges $0.02-0.08 per GB for CDN delivery. Akamai’s pricing sits in the same ballpark. These companies operate on razor-thin margins because CDN and edge compute are commodities. The product differentiator between CDN providers comes down to pennies per gigabyte and milliseconds of latency.

Our model shows edge/CDN annual ROI at negative 72%. To match terrestrial CDN economics, we’d need to cut satellite costs by an order of magnitude. And even then, we’d still be competing with companies that have two decades of terrestrial infrastructure already deployed and paid for.

Edge is the one that hurts the most. It’s the technically perfect solution to a problem nobody will pay to solve from orbit. Like a beautifully engineered bridge to a place nobody wants to visit.

Episode 5: Sovereign Cloud. The Last One Standing.

And then there was one.

I almost didn’t include sovereign cloud in the original model. Seemed too niche. Too small-market. Too “what even is sovereign cloud?” for an article about Spaceballs.

Glad I did. Sovereign cloud is the only use case where the physics and the economics align. That’s right ladies and gentlemen, you heard it here first. Sovereign Cloud works in space.

Bandwidth: PASS. Sovereign compute workloads are batch-oriented. Government agencies processing classified documents. National AI models running translation pipelines. Intelligence analysis churning through intercepted communications. These jobs don’t need real-time streaming bandwidth to each BallSat. They need enough pipe to upload the job queue, download the results, and occasionally push a model update.

The bandwidth requirement for sovereign batch processing is roughly 1 Gbps per GPU. Available bandwidth in our model is at least 1.56 Gbps per GPU via ground downlink. The ratio is 1.56×. PASS.

Latency: PASS. And not just barely. Our satellite achieves roughly 9 ms round-trip latency for direct ground contact. Sovereign workloads tolerate up to 500 ms. We’re 55 times faster than the requirement. Latency is so irrelevant here it’s almost embarrassing to include in the analysis.

Why? Because sovereign customers run batch jobs. Upload a thousand classified PDFs. Come back in an hour to grab the results. Latency matters for the data transfer but not for the end-user experience. And for batch processing, the ground contact duty cycle barely matters either. A 30% duty cycle (achievable with 2-3 dedicated ground stations in the customer’s territory) still delivers 324 terabytes of data movement per day through a 100 Gbps downlink. For 64 GPUs running batch inference that’s more than enough to keep them saturated around the clock.
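
The 324-terabyte figure is straight arithmetic, sketched here with the same assumptions (100 Gbps downlink, 30% duty cycle):

```python
# Data moved per day through a 100 Gbps downlink at a 30% duty cycle.

downlink_gbps = 100
duty_cycle = 0.30
seconds_per_day = 86_400

tb_per_day = downlink_gbps * duty_cycle * seconds_per_day / 8 / 1000  # Gb -> GB -> TB
print(f"{tb_per_day:.0f} TB/day")  # -> 324 TB/day
```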

Scale fit: PASS. A typical sovereign deployment runs 8-64 GPUs. Our satellite carries 64. One satellite per customer deployment. No multi-satellite coordination needed. No ISL bandwidth problem. Each customer gets their own dedicated, isolated orbital compute node.

And now the economics.

Sovereign cloud commands a premium. A big one. Every failed use case exposed the same root cause: space infrastructure costs too much for markets where terrestrial providers have already spent decades grinding their margins down to commodity levels. Sovereign cloud escapes that trap. While standard cloud GPU-hours sell for $2-5, sovereign cloud customers pay $15-25 per GPU-hour and sometimes more for classified workloads. They’re paying for a guarantee that their data stays within their control, their jurisdiction, their security perimeter. That guarantee carries real value when you’re a nation-state processing state secrets.

Our model’s break-even price is $6.48 per GPU-hour. The sovereign market rate is $25 per GPU-hour.

Margin per GPU-hour is $18.52. Annual revenue per satellite is $3.77 million. Annual cost per satellite is $3.26 million. Annual ROI is 15.7%. Your payback period ends up being 4.3 years.
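
Here is that math as a sketch, using the model’s rounded headline numbers as inputs. Presumably the unrounded model outputs are behind the 15.7% above; with these rounded figures you land a hair lower.

```python
# Sovereign cloud economics, using the model's rounded headline numbers.

breakeven_per_gpu_hr = 6.48   # $/GPU-hr, model break-even
market_per_gpu_hr = 25.00     # $/GPU-hr, sovereign market rate
annual_revenue = 3.77e6       # $/yr per satellite (model output)
annual_cost = 3.26e6          # $/yr per satellite (model output)

margin_per_gpu_hr = market_per_gpu_hr - breakeven_per_gpu_hr   # $18.52
annual_roi = (annual_revenue - annual_cost) / annual_cost      # ~0.156 with rounded inputs

print(f"margin ${margin_per_gpu_hr:.2f}/GPU-hr, ROI {annual_roi:.1%}")
```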

With a BallSat mission life of five years, you’re in the black before the bird deorbits.

Houston, we have a business model!

I stared at these numbers for a while. Checked them twice. Had Claude rebuild the model from scratch and verify every formula. The math holds.

One use case out of five. Sovereign cloud, alone and blinking in the wreckage of all the other dead business models, somehow turns a profit.

Why Sovereign Survives

What’s happening here is subtle but critical for business strategy.

Every other use case fails because space-based compute competes with terrestrial compute on performance. Speed. Bandwidth. Elasticity. Cost per FLOP. Terrestrial wins every single one of those fights. Ground-based datacenters run faster, cost less, flex more, and connect better. No contest. Trying to beat AWS on price from orbit is like trying to outswim Michael Phelps while wearing a snowsuit.

Sovereign cloud plays a different game entirely. The sovereign customer doesn’t buy performance. The sovereign customer buys a property of the deployment itself. Physical inaccessibility. Jurisdictional isolation. An air gap enforced by 550 kilometers of vacuum. A government agent with a warrant can walk into a datacenter in Frankfurt. An intelligence operative with the right creds ($5,000 in a paper bag) can access a server room in Singapore.

A GPU in orbit? Nobody’s accessing that. The physical security is enforced by the most effective perimeter in human history.

And today the market values that property at a 5× premium. Add “extraterrestrial technology” to your marketing collateral and you can probably bump that to 6×.

Conclusion

So the only viable space datacenter is a sovereign one.

I started this project to make fun of Elon Musk. I’m still making fun of Elon Musk. His plan to build an AI training datacenter in space remains complete and utter crap.

But somewhere between the thermal equations and the bandwidth calculations, something unexpected emerged. The punchline turned into a real question. The real question turned into an insight about Sovereign Cloud, which is not a space story. It’s a cyber story. And cyber stories are my thing.

So stay tuned for Spaceballs the Datacenter 3: Cyber Goes Plaid