updating my priors

Giving software away for free


If you want to create completely free software for other people to use, the absolute best delivery mechanism right now is static HTML and JavaScript served from a free web host with an established reputation.

Thanks to WebAssembly, the range of software that can be served in this way is vast and, I think, under-appreciated. Pyodide means we can ship client-side Python applications now!

This assumes that you would like your gift to the world to keep working for as long as possible, while granting you the freedom to lose interest and move on to other projects without needing to keep covering expenses far into the future.

Even the cheapest hosting plan requires you to monitor and update billing details every few years. Domains have to be renewed. Anything that runs server-side will inevitably need to be upgraded someday - and the longer you wait between upgrades the harder those become.

My top choice for this kind of thing in 2025 is GitHub, using GitHub Pages. It's free for public repositories and I haven't seen GitHub break a working URL that they have hosted in the 17+ years since they first launched.

A few years ago I'd have recommended Heroku on the basis that their free plan had stayed reliable for more than a decade, but Salesforce took that accumulated goodwill and incinerated it in 2022.

It almost goes without saying that you should release it under an open source license. The license alone is not enough to ensure regular human beings can make use of what you have built though: give people a link to something that works!

Tags: open-source, heroku, webassembly, javascript, web-standards, html, github, pyodide


Curtis Yarvin Fears His Authoritarian Fantasy Is Flopping


Advisory: This post discusses genocide and mass murder.

The Point: Yarvin Sees Doom and Failure Ahead

The accelerated destruction of the United States, the global economy, and the modern liberal democratic order is underway. Yet Curtis Yarvin – Peter Thiel’s “house philosopher,” who has been advocating such extremism for years – is disappointed. The San Francisco software programmer behind the so-called Dark Enlightenment doesn't seem to be enjoying the fruits of his own revolution. In fact, he sees failure ahead.

The Back Story: Fear and Loathing in Washington D.C.

It’s April 2025 and President Elon Musk – the CEO-dictator of the United States – is busy dismantling government. This demolition comes straight out of Yarvin’s playbook, which calls for a CEO-dictator to conduct a mass purge of government employees.

[Related: ‘Reboot’ Revealed: Elon Musk’s CEO-Dictator Playbook – In 2022, one of Peter Thiel’s favorite thinkers envisioned a second Trump Administration in which the federal government would be run by a “CEO”.]

Though regarded by many as a mere internet troll, Yarvin – whom Vice President J.D. Vance affectionately calls a “reactionary fascist” (and often quotes) – is now getting his due. His extreme theories on replacing democracy with corporate dictatorship are finally being tested in Washington. The Financial Times, the New York Times and Time Magazine have all paid homage to his role in shaping our current reality.

But there’s a problem: Yarvin is unfulfilled. Musk’s destruction of government apparently does not meet the vaunted Dark Enlightenment guru’s standards. In fact, he gives the effort a mediocre grade and says it will likely fail:

After six weeks, is Trump 47 going well? It is and it isn’t. Frankly, I give it a C-. While still far below its potential, it at least has not failed. Which is frankly amazing.

What is frustrating about this administration is that it has the opportunity to win and the strength to win, but neither (it seems) the will or the understanding to win. So, it’s going to lose. But it is not yet fated to lose.

The key issue: Yarvin doesn’t think Musk has the competence to carry out a revolutionary transformation of American government. “Not only can Elon not manage an operation on this basis – God himself could not manage it,” he writes.

Despite the ongoing destruction in Washington, Yarvin complains it’s simply not enough: 

Unless the spectacular earthquakes of January and February are dwarfed in March and April by new and unprecedented abuses of the Richter scale, the Trump regime will start to wither and eventually dissipate. It cannot stay at its current level of power—which is too high to sustain, but too low to succeed. It has to keep doing things that have never been done before. As soon as it stops accelerating, it stalls and explodes.

Yarvin is a long-winded writer. This paragraph comes from a bloated, 7,000-word monstrosity titled “Barbarians and Mandarins,” published on March 6. A competent editor would slash it by 90 percent – maybe more.

The essay brims with false dichotomies, logical inconsistencies, half-baked metaphors, and allusions to genocide. It careens from Romanian tractor factories to Harvard being turned “into dust. Into quarks” with the coherence of a meth-addled squirrel.

Yarvin's fetish for authoritarian governance matches his prose: undisciplined, self-indulgent, and ultimately impotent. But I digress...

Analysis: Three Causes of Yarvin's Disappointment

Allow me to summarize Yarvin’s whiny gobbledygook in three main points:

1. He's upset that Musk/Trump aren't being authoritarian enough

While most Americans are shocked by how the current administration is breaking norms, Yarvin thinks it’s a middling effort. He wants full dictatorship:

First, the government needs to be run top-down from the Oval Office. This is why we call it the “executive” branch. “Executive” is a literal synonym of “monarchical”—from “mono,” meaning “one,” and “archy,” meaning “regime.” “Autocratic” is fine too. The “executive branch” is the “autocratic branch,” or should be if English is English. Libs: if these words don’t mean what they mean, what do they mean?

According to Yarvin:

Power creates power. The more power you use, the more power you have...

In his view, half-measures are worse than nothing at all. Musk/Trump must increase the severity of “things that have never been done before” or watch their regime “wither and eventually dissipate.” The real point, he declares, is to “take power from the libs, then keep it.”

Apparently, Musk/Trump are falling short:

Unless the monarch is ready to actually genocide the nobility or the masses, he has to capture their loyalty – or, in liberal parlance, obtain their consent. That’s just how it works. You’re not going to foam these people, like turkeys with bird flu. Right? That means your only option is to convince them to love you, right?

Notice the casual mention of genocide. For good measure, he links to the Wikipedia entry for foam depopulation:

Foam depopulation or foaming is a means of mass killing farm animals by spraying foam over a large area to obstruct breathing and ultimately cause suffocation. It is usually used to attempt to stop disease spread.

Yarvin has a penchant for evoking frames of genocide and violence (even though he looks like a guy who’d surrender his lunch money with nary a whimper). He stops short of calling for literal genocide – but keeps reaching for the language of it: mass death, institutional purging, systemic annihilation. And he rarely seems to notice – or care – just how grotesque that really is.

More on this in the next section.

2. He thinks the entire system needs to be “cremated” – and then rebuilt – not reformed

Yarvin has no patience for simple budget cuts and bureaucratic layoffs. He wants total destruction, followed by a radical rebuilding. Or something like that.

“DC does not need better policies,” he insists. “It needs a complete reboot—as complete as the denazification of Germany in 1945.”

For example: "Every existing institution of science, outside the scientists and the labs themselves, must be fully cremated in a nuclear autoclave."

Again with the flippant use of violent imagery. First, foaming, now crematories. Are we talking about a Holocaust?

Yarvin again uses a mass murder metaphor to describe the firing of employees at the National Science Foundation and National Institutes of Health:

 Someone in DOGE hacked the law (hacking is good, taking dramatic actions is good) by realizing that a certain class of administrative employees in NSF and NIH, so-called ‘probationary employees,’ could be legally shot without a trial. A review of unused drainage ditches in Bethesda showed adequate excess capacity. DOGE acted. The customer service records show few or no complaints about seepage, odors, etc…

Seepage and odors? Not only does Yarvin use a killing metaphor to describe the fates of federal employees – he also fantasizes about them rotting in mass graves. A healthy mind does not do such things, and these regular invocations of death and violence say much about Yarvin. He bathes his brain in the putrefying flesh of imaginary bureaucrats. Oh, and – in a positively flaming Freudian backflip – his parents were career federal employees.

Yarvin contradicts himself often. For example, he wants everything nuked – buildings included. Yet he's oddly miffed that Musk/Trump want to shut down the Department of Education:

Why do you want to shut down the Department of Education? Don’t you want to be the one writing the “Dear Colleague” letters? Doesn’t education in the US need to be completely rebooted – from kindergarten to university? How else are you going to do that, except with all the dotted lines that come out of the Department of Education?

Apparently, it's very frustrating to be a reactionary fascist theorist without also being a reactionary fascist dictator. If you want authoritarianism done right, you must do it yourself.

3. He's learning that “revolutionaries” make terrible administrators

After years of theorizing about the overthrow of democracy, Yarvin has discovered that turning theory into practice is much harder than it looks.

Those of us with experience in government and politics already knew this. Getting a single, simple change can take years. Overthrowing the entire U.S. government? Not a task for addled, inarticulate amateurs.

Yarvin seems shocked that Musk (with zero government experience except for a reliance on government contracts) is doing such a poor job. He also seems chagrined at Trump's specific actions, like tariffs, which are throwing markets into chaos. In political terms, the whole thing is a total disaster. By early 2025, millions of Americans are marching in streets and even Republican voters are showing up at town halls to express seething anger.

Yarvin seems to have come to a stark realization: If Musk/Trump fail to destroy the democratic system, extreme anti-government pseudo-intellectuals may face a bleak future. He urgently warns them against half-assing the revolution – and reveals his own fear of what might happen when the pendulum swings in the other direction:

But in the end, it is the road toward winding up in the foam yourself—probably with me, for all my troubles. (This is what usually happens to right-wing intellectuals, actually.)

Conclusion: Yarvin Fears The Foam

Yarvin sees his desired revolution unraveling under the weight of its own stupidity. He pines for intensified destruction, but it's not clear that will happen – at least not to the extent he considers necessary. So, in creeps the fear.

In early 2025, Yarvin is coming to terms with the towering mediocrity of his ideas. His complaints about the CEO-Dictator system prove exactly why it's inferior to American democracy. Despotism is a dog with no name. Once you let it go, you can't call it back.

Ironically, Yarvin may soon find himself grateful for the very institutions – democracy, bureaucracy, and laws – he enjoys deriding. Our liberal system, imperfect as it may be, does not stack its citizens in mass graves for having terrible ideas.

The only foam Yarvin needs to fear is the flecks on the corners of his mouth. May they serve as cruel reminders of the hysteria that led him here.




An Open Letter to ICE Regarding My Potential Disappearance


Dear ICE official(s),

I noticed you recently detained your first Iranian foreign national. As a first-generation Iranian American, I’ve been conditioned to assume this is a testament to our great Persian culture. At least I’m sure that’s what my dad will say. Is he from Shiraz?

I just had a few questions ahead of any potential deportation and/or the disappearance of myself or my family members.

I know a lot of people might criticize your lack of a formal DEI initiative, but I want to commend the diversity of the first batch of students you’re detaining. It’s like the cast of The Sex Lives of College Girls, if you swapped the white girls for a South Korean and a Palestinian.

Speaking of college, are you deporting only Western, South, and Eastern Asians (and Muslim Africans) with impressive educations? I noticed that, thus far, most of them are pursuing their PhDs. If it helps, despite pleas from my dad, I have no interest in being an MD or PhD.

I also noticed the students are in fields like mechanical engineering or have elite credentials like being a Fulbright Scholar. Again, my dad really wanted me to pursue that path. “Just get a business degree,” he told me. “What about being a lawyer?” he asked. I think he would have settled for a minor in business. But I just got an English degree, and my GPA was not great. Does that help my case?

Also, are you deporting only these legal residents with ties to college campuses, or do you plan on expanding to other places, like cultural centers or the Halal Guys?

I guess what I’m wondering is, are you super committed to universities, and what would you consider a “tie” to a university? If I go to a college campus once a week to take my kids to piano lessons, will you abduct me there, or is this really more to instill fear in all of the “good” immigrants that come to the United States to share their talents here instead of staying at home? Like, is this just an attack on the brain drain that benefits America?

I want to emphasize again that I only have an English degree. Yes, I ended up getting a master’s, but that was in social work, so again, really nothing to see here.

Speaking of preparation, I noticed you’re starting to use plainclothes agents. Bold move. Way to instill fear. Is there any sort of uniform hoodie we should be aware of, or just generally be afraid of any hooded white guys? I actually was already afraid of them, but honestly, I usually felt okay in broad daylight in a public place. Thank you for reminding me that I was never really safe.

Hypothetically speaking, if my Iranian dad had a relationship with a blonde American from the Midwest, thereby ending in my conception, will my 23andMe DNA results be taken into account when choosing what country to deport me to? Or do you detain/deport/disappear based on the highest percentage?

I deleted my 23andMe account out of concern for my data privacy, but on the off chance it will help me make my case, let me go ahead and disclose I’m actually only 49.7 percent Iranian. (I’m 0.5 percent Ashkenazi Jewish, so I physically can’t be antisemitic.) I’m 40.1 percent British and Irish.

If it’s possible to make a request, I wouldn’t mind Ireland. I know it’s not perfect, but they have access to abortion. Also, they speak out against genocide. Of course, I guess that’s what made me a target in the first place.

Thanks for your help,
Saba Khonsari


A Modern Approach to Hit Points and Communicating Damage


The relationship between game designers and hit points is a complex one. Whether in digital games or pen-and-paper, hit points tend to be the structural unit that defines a player character’s or obstacle’s resilience, and they have become how we conceptualise and communicate damage and interactive effects between designer and player. And frankly, I find that a little uninspiring, and sometimes, a little predatory.

This post describes a slightly orthogonal approach to hit points, and ends with a design challenge that attempts to remove hit points entirely.

Hit Points in Use

There are three main uses for Hit Points within a game:

  1. Increasing time-to-kill

  2. Granularity

  3. Comparative abstraction

Increasing time-to-kill is about putting a step between the player interacting with a dangerous position, and ending a play session. In Call of Duty (Infinity Ward, 2003), a player steps out of cover, takes a few rounds, and returns fire. The game has effectively communicated risk and danger, and “punished” the player without ending play. This use of hit points to extend a play session was part of Dave Arneson’s original design intent when drafting the first edition of Dungeons and Dragons (TSR, 1974): “a chance to live longer and do more.” Hit points provide a way for a “violent-state” world to interact negatively upon a character without removing the character from play. In the same way, increased hit points let enemies stay around longer, requiring more interactions from the player to change the playstate.

Granularity is about giving a data set more “steps” to pass through in order to differentiate states. For example, a character with 100 hit points theoretically has 101 states to pass through (including 0 hit points). This gives game designers a dial to tweak: an attack that does 70 damage feels fundamentally different in play from an attack that does 30 damage. This is why the old design adage warns us away from 1s and 2s: they remove granularity of playstate. They also interact with the next usage:

Comparative abstraction describes the use of hit points as a way to express how things are diegetically or narratively different from each other. A paladin is “tougher” than a wizard because a paladin has 100 hit points, and a wizard has 20 hit points. A dragon is a “more powerful” baddie because the dragon does 6d6 damage to a player’s hit points, while a goblin does 1d6. By having a spread of numbers, we can describe things as “calculably different” or “different in scale”.

Granularity and Changing States

You may notice that only the first item in that list is a functional difference. The other two are structural differences. To show you what I mean, let’s imagine a game where one character has 2 hit points, and another character has 4 hit points, twice as tough. We meet both conditions for Granularity and Comparative Abstraction: we have dials, and those dials “say something” about the diegesis. However, these structural differences don’t mean anything if all enemy attacks do 100 damage. In all cases, the time-to-kill is immediate, and the characters are functionally identical. Hit points, though they may be coded into the game, are not a part of the functional player experience.

Let’s extend this one step further. Consider a game with two weapons: the starting minipistol, and the upgraded hand cannon that we’ll call the MEGAPISTOL! For ease, these weapons are identical except for their damage stat. If enemies have 100hp, and the minipistol does 50hp of damage, then the megapistol HAS to do 100hp of damage. This is because players don’t experience enemy hp as a number of hit points, they experience it as a number of interactions required to change the game state. If the minipistol (doing 50hp of damage) and the megapistol (doing 70hp of damage) both take 2 shots to kill an enemy, then they are functionally the same to the player. I acknowledge there are other dials to turn, like number of bullets or reload time, but I’m keeping this discussion along a single axis to discuss the use of hit points.
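To make the point concrete, here is a minimal Python sketch (the enemy and weapon numbers are the hypothetical ones from the paragraph above): what the player actually experiences is the shot count, ceil(enemy_hp / damage), so any two damage values that round up to the same count are functionally identical.

    import math

    def shots_to_kill(enemy_hp: int, damage_per_shot: int) -> int:
        """The number of interactions required to change the game state (alive -> dead)."""
        return math.ceil(enemy_hp / damage_per_shot)

    ENEMY_HP = 100
    print(shots_to_kill(ENEMY_HP, 50))   # minipistol: 2 shots
    print(shots_to_kill(ENEMY_HP, 70))   # 70-damage "upgrade": still 2 shots, functionally identical
    print(shots_to_kill(ENEMY_HP, 100))  # 100-damage MEGAPISTOL: 1 shot, an actual state change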

The takeaway lesson, then, is that hit point granularity (and thus, the comparative abstraction between different weapons or enemies) doesn’t mean anything on its own. Comparative abstraction doesn’t come from granularity of numbers, but from granularity of game states. Players experience these game states by presence or absence, which means the addition of a measurable unit adds two game states: presence and absence. In the case of hit points, Alive or Dead. But a designer can define stages along that measurable unit to add additional states, and again, because of presence and absence, it leaves us with this note: for every game state defined by a measurable unit, there is also a state defined by its absence.

The options provided by that definition of game state are a magical opportunity. I’m so frustrated by our obsession with a binary of “active vs inactive” game states. In Call of Duty I am as effective after taking one round as I am beforehand. In Dungeons and Dragons the dragon and the wizard are both dealing full damage until one side loses their final hit point. Now, for a game like Call of Duty, with a short time-to-kill (usually within a quarter of a second), this lack of granularity is perfect. Players cannot consider the change of game states within a firefight. Spray, as they say, and pray. This is not a tactical approach. However, in Chess (yeah, I’m not putting a year here), the game state is usually measured with much finer comparative granularity. Material (how many pieces), position (how many activated pieces), time (in timed games), and endgame availability are all the “hit points” of Chess. While a knight and a bishop are both valued at “3 points”, any player of experience would rate one higher than the other at different states of play. The pace of the game allows that considered approach to comparative abstractions.

So why, in tactical games that have a more considered flow, like XCOM (2013), Wildermyth (2020), and yeah, Dungeons and Dragons, Fifth Edition (2014), do we not support the player in developing other interesting game states?

Additional States Generate Additional Play Experiences

Pew, pew. I shoot a laser beam from my sword, giving me and Link a ranged attack against these more dangerous enemies. Across the series (but starting at the start) The Legend of Zelda (1986) has included a game state where an Undamaged (full hit points) Link can shoot a beam from the sword. Placing a game state at the top end of hit points rewards mastery, and gives a low-risk bonus to players that are able to get through a level without taking damage.

I love Games Done Quick and the work they put in, and there’s an interesting state change in Pokemon Red (1996) that is only utilised in the speedrun. When a pokemon is on critically low health (“Red Bar”), the game prioritises the two-tone health warning music over pokemon cries and level-up jingles. This creates a “faster state” where the following have to be true:

  1. The player has to take enough damage to be put into “red bar”

  2. The player cannot take enough damage to make their pokemon faint

  3. The player must maintain this state throughout subsequent fights.

Placing the beneficial game state at the lower end of health has created a high-risk, high-reward position that players will need expertise to juggle.

Opportunities to Consider

Doom (2016) and Doom Eternal (2020) meet Pokemon’s Red Bar

Doom lives and dies (pardon the pun) off a health system that drives the player forward. In Doom, players regain health by performing melee kills against weakened enemies.

AND
IT
IS
AWESOME!

I can’t speak highly enough of this Glory Kill System, but now I want to ask, what happens if we give Doom Guy a few states to pass through? Given the dynamic up-and-down bounce of health, I think it’s appropriate to utilise a single additional state, at the bottom end of the hit point pool, maybe the last 25%. When players are in this critically reduced health state, their damage is increased by 100% through a “berserk” feature.
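As a rough sketch of how that single extra state might be wired up in Python: the 25% threshold and the doubling multiplier are the assumed values from the paragraph above, not anything from the shipped games.

    BERSERK_THRESHOLD = 0.25  # assumed: the bottom quarter of the hit point pool
    BERSERK_MULTIPLIER = 2.0  # assumed: +100% damage while the state is active

    def is_berserk(current_hp: float, max_hp: float) -> bool:
        """The single additional state: critically reduced, but not dead."""
        return 0 < current_hp <= max_hp * BERSERK_THRESHOLD

    def outgoing_damage(base_damage: float, current_hp: float, max_hp: float) -> float:
        """Player damage doubles while in the critically reduced health state."""
        return base_damage * (BERSERK_MULTIPLIER if is_berserk(current_hp, max_hp) else 1.0)

    print(outgoing_damage(40, current_hp=80, max_hp=100))  # 40.0 - healthy, no bonus
    print(outgoing_damage(40, current_hp=20, max_hp=100))  # 80.0 - "Red Bar" berserk state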

The granularity dials of hit points interact with this too. Players on low difficulty, where monsters do less damage, will find it easier to remain in this “Red Bar” state and take advantage of this benefit, giving players who are “Too Young To Die” the opportunity to feel powerful and have “clutch comebacks” against dangerous monsters. It does not, however, engage with comparative abstraction, as only one character (the player character) is engaging with this mechanic.

However, because we’re placing it at the back end of hit points, players on harder difficulties will find it more difficult to be safely put into the state, and more difficult to maintain it without dying; but where they can maintain it, they will be able to take great advantage of the bonus it provides.

The major risk is that, by making a change to health an incentive, players may not be as willing to engage in Glory Kills and maintain the momentum the game holds so dear. Given that we don’t see players preserving health by hiding away from combat while at full health, I think that is an unlikely outcome, but one to look out for nonetheless.

Dungeons and regaining dynamic states of Dragon health

The fourth edition of Dungeons and Dragons (2008) had a monster state called “bloodied”. All monsters would enter bloodied at the same point, defined as having half their hit points remaining. This state gave descriptive granularity to the GM (players would want a fictional bark that described the character as wounded), and comparative abstraction, as some monsters became more dangerous when bloodied, and some became less dangerous.

Given the exceptionally low time-to-kill in Dungeons and Dragons, I suggest that baddies may even have more states. Rather than using Bloodied as a binary on/off switch for abilities and recharges, there is the option for a state that is “Wounded”. Remember, I said that when measuring, you can create a state by presence and absence. This could be a state where the monster is not at full hit points (i.e. where the “full health” state is absent); when the monster has taken some damage, but not enough to be considered something as critical as “Bloodied”. This provides an option for turning on or off early-round threats, or showing a monster ablating under the withering attacks of the player characters. A gorgon (the magical armoured cow version) may start with an Armour Class (AC) of 20, quite high. But after that first big hit, that ablates to 16 when Scathed. The extra granularity of states gives us an early bark, and gives the gorgon an interesting early advantage to show its metallic resilience without dragging out a fight.
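A sketch of how those extra states might read in code. The thresholds (full, above half, half or below) follow the Unhurt/Scathed/Bloodied split described above; the gorgon’s AC values are the illustrative 20 and 16 from the example, not anything from a published stat block.

    def monster_state(current_hp: int, max_hp: int) -> str:
        """Three-state read of a monster's hit points: Unhurt, Scathed, or Bloodied."""
        if current_hp >= max_hp:
            return "Unhurt"
        if current_hp > max_hp // 2:
            return "Scathed"   # has taken some damage, but not yet Bloodied
        return "Bloodied"      # the classic 4e half-hit-points state

    def gorgon_armour_class(current_hp: int, max_hp: int) -> int:
        """Illustrative: the gorgon's armour ablates after the first real hit."""
        return 20 if monster_state(current_hp, max_hp) == "Unhurt" else 16

    print(gorgon_armour_class(60, 60))  # 20 - Unhurt, full metallic resilience
    print(gorgon_armour_class(45, 60))  # 16 - Scathed, armour ablated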

Four Damage States to Consider in Future Designs

Unhurt

A stage designed for alpha strike threats and ablative skills. Unhurt is defined as “not notably damaged”: this includes not having taken damage, but also having received only little scratches. A dragon with a sword through a scale may remain “unhurt” for extra comparative abstraction (“look how tough it is!”).

Unhurt is excellent for when the designer wants to provide a high-interest threat that grabs attention immediately and makes a splash, but does not create an overpowering threat.

Alternatively, Unhurt provides an option for a slow, creeping threat. Something that doesn’t reveal its full hand of cards until the fight has already started.

Usage:
“The Ferret Armoured Scout Car has extra speed while Unhurt”.
“The Shapeshifter does not need to make checks to maintain its form while Unhurt”.

Bloodied/Scathed/Damaged

This is “noticeably damaged”. “Shit, that hurt” but “I’ve had worse”. It’s a wound, it’s pain, but it’s mostly cosmetic damage. A secret agent with a round through the bicep. It’s Daredevil taking Elektra’s sai through his shoulder in the 2003 film and then fighting Bullseye minutes later with no noticeable consequence. I like shifting the term “Bloodied” earlier in the piece to increase the impact of combat. Bloodied, to me, is the villain with a split lip, tasting their own blood from their finger (or dramatically snapping their fingers). They’re not down, not even half; in fact, they’re just getting started.

Bloodied is an ideal state for showing the “adrenaline surge” of combat. You’re not going to wait until you’ve taken real damage to get that blood pumping, are you?

Usage:
“A Bloodied Barbarian adds 1d4 damage to their attacks as their humours get up.”
“A Bloodied player’s Hunger bar decreases more slowly.”

Wounded

Wounded is wrecked. Ruined. Lost limbs. For some characters this dips down on the power curve, for others it spikes up (comparative abstraction). Wounded should feel big, and, depending on game tone, messy. Wounded is a great place for barks and VFX to be reactive to the situation a character has gone through: an orc ravaged by Legolas’ arrows should LOOK a different kind of Wounded to one hit by Gimli’s axe.

Wounded is a state for showing things are nearing the end of their life, so don’t hold back on your design choices here. It’s a good place for sentient characters to break and run, or surrender like in Griftlands (Klei Entertainment, 2021). It’s also a great place for those big bloodlusty threats to dig in and fight with all of their remaining vigor.

Usage:
“The Predator gets -4 defence when Wounded - ‘If it bleeds, we can kill it.’”
“The ogre cannibal does double damage when Wounded.”

Dead

is dead. It’s an absence. It’s the punitive state or the win condition. It’s a bottom-out, rather than a functional state in itself.

Usage:
“You died” - Dark Souls (FromSoftware, 2011)
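Pulling those four states together, here is a minimal sketch of how they could sit in code. The state names come from the sections above; the attached modifiers are the illustrative ones from the usage notes, and everything else (the step-down helper, the function names) is an assumption for the sake of the example.

    from enum import Enum

    class DamageState(Enum):
        UNHURT = 0    # not notably damaged: alpha strikes and ablative skills live here
        BLOODIED = 1  # noticeably damaged, mostly cosmetic: adrenaline-surge bonuses
        WOUNDED = 2   # wrecked: break-and-run, surrender, or a desperate last stand
        DEAD = 3      # the absence: punitive state or win condition

        def step_down(self) -> "DamageState":
            """Taking a 'state of damage' moves a character one step along the track."""
            return DamageState(min(self.value + 1, DamageState.DEAD.value))

    # Illustrative per-state modifiers, lifted from the usage examples above.
    barbarian_bonus_damage = {DamageState.BLOODIED: "1d4"}
    predator_defence_penalty = {DamageState.WOUNDED: -4}

    state = DamageState.UNHURT
    state = state.step_down()  # Bloodied - split lip, just getting started
    state = state.step_down()  # Wounded - "If it bleeds, we can kill it."
    print(state.name, predator_defence_penalty.get(state, 0))  # WOUNDED -4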

Conclusion

Players think in game states, and they talk to each other in game states. Let’s let them play in game states. Numbers are a wonderful tool, and I will never begrudge their use in games, but let’s not make them our primary communication method.

There is an argument to be made that this approach recreates HP, with my four states acting as a 4 hp system, and I can see the point, but that reading is itself a result of hp having been conceptualised as changing game state for so long. We’ve become conditioned to seeing the little white numbers pop up above that boss and considering it progress.

What I’m suggesting is not so much a change in structural approach, but a change in how we communicate these outcomes to players. Even if numbers remain in the game as the structural building blocks that make damage and health happen, I’m asking us to conceptualise damage as a changing condition of the player’s play experience, not as a changing condition of an abacus.

Design Challenge

If we have four functional states, do we need hit points at all? Let’s, as an exercise, take this to its furthest conclusion and replace hit points with states entirely. Consider a game that uses hit points as it stands, and think about whether it could deliver the same experience with these states instead.

The answer won’t be yes for everyone, but as a design challenge, this will flex your understanding of how players functionally experience the dynamic movement of health for both goodies and baddies.

When you find a game that could do this, draft up some paper-prototype rules for how you would implement this, and review them using the first three elements we discussed:

  1. Time-to-kill - Does this dramatically change the flow of combat in this game? Does the new flow meet design intent?

  2. Granularity - Are there enough states to use as a dial to respond to player choices? Maybe there are too many and you don’t need both Bloodied and Wounded? What else could you tune to make weapons feel different now that you can’t just add a “+4 damage” sticker and colour it blue?

  3. Comparative Abstraction - How do the states make enemies feel different to each other? Are Bloodied enemies functionally different for players than Unhurt enemies?

Sample thoughts:

XCOM could easily use Unhurt, Bloodied, Wounded, and Dead, maybe even maintaining its Bleeding out/Dead split that is rolled when a friendly operative reaches 0 hit points.

Some draft rules would include:
Phalanx characters (with shields) ignore changes of state from the front 180 degree arc.
A sniper rifle critical sets the target’s state to Dead.
All other crits double state movement.
Light weapons (pistols) do one state of damage.
Assault weapons (rifles, SMGs) do two states of damage.
Heavy weapons do three states of damage.
A flanked enemy takes one additional state of damage.
A Faceless (big gooey tough enemy) requires two consecutive hits in a round to move from Unhurt.

Maybe a fun additional mechanic called “cued shot”, where operators line their shot up with each other to break through defensive enemies? We could also tie this in with XCOM 2’s Bonds system that joins pairs of soldiers together as bffs.
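Before checking these against the criteria, here is a rough Python sketch of the draft rules above. The weapon classes, crit handling, and flanking bonus come straight from the list; the function shape and names are assumptions for illustration only.

    STATES = ["Unhurt", "Bloodied", "Wounded", "Dead"]
    WEAPON_STATES = {"light": 1, "assault": 2, "heavy": 3}  # states of damage per hit

    def apply_hit(state: str, weapon: str, crit: bool = False, flanked: bool = False,
                  sniper_crit: bool = False) -> str:
        """Move a target along the state track according to the draft rules."""
        if sniper_crit:                  # a sniper rifle critical sets the target to Dead
            return "Dead"
        steps = WEAPON_STATES[weapon]
        if crit:                         # all other crits double state movement
            steps *= 2
        if flanked:                      # a flanked enemy takes one additional state
            steps += 1
        return STATES[min(STATES.index(state) + steps, len(STATES) - 1)]

    print(apply_hit("Unhurt", "assault"))              # Wounded - two shots to kill overall
    print(apply_hit("Wounded", "assault"))             # Dead
    print(apply_hit("Unhurt", "assault", crit=True))   # Dead - a crit one-shots a default enemy
    print(apply_hit("Unhurt", "light", flanked=True))  # Wounded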

Against our criteria:

  1. Time-to-kill remains roughly the same. An assault rifle with an appropriate tech level will move default enemies two states, which means two shots to kill, unless you crit. That’s about where it is. There’s some tweaking around an assault rifle crit doing 3 or 4 “states”, meaning one or two hits to kill, but I’d be happy to take that to playtesting.

  2. Granularity is, I think, maybe a little weak. XCOM is a GREAT test for this, because one of its most fun elements was enemy variety. However, that’s also a good teaching point for this kind of health system. A 4 hit point “Thin Man” (acid-spitting guy in a suit) is a functionally different enemy to a 4 hit point “Floater” (jetpack cyborg who can zoom behind your cover to flank). They even use the same weapon (the light plasma rifle), but given the other dials left to turn, they still feel totally different to play against. For this reason I want to keep an eye on granularity during playtesting.

  3. Comparative Abstraction is a tough one to decide on my own because it’s so much about feeling. I can make guesses, but I might not be able to accurately predict a player’s understanding of the abstraction at play. I think one of the hardest abstractions to communicate will be increasing damage by tech level. In base XCOM this is simple to communicate: a Floater has 4 health. A Heavy Floater has 14 health. Each still takes either one shot from a heavy weapon or two shots from light weapons (as player technology advances at the same rate as alien reveals), but the player can easily see that change just by the number of bars at the top. This would probably be won and lost in the missions between when a player upgrades their equipment and when they first encounter Heavy Floaters (or, for the unfortunate souls, vice versa). That emotional impact still lands in base XCOM, and the advantage here is that it would be communicated with barks and VFX rather than some white numbers floating above the enemy’s head.


How Many People Live Paycheck to Paycheck?


Bernie Sanders is a big fan of citing troubling economic statistics. One of the figures that he features in his rotation is that 60 percent of Americans live “paycheck to paycheck.” This number consistently irritates certain wonks and so I’ve decided to do a deep dive into the controversy to see what I can make of it. In short, I’ve found that the phrase “paycheck to paycheck” is not consistently defined and that efforts to debunk the claim rely upon data that don’t convincingly do so.

Paycheck-to-Paycheck Surveys

LendingClub (60%)

The figure Bernie cites appears to come from the Paycheck-to-Paycheck report, which was a series of monthly reports put out by LendingClub between June 2021 and December 2023. The methods of this report are opaque. LendingClub claims to have surveyed around 2,500 to 3,000 consumers for each report, but the reports do not make clear whether they are simply asking people if they live paycheck to paycheck or deducing this in some way using personal financial information. Nonetheless, the LendingClub report found that 52 percent to 64 percent of consumers lived paycheck to paycheck during the months they surveyed.

LendingClub defines someone as living paycheck to paycheck if they have “no money left over after spending their earnings.” Put differently, for LendingClub, someone is living paycheck to paycheck if they currently have a low savings rate. Thus, as they explain in their June 2021 report, “one can have a good chunk of money in the bank as well as a good salary and still struggle to make ends meet.”

BankRate (34%)

Another estimate comes from a YouGov survey commissioned by BankRate. I could not find the precise question that was asked in that survey, but, unlike the LendingClub report, the BankRate write-up of the survey makes it clear that they explicitly asked people whether they were living paycheck-to-paycheck and 34 percent of workers answered that they were.

The BankRate write-up defines “paycheck to paycheck” this way:

The expression, “living paycheck to paycheck,” generally refers to having little or no money for savings left over from your paycheck after covering your regular expenses. You might be unable to pay your bills if you suddenly become unemployed or don’t receive the next paycheck.

The first sentence of this definition refers to currently having a low savings rate, but the elaboration in the second sentence refers to someone who has a low level of emergency savings to draw upon in the case of a negative income shock. These are not the same thing, though I suppose it is literally true that the former “might” lead to the latter. So, as best as I can tell, BankRate, like LendingClub, is using a low savings rate definition for the term, though it is unclear whether the survey respondents were asked the question with this definition or just asked more generally whether they live paycheck to paycheck.

Bank of America (50% or 26%)

Yet another estimate comes from Bank of America. In a survey meant to be representative of the US population, Bank of America asked how strongly people agree or disagree with the statement “I am living paycheck to paycheck.” Just under 50 percent of people answered that they strongly or somewhat agreed. In this part of the report, the authors state that the phrase “living paycheck to paycheck” can refer to “individuals or households that regularly spend nearly all of their income, leaving little to nothing left over for savings.” This is the low savings rate definition.

In this same report, Bank of America analyzed bank account data for a sample of their customers in order to determine how many of them spend 95 percent or more of their household income on necessity spending, which they define as “childcare, external credit card payments, gasoline, general retail, grocery, housing (mortgage/rent), insurance, cable TV/broadband, public transportation, tax payments, vehicle costs and payments.” Using this method, they conclude that 26 percent of people live paycheck to paycheck so defined.

Bank of America does not present the 26 percent figure as a debunking of the 50 percent figure. Instead, they point out that their lower number reflects their “focus on necessity spending,” ostensibly meaning they recognize that their particular definition of paycheck-to-paycheck living is very narrow. Another limitation of the 26 percent figure is that it comes from a sample of Bank of America customers, which are not representative of the overall population, and, even among those customers, they only have access to the income and spending information that those customers run through their Bank of America accounts. For instance, customers that are also doing necessity spending on a Citi credit card that they carry a balance for will end up miscounted.

This necessity-spending approach is seemingly intended to provide an extreme measurement aimed at establishing an absolute lower bound of possible estimates. Thus, the authors of the report conclude that, despite the extremity of their definition, which cuts the self-reported paycheck-to-paycheck figure in half, their findings “are significant and do suggest a relatively large proportion of households are living paycheck to paycheck.”

Conceptual Issues

Already in the three reports above, we see one of the problems with this discourse: the phrase “living paycheck to paycheck” is ambiguous. The Bank of America report calls this out explicitly, stating that the phrase is “somewhat nebulous and is not always clearly defined.” Across the three reports, we see at least three different ways of understanding the term:

  1. Low savings rate
  2. High necessity spending rate
  3. Low emergency savings level

These three things are related to one another, but not exactly the same.

There are also other complications with defining the phrase. For example, someone who currently lives off labor income and does not have enough money to retire could reasonably describe themselves as living “paycheck to paycheck.” How else are they living? Certainly not “dividend to dividend.”

Should the phrase apply to retired people who, by definition, do not receive paychecks? Do students live paycheck to paycheck? Disabled people? Stay-at-home parents? Kids? On any given month, around 50 percent of the population does not work. This includes 40 percent of adults aged 18 and above as well as 25 percent of adults between the ages of 18 and 64. Do these people inherit the paycheck-to-paycheck status of their overall household or family unit? Or should we analyze them in a more individualized way?

I point this all out only to illustrate that, when we move away from looking at how people self-describe and start looking directly at personal financial data to determine how many people live paycheck to paycheck, there are lots of possible ways to do it, and many complications to work through.

Government Data

The most prominent critic of the idea that paycheck-to-paycheck living is very common is Matt Darling. His rebuttal typically consists of two figures from the Survey of Consumer Finances (SCF) and one survey question from the Survey of Household Economics and Decisionmaking (SHED). Both are government surveys conducted or sponsored by the Federal Reserve.

Survey of Consumer Finances

The first SCF figure Darling cites is median net worth, which in 2022, stood at $192,700. The problem with this figure is that it includes non-financial assets like one’s home and car and relatively illiquid financial assets like balances in retirement and education accounts. To address these problems, Darling provides a second SCF figure that more closely matches the financial resources people tend to think about as being available to them in the case of an emergency, i.e. their liquid financial assets. In 2022, median liquid assets stood at $7,850.

What’s interesting about this $7,850 figure is that Darling (and Ben Krauss) presents it as self-evidently debunking the paycheck-to-paycheck argument. But does it? The median income in the SCF data is $70,529. This means that median liquid savings is only 40 days of median income. If you would run out of liquid savings in 40 days, do you not live paycheck to paycheck? How few days does it need to be? 30 days? 20 days?

If we drop the elderly from the sample based on the observation that they are almost all retired and therefore not receiving any paychecks, the result is that median liquid savings fall to $6,700 and median income increases to $77,825. Thus, for the non-elderly who actually receive paychecks, liquid savings would only last 31 days at the median.
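As a quick check of that arithmetic (the dollar figures are the 2022 SCF medians quoted above; converting to days is just liquid savings divided by annual income, times 365):

    def days_of_income(liquid_savings: float, annual_income: float) -> float:
        """How many days of income a household's liquid savings would cover."""
        return liquid_savings / annual_income * 365

    print(round(days_of_income(7_850, 70_529), 1))  # ~40.6 days, all households
    print(round(days_of_income(6_700, 77_825), 1))  # ~31.4 days, non-elderly households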

If we walk through each household and divide their liquid savings by their income to determine how many days of income they have in liquid savings, we can see what this looks like across the entire distribution (graph truncated at 80th percentile to keep the scale manageable).

Assuming the liquid-savings-to-income ratio is the right thing to look at, this is the best graph for assessing “paycheck to paycheck” status that there is. But where exactly do you draw the line for living paycheck to paycheck? At the 60th percentile, which is the figure Bernie likes to use, households have 57 days of savings. For non-elderly households, it’s 46 days of savings. The paycheck-to-paycheck claims just don’t seem that outlandish by this measure.

And this metric is using the SCF, which is a survey of household finances, not individual finances, and which defines households using a “primary economic unit” concept that effectively excludes any adult that is not the economically dominant individual or couple in their household. So, struggling young adults who live with mom and dad are not independently surveyed and are instead lumped in with their parents. Adjusting further for that would knock a few more days off all of these figures. So would weighting the results by individuals rather than households, as the former would end up counting minor children as among the population that lives paycheck to paycheck while the latter really does not.

Survey of Household Economics and Decisionmaking

The SHED figure that Darling likes to cite asks respondents the following question:

Have you set aside emergency or rainy day funds that would cover your expenses for 3 months in case of sickness, job loss, economic downturn, or other emergencies?

Around 54 percent of respondents answer yes while the other 46 percent answer no. From this, Darling concludes that at least 54 percent of Americans are not living paycheck to paycheck. But SHED asks these people other similar questions and the answers to those questions don’t line up well with the conclusion that these 54 percent of Americans can actually handle three months of expenses.

For instance, SHED asks the following:

Based on your current financial situation, what is the largest emergency expense that you could handle right now using only your savings?

The below graph contains the distribution of answers to this question among the 54 percent who say they have three months of emergency expenses saved up:

So, 24 percent of the people who say they have three months of emergency savings also say they cannot afford an emergency expense of $2,000 or more. If we divide $2,000 by three months, we get $667 per month. In 2023, the poverty line for a single adult was $1,215 per month and not all of these people are single adults. Is it really the case that someone can afford three months of expenses if they don’t have enough savings to cover three months worth of deep poverty living? I am skeptical to say the least.

If we define someone as living paycheck to paycheck if they either say they do not have three months of emergency savings or say they cannot afford a $2,000 emergency expense, then SHED tells us 59 percent of American adults are living paycheck to paycheck, which is of course just 1 point shy of the Bernie-favored 60 percent figure.

If we exclude the retired from this calculation, since they don’t receive paychecks, or assign the paycheck-to-paycheck status of adults to their minor children (children are not counted in SHED), the number would go even higher than that.

Of course, this is all using the low emergency savings level definition of “paycheck to paycheck” living, which seems to be Darling’s preferred approach. The SHED also asks people whether, in the last month, they spent more than, less than, or about the same amount as their income, which aligns with the low savings rate definition of “paycheck to paycheck” living. Fifty-two percent of respondents say they spent more than or the same amount as their income.

Conclusion

Given the inherent ambiguities of the phrase, I am certainly not going to try to argue that there is an obvious “real” number out there for paycheck-to-paycheck living, nor am I going to vouch for self-reported answers to that or similar questions in surveys. But at the same time, the idea that it is most definitely not sixty percent, that Bernie is being obviously crazy in saying that, seems pretty silly, especially after you probe the SCF and SHED data that is meant to debunk the claim.

As always, it’s important to also keep in mind what exactly we are implying when we talk about living paycheck to paycheck. I am all for people having personal savings, but it’s also the case that, in a well-designed economic system, big financial shocks are smoothed over, not by one’s own personal assets, but through the welfare state. Large expenditures due to health problems should be handled by public health insurance. Income declines resulting from job loss or disability should be covered by unemployment and disability benefits. Economic security should not depend on an uninterrupted flow of paychecks and good health, but it also should not depend on building up large amounts of liquid assets.


Things we learned about LLMs in 2024


A lot has happened in the world of Large Language Models over the course of 2024. Here's a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.

This is a sequel to my review of 2023.


The GPT-4 barrier was comprehensively broken

In my December 2023 review I wrote about how We don’t yet know how to build GPT-4 - OpenAI's best model was almost a year old at that point, yet no other AI lab had produced anything better. What did OpenAI know that the rest of us didn't?

I'm relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board) - 70 models in total.

[Screenshot: Chatbot Arena leaderboard table showing models ranked 52 to 69, with Arena scores between 1186 and 1207, including GLM-4-0520, Llama-3-70B-Instruct and Gemini-1.5-Flash-8B-Exp-0827.]

The earliest of those was Google's Gemini 1.5 Pro, released in February. In addition to producing GPT-4 level outputs, it introduced several brand new capabilities to the field - most notably its 1 million (and then later 2 million) token input context length, and the ability to input video.

I wrote about this at the time in The killer app of Gemini Pro 1.5 is video, which earned me a short appearance as a talking head in the Google I/O opening keynote in May.

Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.

Longer inputs dramatically increase the scope of problems that can be solved with an LLM: you can now throw in an entire book and ask questions about its contents, but more importantly you can feed in a lot of example code to help the model correctly solve a coding problem. LLM use-cases that involve long inputs are far more interesting to me than short prompts that rely purely on the information already baked into the model weights. Many of my tools were built using this pattern.

Getting back to models that beat GPT-4: Anthropic's Claude 3 series launched in March, and Claude 3 Opus quickly became my new favourite daily-driver. They upped the ante even more in June with the launch of Claude 3.5 Sonnet - a model that is still my favourite six months later (though it got a significant upgrade on October 22, confusingly keeping the same 3.5 version number. Anthropic fans have since taken to calling it Claude 3.6).

Then there's the rest. If you browse the Chatbot Arena leaderboard today - still the most useful single place to get a vibes-based evaluation of models - you'll see that GPT-4-0314 has fallen to around 70th place. The 18 organizations with higher scoring models are Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01 AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21 Labs, Princeton and Tencent.

Training a GPT-4 beating model was a huge deal in 2023. In 2024 it's an achievement that isn't even particularly notable, though I personally still celebrate any time a new organization joins that list.

Some of those GPT-4 models run on my laptop

My personal laptop is a 64GB M2 MacBook Pro from 2023. It's a powerful machine, but it's also nearly two years old now - and crucially it's the same laptop I've been using ever since I first ran an LLM on my computer back in March 2023 (see Large language models are having their Stable Diffusion moment).

That same laptop that could just about run a GPT-3-class model in March last year has now run multiple GPT-4 class models! Some of my notes on that:

This remains astonishing to me. I thought a model with the capabilities and output quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.

These models take up enough of my 64GB of RAM that I don't run them often - they don't leave much room for anything else.

The fact that they run at all is a testament to the incredible training and inference performance gains that we've figured out over the past year. It turns out there was a lot of low-hanging fruit to be harvested in terms of model efficiency. I expect there's still more to come.

Meta's Llama 3.2 models deserve a special mention. They may not be GPT-4 class, but at 1B and 3B sizes they punch massively above their weight. I run Llama 3.2 3B on my iPhone using the free MLC Chat iOS app and it's a shockingly capable model for its tiny (<2GB) size. Try firing it up and asking it for "a plot outline of a Netflix Christmas movie where a data journalist falls in love with a local ceramacist". Here's what I got, at a respectable 20 tokens per second:

[Screenshot: MLC Chat running Llama 3.2 3B. The response begins: “Here’s a plot outline for a Netflix Christmas movie: Title: ‘Love in the Clay’ ... We meet our protagonist, JESSICA, a data journalist who has just returned to her hometown of Willow Creek, a small, charming town nestled in the snow-covered mountains. She’s back to work on a story about the town’s history and the effects of gentrification on the local community.”]

Here's the rest of the transcript. It's bland and generic, but my phone can pitch bland and generic Christmas movies to Netflix now!

LLM prices crashed, thanks to competition and increased efficiency

The past twelve months have seen a dramatic collapse in the cost of running a prompt through the top tier hosted LLMs.

In December 2023 (here's the Internet Archive for the OpenAI pricing page) OpenAI were charging $30/million input tokens for GPT-4, $10/mTok for the then-new GPT-4 Turbo and $1/mTok for GPT-3.5 Turbo.

Today $30/mTok gets you OpenAI's most expensive model, o1. GPT-4o is $2.50 (12x cheaper than GPT-4) and GPT-4o mini is $0.15/mTok - nearly 7x cheaper than GPT-3.5 and massively more capable.

Other model providers charge even less. Anthropic's Claude 3 Haiku (from March, but still their cheapest model) is $0.25/mTok. Google's Gemini 1.5 Flash is $0.075/mTok and their Gemini 1.5 Flash 8B is $0.0375/mTok - that's 27x cheaper than GPT-3.5 Turbo last year.

I've been tracking these pricing changes under my llm-pricing tag.

These price drops are driven by two factors: increased competition and increased efficiency. The efficiency thing is really important for everyone who is concerned about the environmental impact of LLMs. These price drops tie directly to how much energy is being used for running prompts.

There's still plenty to worry about with respect to the environmental impact of the great AI datacenter buildout, but a lot of the concerns over the energy cost of individual prompts are no longer credible.

Here's a fun napkin calculation: how much would it cost to generate short descriptions of every one of the 68,000 photos in my personal photo library using Google's Gemini 1.5 Flash 8B (released in October), their cheapest model?

Each photo would need 260 input tokens and around 100 output tokens.

260 * 68,000 = 17,680,000 input tokens
17,680,000 * $0.0375/million = $0.66
100 * 68,000 = 6,800,000 output tokens
6,800,000 * $0.15/million = $1.02

That's a total cost of $1.68 to process 68,000 images. That's so absurdly cheap I had to run the numbers three times to confirm I got it right.
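Here's the same napkin maths as a quick Python sketch - the per-photo token counts and the Gemini 1.5 Flash 8B prices are the estimates quoted above:

photos = 68_000
input_tokens = 260 * photos        # 17,680,000
output_tokens = 100 * photos       # 6,800,000

input_cost = input_tokens / 1_000_000 * 0.0375   # input pricing, $/million tokens
output_cost = output_tokens / 1_000_000 * 0.15   # output pricing, $/million tokens

print(f"${input_cost + output_cost:.2f}")        # $1.68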

How good are those descriptions? Here's what I got from this command:

llm -m gemini-1.5-flash-8b-latest describe -a IMG_1825.jpeg

Against this photo of butterflies at the California Academy of Sciences:

(Photo: two butterflies feeding on a red tray.) The model responded:

A shallow dish, likely a hummingbird or butterfly feeder, is red. Pieces of orange slices of fruit are visible inside the dish.

Two butterflies are positioned in the feeder, one is a dark brown/black butterfly with white/cream-colored markings. The other is a large, brown butterfly with patterns of lighter brown, beige, and black markings, including prominent eye spots. The larger brown butterfly appears to be feeding on the fruit.

260 input tokens, 92 output tokens. Cost approximately 0.0024 cents (that's less than a 400th of a cent).
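If you wanted to actually run that job, a minimal sketch is to loop over the photos and shell out to the same llm command shown above - the photos/ directory and the output file naming here are hypothetical:

import subprocess
from pathlib import Path

for photo in sorted(Path("photos").glob("*.jpeg")):
    result = subprocess.run(
        ["llm", "-m", "gemini-1.5-flash-8b-latest", "describe", "-a", str(photo)],
        capture_output=True, text=True, check=True,
    )
    # Save each description next to its photo
    photo.with_suffix(".txt").write_text(result.stdout)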

This increase in efficiency and reduction in price is my single favourite trend from 2024. I want the utility of LLMs at a fraction of the energy cost and it looks like that's what we're getting.

Multimodal vision is common, audio and video are starting to emerge

My butterfly example above illustrates another key trend from 2024: the rise of multi-modal LLMs.

A year ago the single most notable example of these was GPT-4 Vision, released at OpenAI's DevDay in November 2023. Google's multi-modal Gemini 1.0 was announced on December 7th 2023 so it also (just) makes it into the 2023 window.

In 2024, almost every significant model vendor released multi-modal models. We saw the Claude 3 series from Anthropic in March, Gemini 1.5 Pro in April (images, audio and video), then September brought Qwen2-VL and Mistral's Pixtral 12B and Meta's Llama 3.2 11B and 90B vision models. We got audio input and output from OpenAI in October, then November saw SmolVLM from Hugging Face and December saw image and video models from Amazon Nova.

In October I upgraded my LLM CLI tool to support multi-modal models via attachments. It now has plugins for a whole collection of different vision models.

I think people who complain that LLM improvement has slowed are often missing the enormous advances in these multi-modal models. Being able to run prompts against images (and audio and video) is a fascinating new way to apply these models.

Voice and live camera mode are science fiction come to life

The audio and live video modes that have started to emerge deserve a special mention.

The ability to talk to ChatGPT first arrived in September 2023, but it was mostly an illusion: OpenAI used their excellent Whisper speech-to-text model and a new text-to-speech model (creatively named tts-1) to enable conversations with the ChatGPT mobile apps, but the actual model just saw text.

The May 13th announcement of GPT-4o included a demo of a brand new voice mode, where the true multi-modal GPT-4o (the o is for "omni") model could accept audio input and output incredibly realistic sounding speech without needing separate TTS or STT models.

The demo also sounded conspicuously similar to Scarlett Johansson... and after she complained, the voice from the demo, Sky, never made it to a production product.

The delay in releasing the new voice mode after the initial demo caused quite a lot of confusion. I wrote about that in ChatGPT in “4o” mode is not running the new features yet.

When ChatGPT Advanced Voice mode finally did roll out (a slow roll from August through September) it was spectacular. I've been using it extensively on walks with my dog and it's amazing how much the improvement in intonation elevates the material. I've also had a lot of fun experimenting with the OpenAI audio APIs.

Even more fun: Advanced Voice mode can do accents! Here's what happened when I told it I need you to pretend to be a California brown pelican with a very thick Russian accent, but you talk to me exclusively in Spanish.

OpenAI aren't the only group with a multi-modal audio model. Google's Gemini also accepts audio input, and the Google Gemini apps can speak in a similar way to ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that's meant to roll out in Q1 of 2025.

Google's NotebookLM, released in September, took audio output to a new level by producing spookily realistic conversations between two "podcast hosts" about anything you fed into their tool. They later added custom instructions, so naturally I turned them into pelicans:

The most recent twist, again from December (December was a lot) is live video. ChatGPT voice mode now provides the option to share your camera feed with the model and talk about what you can see in real time. Google Gemini have a preview of the same feature, which they managed to ship the day before ChatGPT did.

These abilities are just a few weeks old at this point, and I don't think their impact has been fully felt yet. If you haven't tried them out yet you really should.

Both Gemini and OpenAI offer API access to these features as well. OpenAI started with a WebSocket API that was quite challenging to use, but in December they announced a new WebRTC API which is much easier to get started with. Building a web app that a user can talk to via voice is easy now!

Prompt driven app generation is a commodity already

This was possible with GPT-4 in 2023, but the value it provides became evident in 2024.

We already knew LLMs were spookily good at writing code. If you prompt them right, it turns out they can build you a full interactive application using HTML, CSS and JavaScript (and tools like React if you wire up some extra supporting build mechanisms) - often in a single prompt.

Anthropic kicked this idea into high gear when they released Claude Artifacts, a groundbreaking new feature that was initially slightly lost in the noise due to being described halfway through their announcement of the incredible Claude 3.5 Sonnet.

With Artifacts, Claude can write you an on-demand interactive application and then let you use it directly inside the Claude interface.

Here's my Extract URLs app, entirely generated by Claude:

(Screenshot: the Extract URLs tool, with content pasted in and a list of extracted URLs displayed below it.)

I've found myself using this a lot. I noticed how much I was relying on it in October and wrote Everything I built with Claude Artifacts this week, describing 14 little tools I had put together in a seven day period.

Since then, a whole bunch of other teams have built similar systems. GitHub announced their version of this - GitHub Spark - in October. Mistral Chat added it as a feature called Canvas in November.

Steve Krouse from Val Town built a version of it against Cerebras, showcasing how a 2,000 token/second LLM can iterate on an application with changes visible in less than a second.

Then in December, the Chatbot Arena team introduced a whole new leaderboard for this feature, driven by users building the same interactive app twice with two different models and voting on the answer. Hard to come up with a more convincing argument that this feature is now a commodity that can be effectively implemented against all of the leading models.

I've been tinkering with a version of this myself for my Datasette project, with the goal of letting users use prompts to build and iterate on custom widgets and data visualizations against their own data. I also figured out a similar pattern for writing one-shot Python programs, enabled by uv.
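Here's a minimal sketch of what one of those one-shot programs can look like. uv reads the inline script metadata block at the top, installs the dependencies into a throwaway environment and runs the file with uv run fetch_title.py - the script name and its contents are made-up examples:

# /// script
# requires-python = ">=3.12"
# dependencies = ["httpx"]
# ///
import re
import sys

import httpx

# Fetch a page and print its <title> - a throwaway single-file tool
html = httpx.get(sys.argv[1]).text
match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
print(match.group(1).strip() if match else "(no title found)")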

This prompt-driven custom interface feature is so powerful and easy to build (once you've figured out the gnarly details of browser sandboxing) that I expect it to show up as a feature in a wide range of products in 2025.

Universal access to the best models lasted for just a few short months

For a few short months this year all three of the best available models - GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro - were freely available to most of the world.

OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June. This was a momentous change, because for the previous year free users had mostly been restricted to GPT-3.5 level models, meaning new users got a very inaccurate mental model of what a capable LLM could actually do.

That era appears to have ended, likely permanently, with OpenAI's launch of ChatGPT Pro. This $200/month subscription service is the only way to access their most capable model, o1 Pro.

Since the trick behind the o1 series (and the future models it will undoubtedly inspire) is to expend more compute time to get better results, I don't think those days of free access to the best available models are likely to return.

"Agents" still haven't really happened yet

I find the term "agents" extremely frustrating. It lacks a single, clear and widely understood meaning... but the people who use the term never seem to acknowledge that.

If you tell me that you are building "agents", you've conveyed almost no information to me at all. Without reading your mind I have no way of telling which of the dozens of possible definitions you are talking about.

The two main categories I see are people who think AI agents are obviously things that go and act on your behalf - the travel agent model - and people who think in terms of LLMs that have been given access to tools which they can run in a loop as part of solving a problem. The term "autonomy" is often thrown into the mix too, again without including a clear definition.

(I also collected 211 definitions on Twitter a few months ago - here they are in Datasette Lite - and had gemini-exp-1206 attempt to summarize them.)
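For what it's worth, here's a minimal sketch of the "tools in a loop" interpretation. The call_llm() stub and the single search tool are placeholders for a real model API and real tools - the point is the loop shape, not the specifics:

def call_llm(messages):
    # Hypothetical stub: a real implementation would call a hosted model here
    # and return either a tool request or a final answer.
    return {"type": "final", "content": "stub answer"}

TOOLS = {"search": lambda query: f"results for {query!r}"}  # hypothetical tool

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply["type"] == "final":
            return reply["content"]
        # The model asked for a tool: run it and feed the result back in.
        result = TOOLS[reply["tool"]](reply["arguments"])
        messages.append({"role": "tool", "content": result})
    return "Gave up after too many steps"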

Whatever the term may mean, agents still have that feeling of perpetually "coming soon".

Terminology aside, I remain skeptical as to their utility based, once again, on the challenge of gullibility. LLMs believe anything you tell them. Any system that attempts to make meaningful decisions on your behalf will run into the same roadblock: how good is a travel agent, or a digital assistant, or even a research tool if it can't distinguish truth from fiction?

Just the other day Google Search was caught serving up an entirely fake description of the non-existent movie "Encanto 2". It turned out to be summarizing an imagined movie listing from a fan fiction wiki.

Prompt injection is a natural consequence of this gullibility. I've seen precious little progress on tackling that problem in 2024, and we've been talking about it since September 2022.

I'm beginning to see the most popular idea of "agents" as dependent on AGI itself. A model that's robust against gullibility is a very tall order indeed.

Evals really matter

Anthropic's Amanda Askell (responsible for much of the work behind Claude's Character):

The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.

It's become abundantly clear over the course of 2024 that writing good automated evals for LLM-powered systems is the skill that's most needed to build useful applications on top of these models. If you have a strong eval suite you can adopt new models faster, iterate better and build more reliable and useful product features than your competition.

Vercel's Malte Ubl:

When @v0 first came out we were paranoid about protecting the prompt with all kinds of pre and post processing complexity.

We completely pivoted to let it rip. A prompt without the evals, models, and especially UX is like getting a broken ASML machine without a manual

I'm still trying to figure out the best patterns for doing this for my own work. Everyone knows that evals are important, but there remains a lack of great guidance for how to best implement them - I'm tracking this under my evals tag. My SVG pelican riding a bicycle benchmark is a pale imitation of what a real eval suite should look like.
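As an illustration of the test-driven approach Amanda describes, here's a minimal sketch using pytest. The ask() stub stands in for however you run a candidate system prompt against a model, and the cases are toy examples:

import pytest

SYSTEM_PROMPT = "You are a terse assistant. Answer in one short sentence."

def ask(system, question):
    # Hypothetical stub: swap in a real model call here.
    return "stub response"

CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

@pytest.mark.parametrize("question,expected", CASES)
def test_system_prompt(question, expected):
    # Change the system prompt until every case passes, then keep the tests.
    assert expected in ask(SYSTEM_PROMPT, question)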

Apple Intelligence is bad, Apple's MLX library is excellent

As a Mac user I've been feeling a lot better about my choice of platform this year.

Last year it felt like my lack of a Linux/Windows machine with an NVIDIA GPU was a huge disadvantage in terms of trying out new models.

On paper, a 64GB Mac should be a great machine for running models due to the way the CPU and GPU can share the same memory. In practice, many models are released as model weights and libraries that reward NVIDIA's CUDA over other platforms.

The llama.cpp ecosystem helped a lot here, but the real breakthrough has been Apple's MLX library, "an array framework for Apple Silicon". It's fantastic.

Apple's mlx-lm Python library supports running a wide range of MLX-compatible models on my Mac, with excellent performance. mlx-community on Hugging Face offers more than 1,000 models that have been converted to the necessary format.
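A minimal sketch of what that looks like - the model name here is one example of a quantized model from the mlx-community collection:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
generate(model, tokenizer, prompt="Write a haiku about pelicans", verbose=True)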

Prince Canuma's excellent, fast moving mlx-vlm project brings vision LLMs to Apple Silicon as well. I used that recently to run Qwen's QvQ.

While MLX is a game changer, Apple's own "Apple Intelligence" features have mostly been a disappointment. I wrote about their initial announcement in June, and I was optimistic that Apple had focused hard on the subset of LLM applications that preserve user privacy and minimize the chance of users getting misled by confusing features.

Now that those features are rolling out they're pretty weak. As an LLM power-user I know what these models are capable of, and Apple's LLM features offer a pale imitation of what a frontier LLM can do. Instead we're getting notification summaries that misrepresent news headlines and writing assistant tools that I've not found useful at all. Genmoji are kind of fun though.

The rise of inference-scaling "reasoning" models

The most interesting development in the final quarter of 2024 was the introduction of a new shape of LLM, exemplified by OpenAI's o1 models - initially released as o1-preview and o1-mini on September 12th.

One way to think about these models is an extension of the chain-of-thought prompting trick, first explored in the May 2022 paper Large Language Models are Zero-Shot Reasoners.

This is that trick where, if you get a model to talk out loud about a problem it's solving, you often get a result which the model would not have achieved otherwise.
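The zero-shot version of the trick from that paper is as simple as appending "Let's think step by step" to the prompt. A toy sketch, with prompt_model() standing in for any real model call:

def prompt_model(text):
    return "stub"  # hypothetical stub for a real model API call

question = "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?"
plain = prompt_model(question)
with_reasoning = prompt_model(question + "\n\nLet's think step by step.")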

o1 takes this process and further bakes it into the model itself. The details are somewhat obfuscated: o1 models spend "reasoning tokens" thinking through the problem that are not directly visible to the user (though the ChatGPT UI shows a summary of them), then outputs a final result.

The biggest innovation here is that it opens up a new way to scale a model: instead of improving model performance purely through additional compute at training time, models can now take on harder problems by spending more compute on inference.

The sequel to o1, o3 (they skipped "o2" for European trademark reasons) was announced on 20th December with an impressive result against the ARC-AGI benchmark, albeit one that likely involved more than $1,000,000 of compute time expense!

o3 is expected to ship in January. I doubt many people have real-world problems that would benefit from that level of compute expenditure - I certainly don't! - but it appears to be a genuine next step in LLM architecture for taking on much harder problems.

OpenAI are not the only game in town here. Google released their first entrant in the category, gemini-2.0-flash-thinking-exp, on December 19th.

Alibaba's Qwen team released their QwQ model on November 28th - under an Apache 2.0 license, and that one I could run on my own machine. They followed that up with a vision reasoning model called QvQ on December 24th, which I also ran locally.

DeepSeek made their DeepSeek-R1-Lite-Preview model available to try out through their chat interface on November 20th.

To understand more about inference scaling I recommend Is AI progress slowing down? by Arvind Narayanan and Sayash Kapoor.

Nothing yet from Anthropic or Meta but I would be very surprised if they don't have their own inference-scaling models in the works. Meta published a relevant paper Training Large Language Models to Reason in a Continuous Latent Space in December.

Was the best currently available LLM trained in China for less than $6m?

Not quite, but almost! It does make for a great attention-grabbing headline.

The big news to end the year was the release of DeepSeek v3 - dropped on Hugging Face on Christmas Day without so much as a README file, then followed by documentation and a paper the day after that.

DeepSeek v3 is a huge 685B parameter model - one of the largest openly licensed models currently available, significantly bigger than the largest of Meta's Llama series, Llama 3.1 405B.

Benchmarks put it up there with Claude 3.5 Sonnet. Vibe benchmarks (aka the Chatbot Arena) currently rank it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models. This is by far the highest ranking openly licensed model.

The really impressive thing about DeepSeek v3 is the training cost. The model was trained on 2,788,000 H800 GPU hours at an estimated cost of $5,576,000. Llama 3.1 405B was trained for 30,840,000 GPU hours - 11x the hours used by DeepSeek v3, for a model that benchmarks slightly worse.
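Those figures imply roughly $2 per H800 GPU hour, and the 11x ratio checks out:

deepseek_hours = 2_788_000
deepseek_cost = 5_576_000
llama_hours = 30_840_000

print(deepseek_cost / deepseek_hours)  # 2.0 dollars per GPU hour
print(llama_hours / deepseek_hours)    # ~11.06x more GPU hours for Llama 3.1 405B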

Those US export regulations on GPUs to China seem to have inspired some very effective training optimizations!

The environmental impact got better

A welcome result of the increased efficiency of the models - both the hosted ones and the ones I can run locally - is that the energy usage and environmental impact of running a prompt has dropped enormously over the past couple of years.

OpenAI themselves are charging 100x less for a prompt compared to the GPT-3 days. I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.

I think this means that, as individual users, we don't need to feel any guilt at all for the energy consumed by the vast majority of our prompts. The impact is likely negligible compared to driving a car down the street or maybe even watching a video on YouTube.

Likewise, training. DeepSeek v3 training for less than $6m is a fantastic sign that training costs can and should continue to drop.

For less efficient models I find it useful to compare their energy usage to commercial flights. The largest Llama 3 model cost about the same as a single digit number of fully loaded passenger flights from New York to London. That's certainly not nothing, but once trained that model can be used by millions of people at no extra training cost.

The environmental impact got much, much worse

The much bigger problem here is the enormous competitive buildout of the infrastructure that is imagined to be necessary for these models in the future.

Companies like Google, Meta, Microsoft and Amazon are all spending billions of dollars rolling out new datacenters, with a very material impact on the electricity grid and the environment. There's even talk of spinning up new nuclear power stations, but those can take decades.

Is this infrastructure necessary? DeepSeek v3's $6m training cost and the continued crash in LLM prices might hint that it's not. But would you want to be the big tech executive that argued NOT to build out this infrastructure only to be proven wrong in a few years' time?

An interesting point of comparison here could be the way railways rolled out around the world in the 1800s. Constructing these required enormous investments and had a massive environmental impact, and many of the lines that were built turned out to be unnecessary - sometimes multiple lines from different companies serving the exact same routes!

The resulting bubbles contributed to several financial crashes, see Wikipedia for Panic of 1873, Panic of 1893, Panic of 1901 and the UK's Railway Mania. They left us with a lot of useful infrastructure and a great deal of bankruptcies and environmental damage.

The year of slop

2024 was the year that the word "slop" became a term of art. I wrote about this in May, expanding on this tweet by @deepfates:

Watching in real time as “slop” becomes a term of art. the way that “spam” became the term for unwanted emails, “slop” is going in the dictionary as the term for unwanted AI generated content

I expanded that definition a tiny bit to this:

Slop describes AI-generated content that is both unrequested and unreviewed.

I ended up getting quoted talking about slop in both the Guardian and the NY Times. Here's what I said in the NY Times:

Society needs concise ways to talk about modern A.I. — both the positives and the negatives. ‘Ignore that email, it’s spam,’ and ‘Ignore that article, it’s slop,’ are both useful lessons.

I love the term "slop" because it so succinctly captures one of the ways we should not be using generative AI!

Slop was even in the running for Oxford Word of the Year 2024, but it lost to brain rot.

Synthetic training data works great

An idea that surprisingly seems to have stuck in the public consciousness is that of "model collapse". This was first described in the paper The Curse of Recursion: Training on Generated Data Makes Models Forget in May 2023, and repeated in Nature in July 2024 with the more eye-catching headline AI models collapse when trained on recursively generated data.

The idea is seductive: as the internet floods with AI-generated slop the models themselves will degenerate, feeding on their own output in a way that leads to their inevitable demise!

That's clearly not happening. Instead, we are seeing AI labs increasingly train on synthetic content - deliberately creating artificial data to help steer their models in the right way.

One of the best descriptions I've seen of this comes from the Phi-4 technical report, which included this:

Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.

Structured and Gradual Learning. In organic datasets, the relationship between tokens is often complex and indirect. Many reasoning steps may be required to connect the current token to the next, making it challenging for the model to learn effectively from next-token prediction. By contrast, each token generated by a language model is by definition predicted by the preceding tokens, making it easier for a model to follow the resulting reasoning patterns.

Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives - a trick used by an increasing number of labs. DeepSeek v3 used "reasoning" data created by DeepSeek-R1. Meta's Llama 3.3 70B fine-tuning used over 25M synthetically generated examples.
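A minimal sketch of that larger-model-teaches-smaller-model pattern - the teacher() stub, the seed prompts and the output file name are all made up for illustration:

import json

def teacher(prompt):
    # Hypothetical stub: a real implementation would call a large "teacher" model here.
    return "stub completion"

seed_prompts = [
    "Explain CORS to a junior developer in one paragraph.",
    "Summarize how merge sort works.",
]

with open("synthetic.jsonl", "w") as f:
    for prompt in seed_prompts:
        # Each line becomes one training example for the smaller model.
        f.write(json.dumps({"prompt": prompt, "completion": teacher(prompt)}) + "\n")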

Careful design of the training data that goes into an LLM appears to be the entire game for creating these models. The days of just grabbing a full scrape of the web and indiscriminately dumping it into a training run are long gone.

LLMs somehow got even harder to use

A drum I've been banging for a while is that LLMs are power-user tools - they're chainsaws disguised as kitchen knives. They look deceptively simple to use - how hard can it be to type messages to a chatbot? - but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

If anything, this problem got worse in 2024.

We've built computer systems you can talk to in human language, that will answer your questions and usually get them right! ... depending on the question, and how you ask it, and whether it's accurately reflected in the undocumented and secret training set.

The number of available systems has exploded. Different systems have different tools they can apply to your problems - like Python and JavaScript and web search and image generation and maybe even database lookups... so you'd better understand what those tools are, what they can do and how to tell if the LLM used them or not.

Did you know ChatGPT has two entirely different ways of running Python now?

Want to build a Claude Artifact that talks to an external API? You'd better understand CSP and CORS HTTP headers first.
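The CORS side of that is at least easy to demonstrate: the external API has to opt in by sending an Access-Control-Allow-Origin header, otherwise the browser will block the Artifact's fetch() calls. A minimal sketch of a permissive local server (the host and port are arbitrary):

from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Allow any browser-based tool (such as a Claude Artifact) to call this server
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

HTTPServer(("localhost", 8000), CORSRequestHandler).serve_forever()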

The models may have got more capable, but most of the limitations remained the same. OpenAI's o1 may finally be able to (mostly) count the Rs in strawberry, but its abilities are still limited by its nature as an LLM and the constraints placed on it by the harness it's running in. o1 can't run web searches or use Code Interpreter, but GPT-4o can - both in that same ChatGPT UI. (o1 will pretend to do those things if you ask it to, a regression to the URL hallucinations bug from early 2023).

What are we doing about this? Not much. Most users are thrown in at the deep end. The default LLM chat UI is like taking brand new computer users, dropping them into a Linux terminal and expecting them to figure it all out.

Meanwhile, it's increasingly common for end users to develop wildly inaccurate mental models of how these things work and what they are capable of. I've seen so many examples of people trying to win an argument with a screenshot from ChatGPT - an inherently ludicrous proposition, given the inherent unreliability of these models crossed with the fact that you can get them to say anything if you prompt them right.

There's a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can't see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

There is so much space for helpful education content here, but we need to do a lot better than outsourcing it all to AI grifters with bombastic Twitter threads.

Knowledge is incredibly unevenly distributed

Most people have heard of ChatGPT by now. How many have heard of Claude?

The knowledge gap between the people who actively follow this stuff and the 99% of the population who do not is vast.

The pace of change doesn't help either. In just the past month we've seen general availability of live interfaces where you can point your phone's camera at something and talk about it with your voice... and optionally have it pretend to be Santa. Most self-certified nerds haven't even tried that yet.

Given the ongoing (and potential) impact on society that this technology has, I don't think the size of this gap is healthy. I'd like to see a lot more effort put into improving this.

LLMs need better criticism

A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that "LLMs are useful" can be enough to kick off a huge fight.

I get it. There are plenty of reasons to dislike this technology - the environmental impact, the (lack of) ethics of the training data, the lack of reliability, the negative applications, the potential impact on people's jobs.

LLMs absolutely warrant criticism. We need to be talking through these problems, finding ways to mitigate them and helping people learn how to use these tools responsibly in ways where the positive applications outweigh the negative.

I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.

If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.

(If you still don't think there are any good applications at all I'm not sure why you made it to this point in the article!)

I think telling people that this whole field is environmentally catastrophic plagiarism machines that constantly make things up is doing those people a disservice, no matter how much truth that represents. There is genuine value to be had here, but getting to that value is unintuitive and needs guidance.

Those of us who understand this stuff have a duty to help everyone else figure it out.

Everything tagged "llms" on my blog in 2024

Because I undoubtedly missed a whole bunch of things, here's every long-form post I wrote in 2024 that I tagged with llms:

(This list generated using Django SQL Dashboard with a SQL query written for me by Claude.)

Tags: google, ai, openai, generative-ai, llms, anthropic, gemini, meta, inference-scaling
