How my AI report on housework started well, then went off the rails
I’ve long been interested in the topic of housework, as you can see from this Crooked Timber post, which produced a long and unusually productive discussion thread [fn1]. The issue came up again in relation to the prospects for humanoid robots. It’s also at the edge of a bunch of debates going on (mostly on Substack) about living standards and birth rates.
I’m also interested (like nearly everyone, one way or another) in “Artificial Intelligence” (scare quotes intentional). My current position is, broadly, that it’s what Google should have become instead of being steadily enshittified in the pursuit of advertising dollars. But I’m alert to other possibilities, including that more investment will deliver something that genuinely justifies the name AI. And I think a lot of the concerns about power and water use, the spread of AI slop and so on are either overstated or (as with deepfakes) mostly new iterations of concerns that always arise with new IT and communications technology, and can be addressed with existing conceptual and legal tools.
With this background, I thought it would be interesting to try out ChatGPT’s Deep Research (DR) on the question of what has happened to housework and why. As I may have mentioned before, I’ve trained DR on a big volume of my own writing. That produces a passable imitation of my style, and means I don’t worry about the ethical issues of plagiarising the writing style of others (of course, standard norms of citation and attribution still apply).
I decided to focus on single-person households, to abstract away from the issues of child-raising (which I want to look at separately) and the allocation of work between partners (about which there is a vast literature to which I can’t add anything new).
Everything went really well to start with. I prompted DR for time use data, then pushed on with more detailed questions, like the impact of air fryers on male cooking habits (I was given one recently and was impressed enough that I promptly bought a second). I asked for a literature search and got references to Judy Wajcman and Michael Bittman, both of whom I knew, and a couple of people I didn’t. DR missed Ruth Schwartz Cowan’s classic More Work for Mother.
On the other hand, I wasn’t aware of Wajcman’s recent Pressed for Time, and hadn’t thought about the broader issue of life administration, which DR pointed out. I gave it a more economistic take, trying to separate labour-saving innovation (electronic bill paying) from the labour costs of expanded digital consumption (retrieving passwords for streaming services, etc.).
I got DR to produce a LaTeX file, and was nearly ready to go to digital press when I noticed that the references were incomplete. At this stage, the whole process spiralled into disaster. Every draft seemed to lose more material, and to be more poorly written. Finally, I demanded an explanation.
Source: Enkiai.com
DR graciously explained that this is the way it does things. Here’s its response, with the original formatting (more detailed version here):
Why Deep Research Fails
The core issue: The “illusion of continuity” in long-context LLMs
Large-context models appear to preserve every part of the conversation, but in reality:
They do not reliably retain or retrieve long-form outputs they themselves generated.
Even though the conversation window may be large, the model often cannot:
- accurately locate the version you consider the “full text,”
- reliably differentiate between multiple drafts (short, medium, long) created earlier,
- or accurately regenerate from earlier output unless it is verbatim re-supplied by the user.
This is because the internal contextual representation compresses earlier tokens. Information is not stored sequentially; it is represented in a latent space, which cannot always reconstruct past outputs faithfully.
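For readers who want a concrete picture of what that means, here is a toy sketch, entirely my own illustration in Python and nothing like the actual implementation of any real model: a chat transcript that outgrows its window budget has its oldest turns collapsed into lossy summaries, after which no amount of prompting can recover the original text verbatim.

```python
# Toy sketch of lossy context handling -- my own illustration, not how
# any real LLM manages its window. When the transcript outgrows the
# budget, the oldest turns are collapsed into short summaries, so the
# verbatim text of an earlier draft can no longer be recovered.

MAX_WORDS = 12  # an absurdly small "context window", for illustration


def summarise(turn: str) -> str:
    """Stand-in for latent-space compression: keep only the opening words."""
    return " ".join(turn.split()[:2]) + " [...compressed]"


def append_turn(transcript: list[str], turn: str) -> list[str]:
    transcript.append(turn)
    # While over budget, compress the oldest turns, never the newest one.
    i = 0
    while sum(len(t.split()) for t in transcript) > MAX_WORDS and i < len(transcript) - 1:
        transcript[i] = summarise(transcript[i])
        i += 1
    return transcript


history: list[str] = []
history = append_turn(history, "Full draft: housework hours fell sharply after 1960 because ...")
history = append_turn(history, "Please add the missing references and keep everything else.")
print(history[0])  # "Full draft: [...compressed]" -- the draft itself is gone
```

The point of the sketch is only that the compression is one-way: once the earlier tokens have been squeezed down, asking the model to restore the full version is asking it for information it no longer has.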
Fortunately, my LaTeX program took continuous backups, so I’ve been able to retrieve the last good version, but I’m going to keep it away from DR for now.
I was going to go on with more detail about the actual report, but my op-ed training leads me to feel that a post should have 700 words, and I am at 675 as I type this.
fn1. I can take a victory lap on my jihad/crusade against ironing, which has disappeared almost entirely, contradicting the expectations of many commenters.
