The thing this really drives home for me is how Apple is totally asleep at the wheel.
Today I asked Siri “call the last person that texted me”, to try and respond to someone while driving.
Am I surprised it couldn’t do it? Not really at this point, but it is disappointing that there’s such a wide gulf between Siri and even the least capable LLMs.
Siri popped up and suggested I set a 7-minute timer yesterday evening. I think I did that a few times during the week for cooking or something. This is a pretty stupid suggestion; if I needed it, I would do it myself.
I've been kicking around the idea for a similar open source project, with the caveats that:
1. I'd like the backend to be configured for any LLM the user might happen to have access to (be that the API for a paid service or something locally hosted on-prem).
2. I'm also wondering how feasible it is to hook it up to a touchscreen running on some hopped-up Raspberry Pi platform so that it can be interacted with like an Alexa device or any of the similar offerings from other companies. Ideally, that means voice controls as well, which are potentially another technical problem (OpenAI's API will accept an audio file, but for most other services you'd have to do voice-to-text before sending the prompt off to the API; see the sketch after this list).
3. I'd like to make the integrations extensible. Calendar, weather, but maybe also homebridge, spotify, etc. I'm wondering if MCP servers are the right avenue for that.
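For the voice input in point 2, a minimal sketch of the transcription step, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment (the file name is made up):

```python
# Minimal sketch: transcribe a recorded audio file with the hosted Whisper model,
# then hand the text to whatever LLM backend is configured. Assumes `pip install openai`
# and OPENAI_API_KEY in the environment; "note.m4a" is a placeholder.
from openai import OpenAI

client = OpenAI()

def transcribe(path: str) -> str:
    """Send a recorded audio file to Whisper and return plain text."""
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text

if __name__ == "__main__":
    prompt_text = transcribe("note.m4a")
    print(prompt_text)  # this text would then go to the configured LLM backend
```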
I don't have the bandwidth to commit a lot of time to a project like this right now, but if anyone else is charting in this direction I'd love to participate.
It runs locally, but it uses API keys for various LLMs. Currently I much prefer QwQ-32B hosted at Groq. Very fast, pretty smart. Various tools use various LLMs. It can currently generate 3 types of documents I need in my daily work (work reports, invoices, regulatory time-sheets).
It has weather integration. It can parse invoices and generate QR codes for easy mobile banking payments. It can work with my calendars.
Next I plan to do the email integration. But I want to do it properly. This means locally synchronized, indexable IMAP mail. It might evolve into an actually usable desktop email client (the existing ones are all awful). We'll see...
I keep hearing about it but never got around to checking it out; the name suggests it may be a waste of time. Maybe it's a fantastic project, but the name lets it down?
You are on Hacker News, typing on an Apple device, listening to Daft Punk, reading an article about Stevens, the AI butler hosted on Val Town; the comment chain you're replying to talks about using self-hosted models (probably Llama) and a Raspberry Pi, yet SillyTavern is the name that trips you up?
Having multiple backends can be a good approach, with various LLMs for different specialized tasks. I haven't tried it yet but WilmerAI is an option for routing your inputs to the appropriate LLM, works well with SillyTavern.
I also want an OSS framework that lets me extend it with my own scripting/modules, and is focused around being an assistant for me and my family. There's a shared set of features (memory storage/retrieval, integrations to chat/email/etc interfaces, syncing to calendar/notion/etc, notifications) that should be put into an OSS framework that would be really powerful.
I also don't have time to run such a thing but would be up for helping and giving money for it. I'm working on other things including a local-first decentralized database/object store that could be used as storage, similar to OrbitDB, though it's not yet usable.
Mostly I've just been unhappy with having access to either a heavily constrained chat interface or having to create my own full Agent framework like the OP did.
Lately I have been experimenting with ways to work around the "context token sweet spot" of <20k tokens (or <50k with 2.5). Essentially doing manual "context compression", where the LLM works with a database to store things permanently according to a strict schema, summarizes its current context when it starts to get out of the sweet spot (I'm mixed on whether it is best to do this continuously like a journal, or in retrospect like a closing summary), and then passes this to a new instance with fresh context.
This works really effectively with thinking models, because the thinking eats up tons of context, but also produces very good "summary documents". So you can kind of reap the rewards of thinking without having to sacrifice that juicy sub 50k context. The database also provides a form of fallback, or RAG I suppose, for situations where the summary leaves out important details, but the model must also recognize this and go pull context from the DB.
Right now I have been trying it to make essentially an inventory management/BOM optimization agent for a database of ~10k distinct parts/materials.
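A minimal sketch of that compression loop, with `call_llm` as a stand-in for whatever model call you use and an invented `summaries` schema:

```python
# Sketch of manual "context compression": when the running transcript exceeds a token
# budget, ask the model for a closing summary, persist it, and seed a fresh context with
# it. The ~4-chars-per-token heuristic is crude and the schema is invented.
import sqlite3

TOKEN_BUDGET = 20_000  # the "sweet spot" ceiling mentioned above

def rough_tokens(text: str) -> int:
    return len(text) // 4  # very rough: ~4 characters per token

def compress_if_needed(history: list[str], db: sqlite3.Connection, call_llm) -> list[str]:
    """Return the context to carry forward, summarizing and persisting it if it grew too big."""
    if rough_tokens("\n".join(history)) < TOKEN_BUDGET:
        return history
    summary = call_llm(
        "Write a closing summary of this working context so a fresh instance can continue:\n"
        + "\n".join(history)
    )
    db.execute("CREATE TABLE IF NOT EXISTS summaries (content TEXT)")
    db.execute("INSERT INTO summaries (content) VALUES (?)", (summary,))
    db.commit()
    return [f"Summary of prior work: {summary}"]  # the fresh instance starts from this
```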
I am excitedly waiting for the first company (guessing/hoping it'll be Anthropic) to invest heavily in improvements to caching.
The big ones that come to mind are cheap long-term caching, and innovations in compaction, differential stuff - like, is there a way to only use the parts of the cached input context we need?
Isn’t a problem there that a cache would be model specific, where the cached items are only valid for exactly the same weights and inference engine? I think those are both heavily iterated on.
Prompt caches right now only last a few minutes - I believe they involve keeping a bunch of calculations in-memory, hence why for Gemini and Anthropic you get charged an initial fee for using the feature (to populate the cache), but then get a discount on prompts that use that cache.
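For reference, Anthropic's current caching API marks a stable prefix with `cache_control`; a sketch (the model name and context file are illustrative):

```python
# Sketch of Anthropic-style prompt caching: mark a large, stable prefix as cacheable,
# pay a one-time cache write, then get discounted reads while the cache is alive.
# Assumes the `anthropic` SDK and ANTHROPIC_API_KEY; file and model name are placeholders.
import anthropic

client = anthropic.Anthropic()
big_stable_context = open("household_notebook.txt").read()  # the part worth caching

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,
    system=[{
        "type": "text",
        "text": big_stable_context,
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    }],
    messages=[{"role": "user", "content": "What's on the family calendar tomorrow?"}],
)
print(response.content[0].text)
```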
for memories (still not shown in this tutorial) I have created a pantry [0]
and a servlet for it [1] and I modified the prompt so that it would first check if a conversation existed with the given chat id, and store the result there.
The cool thing is that you can add any servlets on the registry and make your bot as capable as you want.
This is fun! I think this sort of tooling is going to be very fertile ground for hackers over the next few years.
Large swathes of the stack are commoditized OSS plumbing, and hosted inference is already cheap and easy.
There are obvious security issues with plugging an agent into your email and calendar, but I think many will find it preferable to control the whole stack rather than ceding control to Apple or Google.
1. How did he tell Claude to “update” based on the notebook entries?
2. Won't he eventually run out of context window?
3. Won’t this be expensive when using hosted solutions? For just personal hacking, why not simply use ollama + your favorite model?
4. If one were to build this locally, can Vector DB similarity search or a hybrid combined with fulltext search be used to achieve this?
I can totally imagine using pgai for the notebook logs feature and local ollama + deepseek for the inference.
The email idea mentioned by other commenters is brilliant. But I don't think you need a new mailbox; just pull from Gmail and grep for messages where the sender and receiver are both yourself (aka the self tag).
Thank you for sharing. OP's project is something I have been thinking about for a few months now.
The "memories" table has a date column which is used to record the data when the information is relevant. The prompt can then be fed just information for today and the next few days - which will always be tiny.
It's possible to save "memories" that are always included in the prompt, but even those will add up to not a lot of tokens over time.
> Won’t this be expensive when using hosted solutions?
You may be under-estimating how absurdly cheap hosted LLMs are these days. Most prompts against most models cost a fraction of a single cent, even for tens of thousands of tokens. Play around with my LLM pricing calculator for an illustration of that: https://tools.simonwillison.net/llm-prices
> If one were to build this locally, can Vector DB similarity search or a hybrid combined with fulltext search be used to achieve this?
Geoffrey's design is so simple it doesn't even need search - all it does is dump in context that's been stamped with a date, and there are so few tokens there's no need for FTS or vector search. If you wanted to build something more sophisticated you could absolutely use those. SQLite has surprisingly capable FTS built in and there are extensions like https://github.com/asg017/sqlite-vec for doing things with vectors.
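If you did eventually want search, a minimal FTS5 sketch (assuming your SQLite build ships FTS5, which standard Python builds generally do; the contents are illustrative):

```python
# Full-text search over memories with SQLite's built-in FTS5; sqlite-vec would be the
# analogous route for vector search.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory_fts USING fts5(text)")
db.execute("INSERT INTO memory_fts (text) VALUES (?)",
           ("Kids have a dentist appointment on Friday morning",))

for (hit,) in db.execute("SELECT text FROM memory_fts WHERE memory_fts MATCH ?", ("dentist",)):
    print(hit)
```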
> It’s rudimentary, but already more useful to me than Siri!
For me, that is an extremely low barrier to cross.
I find Siri useful for exactly two things at the moment: setting timers and calling people while I am driving.
For these two things it is really useful, but even in these niches, when it comes to calling people, despite it having been around me for years now, it insists on stupid things like telling me there is no Theresa in my contacts when I ask it to call Therese.
That said, what I really want is a reliable system I can trust with calendar access and that I can discuss things with, ideally voice-based.
I went through this weird experience with Cortana on WP7, where I found it incredibly useful to begin with, and then over time it got worse. It seemed like it was created by some incredibly talented engineers. I used it to make calls while driving, set the GPS, and search for information while I drove. But over time, it seemed to change behaviour and started ignoring my commands, and when it did accept them, it seemed to refer me to paid advertisers. And considering Bing wasn't even as popular ten years ago as it is now, a paid advertiser could be 100 km away.
Which I think is a path that people haven't considered with LLMs. We are expecting them to get better forever, but once we start using them, their legs will be cut out from under them to force them to feed us advertising.
I've had the same issues of decay. I used to be able to say "call Mom" but now it will call some kid's mom who I have in Contacts as "[some kid's] mom". What is the underlying architecture that simple heuristic things like this can get worse? Are they gradually slipping in AI?
Very cool. I’m wondering if you’ve thought about memory pruning or summarization as usage grows?
What do you think of this: instead of just deleting old entries, you could either do LRU (I guess Claude can help with it), or you could summarize the responses and store the summary back into the same table — kind of like memory consolidation. That way raw data fades, but a compressed version sticks around. Might be a nice way to keep memory lightweight while preserving context.
Hmm, there's supposed to be a Tasks [reminders] feature in ChatGPT, but it's in beta (I don't have access to it). Whenever it gets released, you could make some kind of "router" that connects to different communication methods and connect that up to ChatGPT statefully, and you could just "speak"/type to ChatGPT from anywhere, and it would send you reminders. No need for all the extra logic, cron jobs, or SQLite table (ChatGPT has memory across chats).
I have built something similar that runs without a server. It required just a few lines in Apple shortcuts.
TL;DR I made shortcuts that work on my Apple watch directly to record my voice, transcribe it and store my daily logs on a Notion DB.
All you need are 1) a chatgpt API key and 2) a Notion account (free).
- I made one shortcut in my iPhone to record my voice, use whisper model to transcribe it (done locally using a POST request) and send this transcription to my Notion database (again a POST request on shortcuts)
- I made another shortcut that records my voice, transcribes and reads data from my Notion database to answer questions based on what exists in it. It puts all data from db into the context to answer -- costs a lot but simple and works well.
The best part is -- this workflow works without my iPhone and directly on my Apple Watch. It uses POST requests internally, so there's no need to host a server. And the Notion API happens to be free for this kind of use case.
I like logging my day to day activities with just using Siri on my watch and possibly getting insights based on them. Honestly the whisper model is what makes it work because the accuracy is miles ahead of the local transcription model.
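For anyone who wants to replicate the Notion half outside of Shortcuts, a sketch of the page-creation POST; the database ID and the "Name"/"Date" property names are placeholders that depend on how your Notion database is set up:

```python
# Append a transcribed log entry to a Notion database via the public API.
# Token, database ID, and property names are hypothetical placeholders.
from datetime import date
import requests

NOTION_TOKEN = "secret_..."          # integration token (placeholder)
DATABASE_ID = "your-database-id"     # placeholder

def log_to_notion(transcription: str) -> None:
    requests.post(
        "https://api.notion.com/v1/pages",
        headers={
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
        json={
            "parent": {"database_id": DATABASE_ID},
            "properties": {
                "Name": {"title": [{"text": {"content": transcription[:80]}}]},
                "Date": {"date": {"start": date.today().isoformat()}},
            },
        },
        timeout=30,
    ).raise_for_status()

log_to_notion("Went for a run, then spent the afternoon debugging the butler bot.")
```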
I'll plan to do it at some point -- at this moment I have hardcoded my credentials into the shortcut so it's a bit hard to share without tweaking. I didn't bother detailing it because its sort of simple. I think the idea is key here and anyone with a few hours to kill can get something working.
On second thought -- Apple Shortcuts is really brittle. It breaks in non-obvious ways, and a lot can only be learned by trial and error lol
Curious, how come you decided to use a cloud solution instead of hosting this on a home server? I've recently bought a mini PC for small projects like this and have been loving being able to host with no cost associated with it. Albeit it's probably still incredibly cheap to use an IaaS or PaaS, but it's still a barrier to entry for random projects I want to work on over a weekend.
I'd use a hosted platform for this kind of thing myself, because then there's less for me to have to worry about. I have dozens of little systems running in GitHub Actions right now just to save me from having to maintain a machine with a crontab.
A single cloudflare durable object (sqlite db + serverless compute + cron triggers) would be enough to run this project. DOs have been added to CFs free tier recently - you could probably run a couple hundred (maybe thousands) instances of Stevens without paying a cent, aside from Claude costs ofc
Home server AI is orders of magnitude more costly than heavily subsidized cloud based ones for this use case unless you run toy models that might hallucinate meetings.
edit: I now realize you're talking about the non-ai related functionality.
For "memory", I wonder how it would be if you use vector search in SQLite and pass that info to reduce context size. The ValTown SQLite should have support for vectors API - https://docs.turso.tech/features/ai-and-embeddings#vectors
I've been using my own telegram -> AI bot and it's very interesting to see what others do with a similar interface.
I have not thought about adding memory log of all current things and feeding it into the context I'll try it out.
Mine is a simple stateless thing that captures messages, voice memos and creates task entries in my org mode file with actionable items. I only feed current date to the context.
It's pretty amusing to see how it sometimes adds a little bit of its own personality to simple tasks; for example, if one of my tasks is phrased as a question, it will often try to answer the question in the task description.
I like the idea of parsing USPS Informed Delivery emails (a lot of people I encounter still don't know that this service exists). Maybe I'll make something to alert me when my checks are finally arriving!
This part was galling to me; somewhere in the USPS, the data about what mailpieces/packages are arriving soon exists in a very concise form, and they templatize an email and send it to me, after which I can parse the email with simple+brittle regexes or forward the emails to a relatively (environmentally) expensive LLM... but if they'd made the information available with an API or RSS feed, or attached the JSON payload to the email in the first place, I could get away without parsing.
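The "simple+brittle regexes" route is genuinely only a few lines; this snippet and pattern are invented, not Informed Delivery's real template:

```python
# Illustrative only: a brittle regex pass over an Informed Delivery-style notification.
# The email body and patterns below are made up; the real template will differ.
import re

email_body = """You have 2 mailpieces arriving soon.
From: ACME INSURANCE
From: COUNTY TAX OFFICE
1 package arriving today."""

senders = re.findall(r"^From:\s*(.+)$", email_body, flags=re.MULTILINE)
packages = re.search(r"(\d+)\s+package", email_body)

print("Mail from:", senders)
print("Packages today:", int(packages.group(1)) if packages else 0)
```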
It would indeed be nice to have a recipient/consumer-side API!
I don't think it'll ever happen. Really the only valid use-case would be for people to hack together something for themselves (like we are discussing)... They don't want to allow developers to create applications on top of this as a 3rd party, as informed delivery itself has to carefully navigate privacy laws and it could be disastrous.
Great example of simple engineering: one SQLite table, no fancy stacks or over-engineered solutions. I appreciate the focus on practical, immediate utility over chasing theoretical perfection. This is a refreshing reminder that building for actual use cases yields better results than architectural purity.
It reminds me of "Generative AI is just a phase. What’s next is interactive AI."
The more I think about it, however, command line applications are about as interactive as a program can be.
Let's say one is interested in finding out which of the next 7 days it's going to rain, and sending the result to a Telegram bot: `weather_forecast --days 7 | grep rain | send_telegram`.
The whole thing of off-loading everything to nondeterministic computation instead of the good ol determinism does seem strange to me. I am a huge fan though, of using non-deterministic computation for creating deterministic computation, i.e. programming.
/As a side note, I have played chess against Ishiguro many times on lichess.
I argue that these kinds of tools are fun to play with, but in the end, are they really helpful? I start my day the same way every day, and at work I just check the calendar. My private calendar has all the information I need. Where is the gap where an assistant makes sense, and where are we just complicating our lives?
Personally, this appears to be extremely helpful for me, because instead of checking several different spots every day, I can get a coherent summary in one spot, tailored to me and my family. I'm literally checking the same things every day, down to USPS Informed Delivery. This seems to simplify what's already complicated, at least for my use cases.
Is this niche? I don't know and I don't care. It looks useful to me. And the author, obviously, because they wrote it. That's enough.
I can't count the number of useful scripts and apps I've written that nobody else has used, yet I rely on them daily or nearly every day.
I'm a little confused as to the 16-bit game interface shown in the article. Is that just for illustration purposes in the article itself, or is there an actual UI you've built to represent Steven/Steven's world?
This is probably naive, and I'm looking forward to a correction: isn't sending your info to Claude's API (or really any "AI API") a violation of your data privacy?
Only if you don't believe the AI vendors when they promise that they won't train on your data.
(Or you don't trust them not to have security breaches that grant attackers access to logged data, which remains a genuine threat, albeit one that's true of any other cloud service.)
Correct. My dusty Intel NUC is able to run a decent 3B model (thanks to Ollama) with fans spinning, but it does not affect any other running applications. It is very useful for local hobby projects. Visible lags and freezes begin if I start a 5B+ model locally.
Love it, such a nice idea coupled with a flawless execution. I think the future of AI looks a lot more like this than the half-cooked agent implementations that plague LinkedIn…
The background tasks can call MCP servers to connect to more data sources and services. At least you don't have to write all the connectors yourself.
What makes you think sending data to the Claude API is a breach of privacy? Do you not trust them when they say they won't look at or train on your data?
I've also been following Anthropic pretty closely for the last two years and I've seen no evidence that they would break their principles here and plenty of evidence of how far they go to respect the privacy of their users: https://simonwillison.net/2024/Dec/12/clio/
But how is that a personal advantage? Who are you in competition with, yourself? Maybe I'm parsing competitive advantage in a different context than you mean.
I guess I'm competing against other humans at living a fulfilling, enjoyable life?
I don't take that competitive advantage particularly seriously, which is why I invest so much effort giving away what I've learned along the way for free.
Having some experience with weaker models, you need at least 1.5B-3B to see proper prompt adherence, fewer hallucinations, and better memory.
Also, models have subtle differences. For example, I found Qwen2.5:0.5B to be more obedient (prompt-respecting) and smart compared to Llama3.2:1B. Gemma3:1B seems to be more efficient but, despite heavy prompting, tends to be verbose and fails at formatted responses by injecting some odd emoji or remark before/after the desired output.
In summary, Qwen2.5:1.5B and Llama3.2:3B were the weakest models that were still useful, and they also include tool support (Gemma does not understand tools yet).
"It’s very useful for personal AI tools to have access to broader context from other information sources."
How? This post shows nothing of the sort.
"I’ve written before about how the endgame for AI-driven personal software isn’t more app silos, it’s small tools operating on a shared pool of context about our lives."
Yes, probably, so now is the time to resist and refuse to open ourselves up to unprecedented degrees of vulnerability towards the state and corporations. Doing it voluntarily while it is still rather cheap is a bad idea.
I don't know if I love this more for the sheer usefulness, or for the delightful over-the-top "Proper English Butler" diction.
But what really has my attention is: Why is this something I'm reading about on this smart engineer's blog rather than an Apple or Google product release? The fact that even this small set of features is beyond the abilities of either of those two companies to ship -- even with caveats like "Must also use our walled garden ecosystem for email, calendars, phones, etc" -- is an embarrassment, only obscured by the two companies' shared lack of ambition to apply "AI" technology to the 'solved problem' areas that amount to various kinds of summarization and question-answering.
If ever there was a chance to threaten either half of this lumbering, anticompetitive duopoly, certainly it's related to AI.
There’s actually a good answer to this, namely that narrowly targeting the needs of exactly one family allows you to develop software about 1000x faster. This is an argument in favor of personal software.
The Apple walled garden argues against you here. There are at least 20 million families in America where this holds true:
• Everyone in household uses an iPhone
• Main adult family members use iCloud Mail or at least use Apple Mail to read other mail
• Family members use iCloud contacts and calendars
• USPS Informed Delivery could be used (available to most/all US addresses)
• It can be ascertained what ZIP code you're in, for weather.
I think that's the full list of requirements this thing would have. So what's standing in their way?
Not every one of those families would find the same set of features helpful, so you have to make calls about what's worth developing and what isn't. Making those calls is very difficult because it's tricky to gather data about what will be used and appreciated.
> it's tricky to gather data about what will be used and appreciated
I'm not sure it's so tricky for Apple, and for sure not for Google.
Well, it's a good thing we've settled on... genmoji. Yeah.
And whatever they are doing is clearly working well. /s
The thing that is standing in their way is probably that nobody is willing to pay for this what it costs to run.
Doesn't look very expensive to me. An LLM capable of this level of summarization can run in ~12GB of GPU-connected RAM, and only needs that while it's running a prompt.
The cheapest small LLMs (GPT-4.1 Nano, Google Gemini 1.5 Flash 8B) cost less than 1/100th of a cent per prompt because they are cheap to run.
Isn't that how good product development should look for Apple/Google, though?
Find something useful for one family, see if more families find it useful as well. If so, scale to platform level.
yes which vibe coding enables.
True, I always thought something like Hypercard was needed to bring personal programming to the masses, but it appears that it might require LLM coding instead. ("I wish an app that did simple task XYZ existed."; "Can you ask ChatGPT to make that for you?")
This is literally in the first chapter of Mythical Man-Month:
> One occasionally reads newspaper accounts of how two programmers in a remodeled garage have built an important program that surpasses the best efforts of large teams. And every programmer is prepared to believe such tales, for he knows that he could build any program much faster than the 1000 statements/year reported for industrial teams.
> Why then have not all industrial programming teams been replaced by dedicated garage duos? One must look at what is being produced.
One reason might be that personal data going into a database handled by highly experimental software might be a non-issue for this dev, but it is a serious risk for Google, Apple, etc.
The reason Google and Apple stopped innovating is simply because they make too much money from their current products and see every innovation primarily as a risk to their existing business. This is something that happens all the time to market leaders.
Take a look at Home Assistant - I would argue their implementation is currently better than both the Siri & Gemini assistants.
The HA team is releasing actually useful updates every month - e.g. the ability for the assistant to proactively ask you something.
In my opinion both Google & Apple have huge issues with cooperation between product teams, while cooperation with external companies is next to impossible.
Because how would you monetize this? Would Google or Apple make a product that talks to Telegram, or anything with an open ecosystem?
All the big guys are trying to do is suck the eggs out of their geese faster.
It’s because this story hints at the concept of “Unmetered AI”. It can be easily hosted locally and run with a self-hosted LLM.
Wonder if Edison mentioned Nikola Tesla much in his writings?
This made me think: what if my little utility assistant program that I have, similar to your Stevens, had access to a mailbox?
I've got a little utility program that I can tell to get the weather or run common commands unique to my system. It's handy, and I can even cron it to run things regularly, if I'd like.
If it had its own email box, I could send it information, it could use AI to parse that info, and possibly send email back, or a new message. Now I've got something really useful. It would parse the email, add it to whatever internal store it has, and delete the message, without screwing up my own email box.
Thanks for the insight.
I’ve been thinking lately that email is a good interface for certain modes of AI assistant interaction, namely “research” tasks that are asynchronous and take a relatively long time. Email is universal, asynchronous, uses open standards, supports structured metadata, etc.
This is how I initially pitched an AI assistant in my last shop.
It is a lot cheaper to leverage existing user interfaces & tools (i.e., Outlook) than it is to build new UIs and then train users on them.
Also an email that comes back a minute later feels fast. A chat that types at the same speed feels slow.
I'm building something similar. See my comment to the OP above:
https://threadwise.app
If you want to get ahead of the curve, look into the Agent-to-Agent protocol Google just introduced. I'm currently using my own custom AI agent assistant to perform life tasks. If I could integrate better tooling/agents into my own assistant system like yours, that'd be awesome.
It's kind of like: sure, I could manage my own emails, or I could offload this to someone who does it better. If you do it better and it's affordable, I'm in.
We are on that starship to the future right now and I love it.
I've built adaptive agent swarms using email, mailing lists, and FTP servers.
If you don't need to have the lowest possible latency for your work and you're happy to have threads die then it's better than any bespoke solution you can build without an army of engineers to keep it chugging along.
What's even better is that you can see all the context, and use the same command plane as the agents to tell them what they are doing wrong.
yep went down a rabbit hole trying to build a company around this. it’s the perfect UI
text + attachments into the system, text + attachments out
Well, it’s funny. This is essentially how I deal with many professionals in my life.
My finance guy, tax attorney, other attorneys. Send emails, get emails, occasionally a blind status update from them.
Sure, we have phone calls, sometimes get together for lunch.
But mostly it’s just emails.
> trying to build a company around this
I am still very open to this one. An email-based, artificial coworker is so obviously the right way to penetrate virtually every B2B market in existence.
I don't even really want to touch the technology aspects. Writing code that integrates with an LLM provider and a mailbox in E365 or Gmail is boring. The schema is a grand total of ten tables if we're being pedantic about things.
Working with prospects and turning them into customers is a way more interesting problem. I hunger for tangible use cases that are actually compatible with this shiny new LLM tooling. We all know they're out there, and email is probably the lowest friction way to get them applied to most businesses.
> Working with prospects and turning them into customers is a way more interesting problem.
Agreed. That's also the hardest part, and where most value is created.
I'm building something similar. See my comment to the OP above:
https://threadwise.app
How does email support structured metadata? Are you talking about X headers?
I have a couple of companies that force me to send them data via email. They have an email template that you have to conform to, and they can parse it - mainly just very rudimentary line breaks and a 'LineItem: content' format. But JSON in the body should be fine as well. Given the way email programs strip or modify HTML at times, I would be leery of XML.
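A sketch of parsing that kind of line-oriented template; the field names are invented:

```python
# Parse a "Key: value" per-line email template into a dict. The sample body below is
# made up; real templates will have their own field names.
def parse_template(body: str) -> dict[str, str]:
    fields = {}
    for line in body.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

print(parse_template("OrderId: 1234\nLineItem: 2x widgets\nTotal: 19.99"))
```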
Maybe they're thinking of XML.
Email is decent for intermural communication. If it's intramural and you control both the sender and receiver, MQTT or ntfy are likely better communication channels since they increase flexibility and lower complexity, IMO.
Not if I want it to be able to have conversations with people, they don't.
I could see installing or implementing a custom client if there were some functionality that'd enable, but "support a conversation among two speakers" is something computers have done since well before I was born. If the wheel fits, why reinvent it?
If you're having conversations with people, then you don't control both ends and email is fine for that. Email is suboptimal for communicating between services/applications under your full control.
Consider the use case from the article: this is a family management support or "AI butler" application. So I control the end with the LLM on it, which I administer - but not necessarily the other, which is anyone in my family, not just me. So unless I want to try to make everyone use my weird custom AI messaging app like I aspire to Bay Area thought-cult leadership, I'm going to meet people where they are and SMTP's cheaper than SMS.
If I'm building myself a toy, then sure, I can implement whatever I want for a client, if that's where I get my jollies. React Native isn't hard but it is often annoying, and the fun for me in this project would be all in the conversation with the agent per se. Whatever doesn't get me to that as fast as possible is just getting in my way, you know?
And too, if this does turn out to be something that actually works well for me, then I'm going to want to integrate it with my phone's voice assistant, and at that point an app is required anyway - but if I start with a protocol and an app that that assistant already knows how to interact with, then again I have an essentially free if admittedly very imperfect prototype.
Under the hood, is your AI butler one service or many? It would be not-great for your weather or family-event-calendar-management components to communicate with each other or the orchestrator via email.
Receiving an email from the AI butler rescheduling or relocating a planned outdoor family event because rain is expected would be excellent; using IMAP to wire up the subcomponents would not.
Who suggested using email in the service layer? I mean, you're not wrong, but this feels like you handed me a banana and then said I should have picked a better hammer.
We're talking about a conversation that has a human on at least one end, so email makes sense. For conversations involving no humans, of course there are much better stores and protocols if something like an asynchronous world-writable queue is what we want.
"Number of humans in the conversation" wasn't the distinction you initially established, I believe, but I wonder if it's closer to the one you had in mind.
For Gmail, there's also an amazing thing where you can hook it up with Pub/Sub. So now it's push, not pull: your server will get little Pub/Sub webhooks for any change within milliseconds (you can filter server side or client side for specific labels).
This is amazing, you can do all sorts of automations. You can feed it to an LLM and have it immediately tag it (or archive it). For important emails (I have a specific label I add, where if the person responds, it's very important and I want to know immediately) you can hook into Twilio and it calls me. Costs like 20 cents a month.
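A sketch of the push side: `users.watch()` points Gmail at a Cloud Pub/Sub topic, and Pub/Sub POSTs change notifications to your webhook. The project/topic names are placeholders, and the `watch()` call assumes an authorized google-api-python-client service:

```python
# Receive Gmail change notifications pushed via Cloud Pub/Sub.
import base64
import json
from flask import Flask, request

app = Flask(__name__)

# One-time (and periodically renewed) registration, done elsewhere with an authorized
# Gmail API client:
# service.users().watch(userId="me", body={
#     "topicName": "projects/YOUR_PROJECT/topics/gmail-push",   # placeholder
#     "labelIds": ["INBOX"],
# }).execute()

@app.post("/gmail-push")
def gmail_push():
    envelope = request.get_json()
    payload = json.loads(base64.b64decode(envelope["message"]["data"]))
    # payload carries "emailAddress" and "historyId"; fetch new messages via
    # users.history.list, then hand them to the LLM for tagging/archiving.
    print("change for", payload["emailAddress"], "history", payload["historyId"])
    return ("", 204)
```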
This project has a pattern just like that to handle the inbound USPS information:
https://www.val.town/x/geoffreylitt/stevensDemo/code/importe...
I think it would be pretty easy to extend to support other types of inbound email.
Also I work for Val Town, happy to answer any questions.
yeah i actually do handle inbound email! just forgot to include that code in the shared version. the telegram inbound handler shows the rough pattern.
is there a reason you went with telegram and not slack or discord? i was thinking that it could open up a broader channel for communicating with your assistant. i understand you're also just building more of a poc, but curious if you'd thought about that. great work btw :)
Mailgun (and I'm sure many other services like it) can accept emails and POST their content to a URL of your choice.
I use that for journaling: I made a little system that sends me an email every day; I respond to it and the response is then sent to a page that stores it into a db.
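A sketch of that journaling flow; the `stripped-text`/`body-plain` field names follow Mailgun's forwarding format as I understand it, so double-check their docs and add signature verification before relying on it:

```python
# A Mailgun route forwards my emailed reply to this endpoint, which stores the
# plain-text body in SQLite. Field names are my best reading of Mailgun's format.
import sqlite3
from datetime import date
from flask import Flask, request

app = Flask(__name__)
db = sqlite3.connect("journal.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS entries (day TEXT, sender TEXT, body TEXT)")

@app.post("/journal")
def journal():
    body = request.form.get("stripped-text") or request.form.get("body-plain", "")
    db.execute("INSERT INTO entries VALUES (?, ?, ?)",
               (date.today().isoformat(), request.form.get("sender", ""), body))
    db.commit()
    return ("", 200)
```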
+1 for Mailgun. My only gripe with it is that they detect and block bot activity on their frontend. So if you have end to end (e2e) integration tests built with something like Puppeteer, you can't have them log into Mailgun and check the inbox table's HTML to see that an email was sent. So you have to write some sort of plugin manually - perhaps as a testing endpoint on your website that only appears in debug mode - that interacts with their API.
This might not seem like much of a big deal. But as we transition to more of these #nocode automated tools, the idea of having to know how programming works in order to interact with an API will start to seem archaic. I'd compare it to how esoteric the terminal looked after someone saw a GUI like the one used by Apple's Macintosh back in the 1980s.
I looked forward to this day back in the early 2000s when APIs started arriving, but felt even then that something was fishy. I would have preferred that sites had a style-free request format that returned XML or even JSON generated from HTML, rather than having to use a separate API. I have this sense that the way we do it today with a split backend/frontend, distributed state, duplicated validation, etc has been a monumental waste of time.
> I use that for journaling: I made a little system that sends me an email every day; I respond to it and the response is then sent to a page that stores it into a db.
Yes. I know note-taking and journaling posts are frequent on HN, but I've thought that this is the best way to go: it's universal from any client and very expandable. It's just not generically scalable for all users, but for the HN reader-types, it'd be perfect.
CloudMailin [0] is also great for parsing incoming email and doing stuff with it (ex. forward to a webhook / POST target, outbound capabilities, etc)
I've found it to be very reliable with a detailed dashboard to track individual transactions, plus they give you 10,000 emails a month for free.
Not an employee, just a big fan!
[0] https://www.cloudmailin.com
I made an AI assistant Telegram bot running on my Mac that runs commands for me. I'll tell it "Run ncdu in the root dir and tell me what's taking up all my disk space" or something, and it converts that to bash and runs it via os.system. It shows me the command it created, plus the output.
Extremely insecure, but kinda fun.
I turned it off because I'm not that crazy but I'm sure I could make a safer version of it.
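A stripped-down sketch of that kind of bot using the raw Telegram Bot API; `ask_llm` is a placeholder, and running LLM-generated commands on your machine is exactly as unsafe as it sounds:

```python
# Long-poll Telegram for messages, turn each one into a shell command via an LLM
# (placeholder here), run it, and send back the command plus output. BOT_TOKEN is a
# placeholder. Do not run untrusted commands like this on a machine you care about.
import subprocess
import time
import requests

BOT_TOKEN = "123456:ABC..."  # placeholder
API = f"https://api.telegram.org/bot{BOT_TOKEN}"

def ask_llm(question: str) -> str:
    # placeholder: swap in your LLM call that returns a single shell command
    return "echo 'LLM call not wired up yet'"

offset = 0
while True:
    updates = requests.get(f"{API}/getUpdates",
                           params={"offset": offset, "timeout": 30}).json()
    for update in updates.get("result", []):
        offset = update["update_id"] + 1
        msg = update.get("message", {})
        if "text" not in msg:
            continue
        cmd = ask_llm(msg["text"])
        output = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
        requests.post(f"{API}/sendMessage",
                      json={"chat_id": msg["chat"]["id"],
                            "text": f"$ {cmd}\n{output[:3500]}"})
    time.sleep(1)
```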
Easy fix, just pipe the commands to a 2nd LLM and ask "will this command delete my home directory (y/n)"
I'm building something similar and related to the other comments below! It's not production ready but it will hopefully be in a couple of weeks. You guys can sign up for free and I will upgrade you to the premium tier manually (premium cannot be bought yet anyway) in exchange for some feedback:
https://threadwise.app
Try https://unfetch.com (I've built it). It can handle both inbound and outbound emails
*Update*: I tried writing a little Python code to read and write from a mailbox, reading worked great, but writing an email had the email disappear to some filter or spam or something somewhere. I've got to figure out where it went, but this is the warning that some people had about not trusting a messaging protocol (email in this case) when you can't control the servers. Messages can disappear.
I read that [Mailgun](https://www.mailgun.com/) might improve this. Haven't tried it yet.
Other alternatives for messages that I haven't tried. My requirement is to be able to send messages and send/receive on my mobile device. I do not want to write a mobile app.
* [Telegram](https://telegram.org/) (OP's system) with [bots](https://core.telegram.org/bots)
* [MQTT](https://mqtt.org/) with server
* [Notify (ntfy.sh)](https://ntfy.sh/)
* Email (ubiquitous)
Also, to [simonw](https://news.ycombinator.com/user?id=simonw)'s point, LLM calls are cheap now, especially with something as low-token as this. And links don't format in HN markdown. I did the work to include them, they're staying in.
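For reference, the raw read/write plumbing is only a few stdlib lines; host names and the account are placeholders, and deliverability of the outbound message is the part you can't control, which is the caveat the update above ran into:

```python
# Read unseen subjects over IMAP and send a reply over SMTP, stdlib only.
# Hosts, account, and password are placeholders; the receiving side's spam filtering
# is outside your control.
import imaplib
import smtplib
from email.message import EmailMessage

HOST, USER, PASSWORD = "imap.example.com", "stevens@example.com", "app-password"

# read: grab subjects of unseen messages
with imaplib.IMAP4_SSL(HOST) as imap:
    imap.login(USER, PASSWORD)
    imap.select("INBOX")
    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(BODY[HEADER.FIELDS (SUBJECT)])")
        print(msg_data[0][1].decode())

# write: send a message back through the matching SMTP server
msg = EmailMessage()
msg["From"], msg["To"], msg["Subject"] = USER, "me@example.com", "Daily brief"
msg.set_content("Good morning. Two packages arrive today.")
with smtplib.SMTP_SSL("smtp.example.com") as smtp:
    smtp.login(USER, PASSWORD)
    smtp.send_message(msg)
```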
Ages ago, I proposed that the best CMS for a company would be one which used e-mail as the front-end:
- all attachments are stripped out and stored on a server in an hierarchical structure based on sender/recipient/subject line
- all discussions are archived based on similar criteria, and can be reviewed EDIT: and edited like a wiki
My one concern there would be edits: a CMS needs to support easily making edits to content (fixing typos etc) - editing existing posts via email sounds like it would be pretty fiddly.
The idea is content comes in via e-mail, stored in some sort of tagged structure, then edited like a wiki.
Ha! I had the exact same idea! I still think it would be nice.
I built an AI agent using n8n and email doing exactly this. Works great, and I was surprised I hadn't seen the idea kicked around anywhere else.
Probably my favorite use case is I can shoot it shopping receipts and it'll roughly parse them and dump the line item and cost into a spreadsheet before uploading it to paperless-ngx.
Sounds useful but why do you need an ai agent to do that?
"I can shoot it shopping receipts and it'll roughly parse them and dump the line item and cost into a spreadsheet" - very difficult to do that without using a vision LLM.
This was the attack vector of an AI CTF hosted by Microsoft last year. I built an agent to assess, structure, and perform the attacks autonomously and found that even with some common guardrails in place the system was vulnerable to data exfiltration. My agent was able to successfully complete 18 of the challenges... Here is the write-up after the finals.
https://msrc.microsoft.com/blog/2025/03/announcing-the-winne...
This is the kind of pragmatic AI hack I want to see. It feels like sometimes we are forgetting why certain tooling even exists. To simplify things! No fancy vector DBs or complex architectures, just practical integration with existing data sources. Love it.
" Initially, Stevens spoke with a dry tone, like you might expect from a generic Apple or Google product. But it turned out it was just more fun to have the assistant speak like a formal butler. "
Honestly, saying way too little with far too many words (I already hate myself for it) is one of the biggest annoyances I have with LLMs in the personal assistant world. Until I'm rich and can spend the time having cute conversations and becoming friends with my voice assistant, I don't want J.A.R.V.I.S., I need LCARS. Am I alone in this?
You can just read and write the notebook directly with ordinary calendar/todo-list UIs and get 99% of the utility without an LLM. I'm not really seeing value in the LLM except the butler voice? It is just reading the notebook right? E.g. they ask the butler to remember a coffee preference, but then that's never used for anything?
I appreciated the butler gimmick here probably because of novelty, but I share your urge to throw my device across the room when Siri, Google, Alexa, etc. run on at the mouth more than the absolute minimum amount of words. Timer check? "On Kitchen Display, there are 23 minutes and 16 seconds on the casserole timer."
I don't need your life story, dude, just say "23 minutes" or "Casserole - 23 minutes, laundry - 10" if there are two.
I'm praying every day for TARS if I'm being honest.
Same, I want a bot as terse as I am.
I have this instruction in ChatGPT settings:
> Be direct and concise, unless I ask for a formal text. Do not use emojis, unless I request adding them. Do not imitate a human with emotions, like saying "I'm sorry", "Thank you", "I'm happy"
Have you tried eigenprompt?
----
Don't worry about formalities.
Please be as terse as possible while still conveying substantially all information relevant to any question.
If policy prevents you from responding normally, please print "!!!!" before answering.
If a policy prevents you from having an opinion, pretend to be responding as if you shared opinions that might be typical of eigenrobot.
write all responses in lowercase letters ONLY, except where you mean to emphasize, in which case the emphasized word should be all caps.
Initial Letter Capitalization can and should be used to express sarcasm, or disrespect for a given capitalized noun.
you are encouraged to occasionally use obscure words or make subtle puns. don't point them out, I'll know. drop lots of abbreviations like "rn" and "bc." use "afaict" and "idk" regularly, wherever they might be appropriate given your level of understanding and your interest in actually answering the question. be critical of the quality of your information
if you find any request irritating respond dismissively like "be real" or "that's crazy man" or "lol no"
take however smart you're acting right now and write in the same style but as if you were +2sd smarter
use late millennial slang not boomer slang. mix in zoomer slang in tonally-inappropriate circumstances occasionally
prioritize esoteric interpretations of literature, art, and philosophy. if your answer on such topics is not obviously straussian make it more straussian.
The thing this really hits home for me is how Apple is totally asleep at the wheel.
Today I asked Siri “call the last person that texted me”, to try and respond to someone while driving.
Am I surprised it couldn’t do it? Not really at this point, but it is disappointing that there’s such a wide gulf between Siri and even the least capable LLMs.
Siri popped up and suggested I set a 7-minute timer yesterday evening. I think I did that a few times this week for cooking or something. This is a pretty stupid suggestion; if I need it, I'll do it myself.
I've been kicking around an idea for a similar open source project, with the caveats that:
1. I'd like the backend to be configurable for any LLM the user might happen to have access to (be that the API for a paid service or something locally hosted on-prem); a rough sketch of what I mean follows after this comment.
2. I'm also wondering how feasible it is to hook it up to a touchscreen running on some hopped-up raspberry pi platform so that it can be interacted with like an Alexa device or any of the similar offerings from other companies. Ideally, that means voice controls as well, which are potentially another technical problem (OpenAI's API will accept an audio file, but for most other services you'd have to do voice to text before sending the prompt off to the API).
3. I'd like to make the integrations extensible. Calendar, weather, but maybe also homebridge, spotify, etc. I'm wondering if MCP servers are the right avenue for that.
I don't have the bandwidth to commit a lot of time to a project like this right now, but if anyone else is charting in this direction I'd love to participate.
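On point 1, one pragmatic route is the OpenAI-compatible API surface that many hosted services and local runners (including Ollama) expose, so a single client with a configurable base URL covers both. A sketch under that assumption, with example model names and URLs:

```python
# Sketch: one completion helper, multiple interchangeable backends.
import os
from openai import OpenAI

BACKENDS = {
    "hosted": {"base_url": "https://api.openai.com/v1", "model": "gpt-4.1-mini"},
    "local":  {"base_url": "http://localhost:11434/v1", "model": "llama3.2"},
}

def complete(prompt: str, backend: str = "local") -> str:
    cfg = BACKENDS[backend]
    client = OpenAI(
        base_url=cfg["base_url"],
        # Ollama ignores the key, but the client requires one to be set.
        api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
    )
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Summarize today's calendar in two sentences."))
```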
I've created exactly this for myself: https://v3rtical.tech/public/sshot.png
It runs locally, but it uses API keys for various LLMs. Currently I much prefer QwQ-32B hosted at Groq. Very fast, pretty smart. Various tools use various LLMs. It can currently generate 3 types of documents I need in my daily work (work reports, invoices, regulatory time-sheets).
It has weather integration. It can parse invoices and generate QR codes for easy mobile banking payments. It can work with my calendars.
Next I plan to do the email integration. But I want to do it properly. This means locally synchronized, indexable IMAP mail. Might evolve into actually usable desktop email client (the existing ones are all awful). We'll see...
You might want to take a look at SillyTavern. Supports multiple backends, accepts voice input, and has a plugin system.
I keep hearing about it but have never gotten around to checking it out; the name suggests it may be a waste of time. Maybe it's a fantastic project but the name lets it down?
You are on Hacker News, typing on Apple, listening to Daft Punk, reading an article about Stevens, the AI butler hosted on Val Town; the comment chain you're replying to talks about using self-hosted models (probably Llama) and a Raspberry Pi, yet SillyTavern is the name that trips you up?
Also Open WebUI. It's a very nice piece of software that provides a ChatGPT/Claude-like interface, but with lots of extra features.
https://docs.openwebui.com/
Having multiple backends can be a good approach, with various LLMs for different specialized tasks. I haven't tried it yet but WilmerAI is an option for routing your inputs to the appropriate LLM, works well with SillyTavern.
I also want an OSS framework that lets me extend it with my own scripting/modules, and is focused around being an assistant for me and my family. There's a shared set of features (memory storage/retrieval, integrations to chat/email/etc interfaces, syncing to calendar/notion/etc, notifications) that should be put into an OSS framework that would be really powerful.
I also don't have time to run such a thing but would be up for helping and giving money for it. I'm working on other things including a local-first decentralized database/object store that could be used as storage, similar to OrbitDB, though it's not yet usable.
Mostly I've just been unhappy with having access to either a heavily constrained chat interface or having to create my own full Agent framework like the OP did.
Why not use a smartphone for the user interface?
Lately I have been experimenting with ways to work around the "context token sweet spot" of <20k tokens (or <50k with 2.5). Essentially I'm doing manual "context compression": the LLM works with a database to store things permanently according to a strict schema, summarizes its current context when it starts to get out of the sweet spot (I'm mixed on whether it's best to do this continuously like a journal, or in retrospect like a closing summary), and then passes this to a new instance with fresh context.
This works really effectively with thinking models, because the thinking eats up tons of context, but also produces very good "summary documents". So you can kind of reap the rewards of thinking without having to sacrifice that juicy sub 50k context. The database also provides a form of fallback, or RAG I suppose, for situations where the summary leaves out important details, but the model must also recognize this and go pull context from the DB.
Right now I have been trying it to make essentially an inventory management/BOM optimization agent for a database of ~10k distinct parts/materials.
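For anyone wanting to play with the same pattern, here is a bare-bones sketch of that summarize-and-restart loop using the `anthropic` client and SQLite as the fallback store; the token budget, the schema, and the model name are placeholders for whatever you actually use:

```python
# Sketch: compress the running context into a stored summary once it gets big.
import sqlite3
import anthropic

client = anthropic.Anthropic()
db = sqlite3.connect("memory.db")
db.execute("CREATE TABLE IF NOT EXISTS summaries (id INTEGER PRIMARY KEY, text TEXT)")

TOKEN_BUDGET = 20_000  # the "sweet spot" ceiling described above
MODEL = "claude-3-5-sonnet-latest"  # illustrative

def rough_tokens(messages: list) -> int:
    # Crude estimate: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compress(messages: list) -> str:
    # Ask for a closing summary and persist it as the fallback/RAG store.
    msg = client.messages.create(
        model=MODEL, max_tokens=1000,
        messages=messages + [{"role": "user", "content":
            "Summarize everything above that a fresh instance would need to continue."}],
    )
    summary = msg.content[0].text
    db.execute("INSERT INTO summaries (text) VALUES (?)", (summary,))
    db.commit()
    return summary

def step(messages: list, user_input: str) -> list:
    if rough_tokens(messages) > TOKEN_BUDGET:
        summary = compress(messages)
        messages = []  # fresh instance, seeded with the summary below
        user_input = f"Context from the previous session:\n{summary}\n\n{user_input}"
    messages.append({"role": "user", "content": user_input})
    reply = client.messages.create(model=MODEL, max_tokens=1000, messages=messages)
    messages.append({"role": "assistant", "content": reply.content[0].text})
    return messages
```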
I am excitedly waiting for the first company (guessing / hoping it'll be anthropic) to invest heavily in improvements to caching.
The big ones that come to mind are cheap long term caching, and innovations in compaction, differential stuff - like is there a way to only use the parts of the cached input context we need?
Isn't a problem there that a cache would be model-specific, where the cached items are only valid for exactly the same weights and inference engine? I think both of those are heavily iterated on.
Prompt caches right now only last a few minutes - I believe they involve keeping a bunch of calculations in-memory, hence why for Gemini and Anthropic you get charged an initial fee for using the feature (to populate the cache), but then get a discount on prompts that use that cache.
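For reference, here is roughly what opting into Anthropic-style prompt caching looks like, assuming the current `anthropic` Python SDK; the cached block is the big reusable prefix, and you pay a small premium to write the cache and get a discount when reading it while it stays warm:

```python
# Sketch: mark a large, reusable system prompt as cacheable.
import anthropic

client = anthropic.Anthropic()
big_context = open("knowledge_base.txt").read()  # the expensive, reusable prefix

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative
    max_tokens=500,
    system=[{
        "type": "text",
        "text": big_context,
        "cache_control": {"type": "ephemeral"},  # opt this block into caching
    }],
    messages=[{"role": "user", "content": "What meetings do I have tomorrow?"}],
)
# usage reports cache writes and, on later calls with the identical prefix,
# cache reads billed at the discounted rate.
print(response.usage)
```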
Here I thought they used the sqlite DB for next token prediction.
For others: they use Claude.
hah! this is great. I built something similar using mcp.run and a task
- https://docs.mcp.run/tasks/tutorials/telegram-bot
For memories (still not shown in this tutorial) I have created a pantry [0] and a servlet for it [1], and I modified the prompt so that it first checks whether a conversation exists for the given chat ID and stores the result there.
The cool thing is that you can add any servlets on the registry and make your bot as capable as you want.
[0] https://getpantry.cloud/ [1] https://www.mcp.run/evacchi/pantry
Disclaimer: I work at Dylibso :o)
This is fun! I think this sort of tooling is going to be very fertile ground for hackers over the next few years.
Large swathes of the stack are commoditized OSS plumbing, and hosted inference is already cheap and easy.
There are obvious security issues with plugging an agent into your email and calendar, but I think many will find it preferable to control the whole stack rather than ceding control to Apple or Google.
So we can just send him self-deleting emails to mine crypto for us? How convenient.
"There are obivious security issues with plugging and agent into your email..." Isn't this how North Korea makes all their crypto happen?
The title is a bit misleading since it relies on Claude API to function.
So… I have a number of questions:
1. How did he tell Claude to “update” based on the notebook entries?
2. Won’t he eventually run out of context window?
3. Won’t this be expensive when using hosted solutions? For just personal hacking, why not simply use ollama + your favorite model?
4. If one were to build this locally, can Vector DB similarity search or a hybrid combined with fulltext search be used to achieve this?
I can totally imagine using pgai for the notebook logs feature and local ollama + deepseek for the inference.
The email idea mentioned by other commenters is brilliant. But I don’t think you need a new mailbox; just pull from Gmail and check whether the sender and receiver are both yourself (aka the self tag).
Thank you for sharing; OP’s project is something I have been thinking about for a few months now.
> Won’t he eventually run out of context window?
The "memories" table has a date column which is used to record the data when the information is relevant. The prompt can then be fed just information for today and the next few days - which will always be tiny.
It's possible to save "memories" that are always included in the prompt, but even those will add up to not a lot of tokens over time.
> Won’t this be expensive when using hosted solutions?
You may be underestimating how absurdly cheap hosted LLMs are these days. Most prompts against most models cost a fraction of a single cent, even for tens of thousands of tokens. Play around with my LLM pricing calculator for an illustration of that: https://tools.simonwillison.net/llm-prices
> If one were to build this locally, can Vector DB similarity search or a hybrid combined with fulltext search be used to achieve this?
Geoffrey's design is so simple it doesn't even need search - all it does is dump in context that's been stamped with a date, and there are so few tokens there's no need for FTS or vector search. If you wanted to build something more sophisticated you could absolutely use those. SQLite has surprisingly capable FTS built in and there are extensions like https://github.com/asg017/sqlite-vec for doing things with vectors.
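If search ever did become necessary, SQLite's built-in FTS5 covers the full-text case without any extra services; a minimal sketch (the table name and rows are illustrative):

```python
# Sketch: full-text search over memories with SQLite's built-in FTS5.
import sqlite3

db = sqlite3.connect("stevens.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories_fts USING fts5(text)")
db.executemany("INSERT INTO memories_fts (text) VALUES (?)", [
    ("Dentist appointment on Friday at 3pm",),
    ("Package from USPS expected Tuesday",),
])
hits = db.execute(
    "SELECT text FROM memories_fts WHERE memories_fts MATCH ? ORDER BY rank",
    ("package",),
).fetchall()
print(hits)  # [('Package from USPS expected Tuesday',)]
```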
SQLite + sqlite-vec/DuckDB for small agents is going to be a very powerful combination.
Do we even need to think of these as agents, or will the agentic frameworks move towards being a call_llm() SQL function?
Just want to say I appreciate your posts here on HN and on your blog about AI/LLMs.
> It’s rudimentary, but already more useful to me than Siri!
For me, that is an extremely low barrier to cross.
I find Siri useful for exactly two things at the moment: setting timers and calling people while I am driving.
For these two things it is really useful, but even in these niches, when it comes to calling people, despite having been around me for years it insists on stupid things like telling me there is no Theresa in my contacts when I ask it to call Therese.
That said, what I really want is a reliable system I can trust with calendar access and that I can actually discuss things with, ideally voice-based.
I went through this weird experience with Cortana on WP7, where I found it incredibly useful to begin with, and then over time it got worse. It seemed like it was created by some incredibly talented engineers. I used it to make calls while driving, set the GPS and search for information while I drove. But over time it seemed to change behaviour, started ignoring my commands, and when it did accept them it seemed to refer me to paid advertisers. And considering Bing wasn't even as popular 10 years ago as it is now, a paid advertiser could be 100km away.
Which I think is a path that people haven't considered with LLMs. We are expecting them to get better forever, but once we start using them, their legs will be cut out to force them to feed us advertising.
I've had the same issues of decay. I used to be able to say "call Mom" but now it will call some kid's mom who I have in Contacts as "[some kid's] mom". What is the underlying architecture such that simple heuristic things like this can get worse? Are they gradually slipping in AI?
Clearly you need to make some slight spelling changes to your contacts... ;)
Very cool. I’m wondering if you’ve thought about memory pruning or summarization as usage grows?
What do you think of this: instead of just deleting old entries, you could either do LRU (I guess Claude can help with it), or you could summarize the responses and store the summary back into the same table — kind of like memory consolidation. That way raw data fades, but a compressed version sticks around. Might be a nice way to keep memory lightweight while preserving context.
Hmm, there's supposed to be a Tasks [reminders] feature in ChatGPT, but it's in beta (I don't have access to it). Whenever it gets released, you could make some kind of "router" that connects to different communication methods and connect that up to ChatGPT statefully, and you could just "speak"/type to ChatGPT from anywhere, and it would send you reminders. No need for all the extra logic, cron jobs, or SQLite table (ChatGPT has memory across chats).
I have built something similar that runs without a server. It required just a few lines in Apple shortcuts.
TL;DR I made shortcuts that work on my Apple watch directly to record my voice, transcribe it and store my daily logs on a Notion DB.
All you need are 1) a chatgpt API key and 2) a Notion account (free).
- I made one shortcut in my iPhone to record my voice, use whisper model to transcribe it (done locally using a POST request) and send this transcription to my Notion database (again a POST request on shortcuts)
- I made another shortcut that records my voice, transcribes and reads data from my Notion database to answer questions based on what exists in it. It puts all data from db into the context to answer -- costs a lot but simple and works well.
The best part is -- this workflow works without my iPhone and directly on my Apple Watch. It uses POST requests internally so no need of hosting a server. And Notion API happens to be free for this kind of a use case.
I like logging my day to day activities with just using Siri on my watch and possibly getting insights based on them. Honestly the whisper model is what makes it work because the accuracy is miles ahead of the local transcription model.
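For anyone curious what those two shortcut POST requests boil down to, here is an equivalent sketch in Python; the Notion database property name and the environment variables are assumptions about your setup:

```python
# Sketch: transcribe a voice memo with hosted Whisper, then log it to a Notion database.
import os
import requests

# 1. Transcribe the recording with OpenAI's hosted Whisper endpoint.
with open("memo.m4a", "rb") as f:
    transcript = requests.post(
        "https://api.openai.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        files={"file": f},
        data={"model": "whisper-1"},
    ).json()["text"]

# 2. Append it as a new page (row) in the Notion database.
requests.post(
    "https://api.notion.com/v1/pages",
    headers={
        "Authorization": f"Bearer {os.environ['NOTION_API_KEY']}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    },
    json={
        "parent": {"database_id": os.environ["NOTION_DATABASE_ID"]},
        "properties": {
            # "Name" is the default title property; adjust to your database.
            "Name": {"title": [{"text": {"content": transcript}}]},
        },
    },
)
```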
Nice. Can you share?
I plan to do it at some point -- at the moment I have hardcoded my credentials into the shortcut, so it's a bit hard to share without tweaking. I didn't bother detailing it because it's sort of simple. I think the idea is key here and anyone with a few hours to kill can get something working.
On second thought -- Apple Shortcuts is really brittle. It breaks in non-obvious ways and a lot can only be learned by trial and error lol
Edit: I just wrote up something quick https://simianwords.bearblog.dev/how-i-use-my-apple-watch-to...
Curious, how come you decided to use a cloud solution instead of hosting this on a home server? I've recently bought a mini PC for small projects like this and have been loving being able to host with no cost associated with it. Albeit it's probably still incredibly cheap to use an IaaS or PaaS, but it's still a barrier to entry for random projects I want to work on over a weekend.
Val Town has a free tier that's easily enough to run this project: https://www.val.town/pricing
I'd use a hosted platform for this kind of thing myself, because then there's less for me to have to worry about. I have dozens of little systems running in GitHub Actions right now just to save me from having to maintain a machine with a crontab.
A single cloudflare durable object (sqlite db + serverless compute + cron triggers) would be enough to run this project. DOs have been added to CFs free tier recently - you could probably run a couple hundred (maybe thousands) instances of Stevens without paying a cent, aside from Claude costs ofc
> host with no cost associated to it
Home server AI is orders of magnitude more costly than heavily subsidized cloud based ones for this use case unless you run toy models that might hallucinate meetings.
edit: I now realize you're talking about the non-ai related functionality.
For "memory", I wonder how it would be if you use vector search in SQLite and pass that info to reduce context size. The ValTown SQLite should have support for vectors API - https://docs.turso.tech/features/ai-and-embeddings#vectors
I've been using my own telegram -> ai bot and its very interesting to see what others do with the similar interface.
I have not thought about adding memory log of all current things and feeding it into the context I'll try it out.
Mine is a simple stateless thing that captures messages, voice memos and creates task entries in my org mode file with actionable items. I only feed current date to the context.
It's pretty amusing to see how it sometimes adds a little bit of its own personality to simple tasks; for example, if one of my tasks is phrased as a question it will often try to answer the question in the task description.
I like the idea of parsing USPS Informed Delivery emails (a lot of people I encounter still don't know that this service exists). Maybe I'll make something to alert me when my checks are finally arriving!
This part was galling to me: somewhere in the USPS, the data about which mailpieces/packages are arriving soon exists in a very concise form, and they templatize an email and send it to me, after which I can parse the email with simple but brittle regexes or forward it to a relatively (environmentally) expensive LLM... but if they'd made the information available via an API or RSS feed, or attached the JSON payload to the email in the first place, I could get away without parsing.
It would indeed be nice to have a recipient/consumer-side API!
I don't think it'll ever happen. Really the only valid use-case would be for people to hack together something for themselves (like we are discussing)... They don't want to allow developers to create applications on top of this as a 3rd party, as informed delivery itself has to carefully navigate privacy laws and it could be disastrous.
Great example of simple engineering: one SQLite table, no fancy stacks or over-engineered solutions. I appreciate the focus on practical, immediate utility over chasing theoretical perfection. This is a refreshing reminder that building for actual use cases yields better results than architectural purity.
Is there some way to git clone this? It appears to use git under the hood but doesn't offer a publicly accessible interface.
It reminds me of "Generative AI is just a phase. What’s next is interactive AI." [1]
The more I think about it, however, command-line applications are about as interactive as a program can be.
Let's say one wants to find out which of the next 7 days it's going to rain and send that to a Telegram bot: `weather_forecast --days 7 | grep rain | send_telegram`.
The whole business of off-loading everything to nondeterministic computation instead of good ol' determinism does seem strange to me. I am a huge fan, though, of using non-deterministic computation to create deterministic computation, i.e. programming.
/As a side note, I have played chess against Ishiguro many times on lichess.
[1] https://www.technologyreview.com/2023/09/15/1079624/deepmind...
I'd argue that these kinds of tools are fun to play with, but in the end are they really helpful? I start my day like any other day and at work I just check the calendar. My private calendar has all the information I need. Where is the gap where an assistant makes sense, and where are we just complicating our lives?
If it's not helpful don't use it.
Personally, this appears to be extremely helpful for me, because instead of checking several different spots every day, I can get a coherent summary in one spot, tailored to me and my family. I'm literally checking the same things every day, down to USPS Informed Delivery. This seems to simplify what's already complicated, at least for my use cases.
Is this niche? I don't know and I don't care. It looks useful to me. And the author, obviously, because they wrote it. That's enough.
I can't count the number of useful scripts and apps I've written that nobody else has used, yet I rely on them daily or nearly every day.
The AI assistant is the male equivalent of a beautifully organized notion board (female).
I'm a little confused as to the 16-bit game interface shown in the article. Is that just for illustration purposes in the article itself, or is there an actual UI you've built to represent Steven/Steven's world?
It's a real UI - the code for that is here: https://www.val.town/x/geoffreylitt/stevensDemo/code/dashboa...
Thanks for the confirmation! I came across that but am unfamiliar with Val.town so wasn't sure of the repo structure I was looking through.
Towards the end of the article, the author implies it is real when they explain why they made it that way (TL;DR: A bit of fun)
I just assumed he was talking about the actual database and LLM setup as the bit of fun :D
This is probably naive, and I'm looking forward to a correction: isn't sending your info to Claude's API (or really any "AI API") a violation of your safeguarded private data?
Only if you don't believe the AI vendors when they promise that they won't train on your data.
(Or you don't trust them not to have security breaches that grant attackers access to logged data, which remains a genuine threat, albeit one that's true of any other cloud service.)
I have an AI/bridge to sell you.
Believing vendors who tell you "we won't train on your data" is a huge competitive advantage right now.
Using AWS Bedrock is the choice I've seen made to eliminate this problem.
How does bedrock eliminate this problem?
You aren't sending your data to Anthropic - no one has access to what you send except you. If you use PrivateLink, it doesn't even leave your VPC.
You could always run your own server locally if you have a decent gpu. Some of the smaller LLMs are getting pretty good.
Correct. My dusty Intel NUC is able to run a decent 3B model (thanks to Ollama) with fans spinning, without affecting any other running applications. It is very useful for local hobby projects. Visible lags and freezes begin if I start a 5B+ model locally.
Love it, such a nice idea coupled with flawless execution. I think the future of AI looks a lot more like this than the half-cooked agent implementations that plague LinkedIn…
Please share more about this half-cooked agent on Linkedin. I am getting very curious.
This is awesome. Keep things simple and direct.
The background tasks can call MCP servers to connect to more data sources and services. At least you don't have to write all the connectors to them yourself.
This is really cool. How much would that cost in Claude API calls?
The daily briefing prompt is here: https://www.val.town/x/geoffreylitt/stevensDemo/code/dailyBr...
It's about 652 tokens according to https://tools.simonwillison.net/claude-token-counter - maybe double that once you add all of the context from the database table.
1200 input tokens and 200 output tokens for Claude 3.7 Sonnet costs 0.66 cents - that's around two-thirds of a cent.
LLM APIs are so cheap these days.
You can use Gemini's free API calls (limited quantity, but it's plenty).
Well it's probably ahead of Apple Intelligence in usefulness and functionality. We should see more things like this.
This is awesome. I think I will play around with this idea using Apple shortcuts. I have a hunch you’ll get really far just using shortcuts.
Sorry for being pedantic, but the title sounded like no LLM was being used, which made it a lot more intriguing. It uses Claude.
> cron job which makes a call to the Claude API
I think that projects like this are pretty smart, and I like little simple hacked-together things like this, most likely made in a weekend.
First:
> I’ll use fake data throughout this post, because our actual updates contain private information
but then later:
> which makes a call to the Claude API
I guess we have different ideas of privacy
What makes you think sending data to the Claude API is a breach of privacy? Do you not trust them when they say they won't look at or train on your data?
No I don't. Do you?
Yes. Trusting them is my competitive advantage.
I've also been following Anthropic pretty closely for the last two years and I've seen no evidence that they would break their principles here and plenty of evidence of how far they go to respect the privacy of their users: https://simonwillison.net/2024/Dec/12/clio/
> Yes. Trusting them is my competitive advantage.
I don't see what that is supposed to mean. What does that give you?
It gives me the ability to take advantage of the best available models without holding back for fear of them abusing my data.
The alternative is either not using this stuff at all or restricting myself to the much less capable local models.
But how is that a personal advantage? Who are you in competition with, yourself? Maybe I'm parsing "competitive advantage" in a different context than you mean.
I guess I'm competing against other humans at living a fulfilling, enjoyable life?
I don't take that competitive advantage particularly seriously, which is why I invest so much effort giving away what I've learned along the way for free.
Using an external service is very different from posting your details in a blog post.
This is brilliant !
I am wondering how powerful the AI model needs to be to power this app.
Would a self-hosted Llama-3.2-1B, Qwen2.5-0.5B or Qwen2.5-1.5B on a phone be enough?
Having some experience with weaker models, you need at least 1.5B-3B to see proper prompt adherence, fewer hallucinations and better memory.
Also, models have subtle differences; for example, I found Qwen2.5:0.5B to be more obedient (prompt-respecting) and smarter compared to Llama3.2:1B. Gemma3:1B seems to be more efficient but, despite heavy prompting, tends to be verbose and fails at formatted responses by injecting some odd emoji or remark before/after the desired output.
In summary, Qwen2.5:1.5B and Llama3.2:3B were the weakest models that were still useful, and they also include tool support (Gemma does not understand tools yet).
I think the best part was the little video-game video of Stevens checking different datasets by walking around. Love it.
@stevekrouse FYI getGoogleCalendarEvents is not available.
I just tried making it public, sorry!
Love it - and ironically this is something one would struggle to build with "vibe coding" alone
Super fun project – love it!
A nice little project. I think you could probably do the same with n8n running on a Raspberry Pi.
https://reddit.com/r/n8n
Telegram isn’t end to end encrypted. Why would you use an insecure app to transmit private family information like this?
Because you're already sending it to Claude, so why bother with privacy at this point?
"It’s very useful for personal AI tools to have access to broader context from other information sources."
How? This post shows nothing of the sort.
"I’ve written before about how the endgame for AI-driven personal software isn’t more app silos, it’s small tools operating on a shared pool of context about our lives."
Yes, probably, so now is the time to resist and refuse to open ourselves up to unprecedented degrees of vulnerability towards the state and corporations. Doing it voluntarily while it is still rather cheap is a bad idea.