Personal AI Project - update
Is web search and summarization (with iteration) really how we do research? What would an AI Agent that more closely mirrors how a person actually researches look like?
Introduction
In the previous post, I started from an AI Agent designed for web-based research. Initially, it relied on Google Gemini and Google Search, but I modified it to use Qwen3 (or other self-hosted models of your choice) and DuckDuckGo.
Conceptually, the workflow looked like this:
Upon reflection, I realized that this approach was somewhat naive and didn’t fully capture how I actually conduct research. So, I began by mapping out my internal research flow—as much as I’m consciously aware of it. I also wanted to preserve its state-machine structure for various reasons.
The result is a bit more chaotic but aligns more closely with how I process information. It reflects how I recall certain things in a vague, hint-driven manner (search cues), how I leverage bookmarks and documents, and, more importantly, how I use mental models and a conceptual graph of ideas, relationships, and attributes—one might even call it a mental ontology.
With that in mind, I decided to implement the lowest-hanging fruit—memory. For those familiar with AI Agents, it is one of the most common components. Typically, this component involves ingesting data into a Vector DB and retrieving it based on similarity, which naturally leans toward a certain “vagueness.”
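To make that "vagueness" concrete, here is a minimal, self-contained sketch of similarity-based recall; the embed function below is only a stand-in for a real embedding model, not any particular library:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# "Ingest": store (text, vector) pairs - the essence of a vector DB.
memory = [(doc, embed(doc)) for doc in [
    "Qwen3 can be self-hosted as a drop-in replacement for Gemini",
    "DuckDuckGo exposes a simple search API",
    "Bookmarks often encode what I found trustworthy before",
]]

def recall(query: str, k: int = 2) -> list[str]:
    """Retrieve the k nearest memories by cosine similarity."""
    q = embed(query)
    scored = sorted(
        memory,
        key=lambda item: np.dot(q, item[1]) / (np.linalg.norm(q) * np.linalg.norm(item[1])),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]

print(recall("which local model replaces the hosted one?"))
```

Whatever happens to be nearest comes back, relevant or not, which is exactly the hint-driven fuzziness I wanted for recollection.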
Implementation
I was less surgical this time, making more substantial changes to the backend. I also began addressing the technical debt of the original project from Google.
Check it out: https://github.com/eyal-lantzman/personal-ai
Graph State: About 40% of the changes focused on state management:
Initial State – Inputs to the agent
Intermediate State – The core state of the agent, where everything converges
Final State – The agent’s output
I also introduced BaseState since messages are needed across all states.
Additionally, I added OverallState, which provides LangGraph with complete state information, allowing it to seamlessly manage channels as inputs/outputs for various methods.
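As a rough sketch, the layering might look something like the following; the concrete field names are illustrative placeholders, not the repo's actual definitions:

```python
from typing_extensions import TypedDict
from langchain_core.messages import AnyMessage

class BaseState(TypedDict):
    # Messages are needed by every stage of the graph.
    messages: list[AnyMessage]

class InitialState(BaseState):
    # Inputs to the agent.
    research_topic: str

class IntermediateState(BaseState):
    # Core working state where web results and recollections converge.
    search_queries: list[str]
    web_results: list[str]
    recalled_memories: list[str]

class FinalState(BaseState):
    # The agent's output.
    answer: str
    sources: list[str]

class OverallState(InitialState, IntermediateState, FinalState):
    # Union of all fields, so LangGraph can expose the right channels
    # as inputs and outputs for each node.
    pass
```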
Graph Execution: Another 40% of the changes focused on execution:
Added reconciliation with langmem for memory search and management, using InMemoryStore for the current implementation (a sketch follows this list).
Introduced parallel recollection, running alongside web research.
Added an inferred region to web research, based on language, spelling, style, and location data.
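On the memory side, here is a minimal sketch assuming langmem's tool helpers over a LangGraph InMemoryStore; the namespace and embedding configuration are placeholder choices, not the project's actual settings:

```python
from langgraph.store.memory import InMemoryStore
from langmem import create_manage_memory_tool, create_search_memory_tool

# In-memory store with semantic indexing; the dims/embed values are
# placeholders and require the corresponding embedding provider.
store = InMemoryStore(
    index={"dims": 1536, "embed": "openai:text-embedding-3-small"}
)

# langmem tools for writing and searching memories; they pick up the store
# from the graph they run in (e.g. builder.compile(store=store)).
manage_memory = create_manage_memory_tool(namespace=("research", "memories"))
search_memory = create_search_memory_tool(namespace=("research", "memories"))

# The store can also be used directly for recollection:
store.put(("research", "memories"), "m1", {"text": "Qwen3 replaced Gemini in the agent"})
hits = store.search(("research", "memories"), query="which model does the agent use?")
```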
Observations and Learnings
As I reflected on my thought process, I realized that my career journey has shaped the way I approach problems—or at least how I describe them. Concepts like ontology, graphs, and reconciliation have become central to my reasoning. This led me to wonder: how do others think about and articulate their research processes?
LangGraph’s approach to managing data through channels that map to fields in state objects (TypedDict or Pydantic) is quite powerful and interesting. However, it requires a fairly deep understanding of low-level LangGraph mechanics to avoid data loss. You can learn more here.
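For example, when two branches run in parallel (web research and recollection) and both write to the same state field, LangGraph has to be told how to merge the writes; otherwise the update is rejected or one branch's data is lost. A hedged sketch of what that annotation looks like:

```python
import operator
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class OverallState(TypedDict):
    # Each field is a channel. Annotating it with a reducer tells LangGraph
    # how to combine concurrent writes from parallel branches.
    messages: Annotated[list, add_messages]        # merge message lists by id
    sources: Annotated[list[str], operator.add]    # concatenate partial results
    summary: str                                   # plain channel: last write wins
```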
LangMem is a small wrapper around LangGraph stores, but it offers intriguing areas for exploration, such as prompt optimization and knowledge extraction.
Even if the AI Agent runs and generates output for me, how do I know I can trust its results?
At what points in the process should I intervene to ensure that the "mental graph" remains coherent?
I need to add more tests for the graph—the overall state machine.
Since AI is non-deterministic, I had to run some tests multiple times to establish a reasonable level of confidence in the results.
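One cheap way to do this with pytest is to repeat the same check several times and assert on a property of the output rather than an exact string; the run_agent stub below is a placeholder for invoking the compiled graph:

```python
import pytest

def run_agent(topic: str) -> str:
    """Placeholder for invoking the compiled graph; swap in the real agent."""
    return f"Summary of {topic} [1] with cited sources."

# Repeat the same check several times; a single pass means little when the
# underlying model is non-deterministic.
@pytest.mark.parametrize("attempt", range(5))
def test_output_cites_at_least_one_source(attempt):
    result = run_agent("self-hosted LLMs for research agents")
    assert "[1]" in result or "http" in result
```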
Future Work
Tests are high on the priority list, as complexity will only increase from here.
There are a few technical TODOs I need to revisit—some are obvious bugs, some are enhancements, and some are ideas for the future.
Then there are all the concepts from the introduction, which will require significant effort to refine and incorporate.
I also need to introduce human-in-the-loop verifications and clarifications, alongside reference checks. These are essential for ensuring I can trust the AI Agent’s output, especially in areas where my own knowledge is limited.
Wrap up
I used mindfulness to better understand my thought process, which helped me identify the capabilities I wanted in my personal AI project. However, it didn’t really help me pinpoint the ones necessary to trust an AI system’s outcomes—since I know how flawed its responses can be, even when they’re well-articulated.
Everything I write reflects my own opinions and perspectives and does not represent my past, current, or future employers.
Is the trust in a result from an AI a function of the result or a function of the AI? Meaning, we trust answers on the basis of prior experience: previous answers that have been confirmed to be true. This implies a verification feedback loop. Generating multiple results and averaging over the set is a popular verification approach in AI, but that strikes me as somewhat echo-chambery. Maybe breaking the walls of that chamber by using different models for some of the generations, and for validation, can help here.

Ultimately, it seems, one cannot verify the correctness of an answer without applying it. This shifts the focus towards the "undo" space: if one can undo the result for "free", there is no need to trust the answer up front. That assumes we can verify that the result of applying it is the desired one (evaluation over measurement in a different space) and that it can be undone at acceptable cost, which may not always be possible ("let me amputate this limb; looks unnecessary", said the AI 😀).

Introducing Human-In-The-Loop as an element of ensemble verification prior to execution is a cheat that can help in the early stages. It can generalise better if we can identify qualified humans, but it is really "just" an optimisation over the original done-by-human scenario.