Phase 1 of 6
Foundations
Mental models for how AI systems work
58%
15 Done
7 Partial
10 Todo
◐CONCEPT
AI vs ML vs deep learning vs generative AI — the taxonomy
→Write a taxonomy doc comparing your 12+ projects by AI type — classify each (Dosh=GenAI text+vision, ServicePulse=ML classification pipeline, FPL BlackBox=GenAI+edge ML, Local LLM Stack=self-hosted inference) — 2 hours
- Create a spreadsheet with columns: project, AI type, model(s) used, input modality, output type
- Classify each project: pure GenAI, ML pipeline, hybrid, or rule-based with AI augmentation
- Write a 1-page summary explaining where each project sits on the AI taxonomy and why
- Identify gaps — which AI types are you not using yet (e.g. reinforcement learning)?
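The spreadsheet in step 1 can be started as a plain CSV. A minimal sketch, with placeholder rows (only Dosh's classification comes from this plan; the other fields are hypothetical and should be filled in from your own project notes):

```python
import csv

# Columns from the task: project, AI type, model(s) used, input modality, output type.
rows = [
    {"project": "Dosh", "ai_type": "pure GenAI", "models": "Claude Haiku",
     "input_modality": "text + vision", "output_type": "category labels"},
    {"project": "ServicePulse", "ai_type": "ML pipeline", "models": "TBD",
     "input_modality": "text", "output_type": "classification"},
]

with open("ai_taxonomy.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```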
Google's ML Crash Course: https://developers.google.com/machine-learning/crash-course
Stanford CS229 lecture 1 (Andrew Ng) — the taxonomy explained in 20 minutes
○CONCEPT
Neural networks: weights, activations, backpropagation
→Follow Karpathy's Zero to Hero series, then train a tiny classifier on your Dosh transaction categories using PyTorch on the Mac mini — 1 week of evenings
- Watch Karpathy's 'The spelled-out intro to neural networks' (2.5 hours) — code along in a Jupyter notebook
- Export 500 labelled Dosh transactions from SQLite with their categories
- Build a simple 2-layer neural network in PyTorch that classifies transactions by category
- Train it on Mac mini, plot loss curves, and compare accuracy to your Haiku prompt-based categoriser
- Write up what backpropagation actually does in plain English — test by explaining it to someone
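The 2-layer network in step 3 could start from a sketch like this. Random features and labels stand in for the 500 labelled Dosh transactions (your real inputs would be featurised transaction descriptions); the shape of the training loop is the point:

```python
import torch
import torch.nn as nn

# Stand-in data: each row a feature vector, each label a category id.
torch.manual_seed(0)
n_samples, n_features, n_categories = 500, 32, 10
X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_categories, (n_samples,))

# The 2-layer network from the task: linear -> ReLU -> linear.
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_categories),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass
    loss.backward()               # backpropagation: gradients for every weight
    opt.step()                    # gradient descent update
    losses.append(loss.item())
```

Plot `losses` to get the loss curve from step 4, then swap in the real exported transactions.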
Karpathy's Neural Networks: Zero to Hero playlist: https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
3Blue1Brown 'Neural Networks' series (4 videos, ~1 hour total): https://www.3blue1brown.com/topics/neural-networks
PyTorch 60-minute blitz: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
◐CONCEPT
Supervised / unsupervised / reinforcement learning
→Your Dosh categoriser is supervised learning (labelled transactions). Add k-means clustering (unsupervised) to automatically group similar transactions and discover spending patterns you haven't labelled yet — 1 afternoon
- Export Dosh transactions with embeddings (or generate embeddings with nomic-embed-text on Mac mini)
- Run k-means clustering (scikit-learn) with k=10,20,30 and compare cluster quality with silhouette scores
- Visualise clusters with UMAP or t-SNE — do they match your existing categories?
- Write a comparison: supervised (your Haiku categoriser) vs unsupervised (clusters) — when would you use each?
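The k-sweep in step 2 can be sketched with scikit-learn. Random blobs stand in for the real transaction embeddings (swap in the nomic-embed-text vectors and k = 10, 20, 30 for real use):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for transaction embeddings: three synthetic blobs in 8 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 8)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in (2, 3, 5):   # use 10, 20, 30 on real transaction data
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # higher = tighter, better-separated clusters

best_k = max(scores, key=scores.get)
```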
scikit-learn clustering guide: https://scikit-learn.org/stable/modules/clustering.html
Google's ML Crash Course — Clustering module: https://developers.google.com/machine-learning/clustering
◐CONCEPT
Transformer architecture — why it matters
→Watch Karpathy's 'Let's build GPT' while running inference on your Mac mini with MLX — pause at each architecture component and map it to what's happening in your local Gemma 3 4B model — 1 afternoon
- Watch Karpathy's 'Let's build GPT from scratch' (2 hours) — take notes on each component
- Run Gemma 3 4B on Mac mini via MLX and inspect the model config (layers, heads, dimensions)
- Map paper concepts to real values: how many attention heads? What's the embedding dimension? How many parameters per layer?
- Write a 1-pager: 'What happens inside Gemma when I send it a prompt' — trace the full path
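For step 2, Hugging Face-style checkpoints ship a `config.json` next to the weights, which holds the values step 3 asks for. A small helper, assuming the common HF key names (verify them against the actual file in your Gemma download):

```python
import json

def summarise_config(path):
    """Pull the architecture numbers out of a checkpoint's config.json.
    Key names follow the usual Hugging Face convention — check your file."""
    with open(path) as f:
        cfg = json.load(f)
    return {
        "layers": cfg.get("num_hidden_layers"),
        "heads": cfg.get("num_attention_heads"),
        "embedding_dim": cfg.get("hidden_size"),
        "vocab_size": cfg.get("vocab_size"),
    }
```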
Karpathy's 'Let's build GPT': https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated Transformer by Jay Alammar: https://jalammar.github.io/illustrated-transformer/
✓CONCEPT
Tokens, embeddings, and vector spaces
✓CONCEPT
Temperature, top-p, top-k — how sampling works
✓CONCEPT
Context windows: limits, growth, implications
◐CONCEPT
Training vs inference — economics and trade-offs
→Build a cost dashboard aggregating inference spend across all your projects (Anthropic, OpenAI, Azure, Cloudflare). Compare monthly inference cost vs what fine-tuning Gemma 4B on Mac mini would cost in electricity — 2 hours
- Pull billing data from Anthropic Console, Azure Portal, and Cloudflare dashboard for the last 3 months
- Create a spreadsheet: project, provider, monthly tokens, monthly cost, cost per request
- Calculate Mac mini electricity cost for local inference (measure wattage during MLX inference, multiply by hours)
- Write a summary: 'My AI economics — where money goes and where local saves it'
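The arithmetic behind steps 2 and 3 fits in two small functions. Every number below is an illustrative placeholder: substitute your real billing figures, the per-million-token prices from the provider pricing pages, the wattage you measure during MLX inference, and your electricity tariff:

```python
def api_monthly_cost(tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """API cost given per-million-token input/output prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m

def local_monthly_cost(watts, hours_per_day, price_per_kwh, days=30):
    """Electricity cost of running local inference on the Mac mini."""
    return watts / 1000 * hours_per_day * days * price_per_kwh

# Hypothetical placeholder figures, not real prices or measurements:
api = api_monthly_cost(20_000_000, 5_000_000, 0.80, 4.00)
local = local_monthly_cost(watts=40, hours_per_day=4, price_per_kwh=0.30)
```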
Anthropic pricing page: https://www.anthropic.com/pricing
'The Inference Cost Equation' analysis — search your AI Monitor RSS feeds
✓CONCEPT
Fine-tuning vs prompting vs RAG — when to use which
◐CONCEPT
RLHF — what it is and why it matters
→Read the Constitutional AI paper (Anthropic, 2022) — you reference it in your prompt library constantly. Your ha-log-agent's risk-based approval tiers are a practical form of alignment via human feedback — 1 hour
- Read the Constitutional AI paper — focus on sections 2 (method) and 4 (results), skip appendix (45 minutes)
- Map ha-log-agent's approval tiers to the paper's concepts: where is the 'constitution'? Where is the 'human feedback'?
- Write a short note: 'How RLHF connects to what I build' — reference Telegram bot, ha-log-agent, and prompt library
Constitutional AI paper: https://arxiv.org/abs/2212.08073
Anthropic's RLHF blog post: https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant
Chip Huyen's RLHF explainer: https://huyenchip.com/2023/05/02/rlhf.html
✓CONCEPT
GPT-4o/5, Claude, Gemini, Llama, Mistral, Qwen — capabilities and trade-offs
✓CONCEPT
Open vs closed source models
✓CONCEPT
Multimodal models — vision, audio, video
○CONCEPT
Model benchmarks: MMLU, HumanEval, HELM, LMSys Arena
→Run your FPL BlackBox chat prompt through 3 models (Claude Haiku, GPT-4o-mini, Llama 3.2 3B local), score outputs on accuracy and helpfulness. Compare your scores to public benchmarks to see where they align and diverge — 1 afternoon
- Create 10 FPL test prompts covering different complexities (simple lookup, multi-step reasoning, opinion)
- Run each through Claude Haiku (FPL BlackBox), GPT-4o-mini (via LiteLLM), and Llama 3.2 3B (Mac mini)
- Score each response 1-5 on accuracy, helpfulness, and latency. Create a comparison table
- Look up the same models on LMSys Chatbot Arena and MMLU leaderboard — where do your findings agree/disagree?
- Write a 1-pager: 'What benchmarks miss — lessons from my own evaluation'
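The comparison table in step 3 is a small aggregation once the 1-5 judgments are in. A sketch with made-up placeholder scores (fill in one entry per prompt, per metric, from your own evaluation):

```python
from statistics import mean

# Placeholder 1-5 scores; one list entry per test prompt in a real run.
scores = {
    "claude-haiku":  {"accuracy": [4, 5, 3], "helpfulness": [4, 4, 4]},
    "gpt-4o-mini":   {"accuracy": [4, 4, 4], "helpfulness": [5, 4, 3]},
    "llama-3.2-3b":  {"accuracy": [3, 3, 2], "helpfulness": [3, 4, 3]},
}

table = {
    model: {metric: round(mean(vals), 2) for metric, vals in metrics.items()}
    for model, metrics in scores.items()
}
for model, row in table.items():
    print(f"{model:<14} accuracy={row['accuracy']:.2f} helpfulness={row['helpfulness']:.2f}")
```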
LMSys Chatbot Arena leaderboard: https://chat.lmsys.org/?leaderboard
HELM benchmark: https://crfm.stanford.edu/helm/latest/
Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
○CONCEPT
Scaling laws — compute, data, parameters
→Use your AI Monitor RSS feeds to flag scaling laws papers. Read the Chinchilla paper and 2 follow-ups — your feeds already track DeepMind and Anthropic who publish on this — 3 hours
- Search AI Monitor for 'scaling laws' articles from the last 6 months — pick the 3 highest-scored
- Read the Chinchilla paper (Hoffmann et al., 2022) — focus on the key finding: optimal compute allocation
- Read one follow-up (e.g. 'Scaling Data-Constrained Language Models') to see how the field has evolved
- Write a summary: 'What scaling laws mean for my model choices' — why Gemma 4B exists, why Haiku is cheap
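Two widely used rules of thumb from this literature make the "why Gemma 4B exists" argument concrete: training compute is roughly C ≈ 6ND (N parameters, D training tokens), and the Chinchilla result is often summarised as a compute-optimal data budget of roughly 20 tokens per parameter. A sketch using those approximations:

```python
def train_flops(n_params, n_tokens):
    """Standard approximation: training compute ≈ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * n_params

n = 4_000_000_000            # a Gemma-3-4B-sized model
d = chinchilla_tokens(n)     # ~80B tokens to train compute-optimally
c = train_flops(n, d)
```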
Chinchilla paper: https://arxiv.org/abs/2203.15556
Scaling Data-Constrained LMs: https://arxiv.org/abs/2305.16264
Epoch AI scaling laws dashboard: https://epochai.org/data/notable-ai-models
◐CONCEPT
Reasoning models: o3, DeepSeek-R1, extended thinking
→You already use extended thinking in Telegram bot and deepseek-r1:14b on Mac mini. Run a structured comparison: same 5 complex prompts through Claude extended thinking, DeepSeek-R1 14B local, and standard Claude Sonnet. Measure quality and latency — 2 hours
- Design 5 test prompts requiring multi-step reasoning (e.g. 'Plan my FPL transfers considering fixture swings, budget, and bench coverage')
- Run each through: Claude Sonnet (no thinking), Claude Sonnet (extended thinking), DeepSeek-R1 14B (Mac mini)
- Score outputs on reasoning depth, accuracy, and time-to-response
- Document when extended thinking helps vs when it's overkill — create a decision framework
Anthropic extended thinking docs: https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
◐CONCEPT
SLMs — Phi, Gemma, Qwen 0.5B — when small beats big
→Your auto-router already uses Gemma 3 4B QAT. Benchmark it against llama3.2:3b and qwen3:8b on 20 of your actual RAG queries from the local doc watcher. Document when the smaller model wins on speed without losing quality — 2 hours
- Export 20 recent queries from your local LLM stack logs (variety of topics)
- Run each through Gemma 3 4B (MLX), Llama 3.2 3B (Ollama), and Qwen 3 8B (Ollama)
- Measure: tokens/second, response quality (1-5), and RAM usage for each
- Create a decision matrix: for which query types does smaller win? Where does 8B quality matter?
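For the tokens/second measurement in step 3, Ollama's non-streaming `/api/generate` response includes `eval_count` (output tokens) and `eval_duration` (nanoseconds); check these field names against your Ollama version's API docs. A sketch using only the standard library:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model, prompt):
    """Single non-streaming completion via the Ollama REST API."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def tokens_per_second(resp):
    """Throughput from Ollama's eval_count / eval_duration (ns) fields."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)
```

Loop `generate` over the 20 exported queries per model, recording `tokens_per_second` alongside your 1-5 quality score; RAM usage still needs Activity Monitor alongside.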
Microsoft Phi-3 technical report: https://arxiv.org/abs/2404.14219
Google Gemma 3 model card: https://ai.google.dev/gemma/docs
✓STRATEGIC
How to evaluate and select models for use cases
○CONCEPT
Attention mechanism — self-attention, multi-head, cross-attention
→Local LLM Stack — add attention weight visualisation to Open WebUI or use BertViz with a small model. MLX exposes attention weights from Gemma 3 during inference — 1 afternoon
- Watch 3Blue1Brown's 'Attention in transformers, visually explained' (26 min)
- Install BertViz (pip install bertviz) and visualise attention patterns on a small BERT model with your own text
- Use MLX's model inspection to extract attention weights from Gemma 3 4B on a sample prompt
- Compare attention patterns across layers — which heads attend to syntax vs semantics?
- Write a 1-pager explaining self-attention, multi-head, and cross-attention with your own examples
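Before reaching for BertViz, the mechanism itself is a few lines of NumPy. A single-head scaled dot-product self-attention sketch with random stand-in weights, useful as a reference while writing the 1-pager:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq, seq): who attends to whom
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
out, weights = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention runs several of these in parallel with different projections and concatenates the outputs; cross-attention takes K and V from a different sequence than Q.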
3Blue1Brown 'Attention in transformers': https://www.youtube.com/watch?v=eMlx5fFNoYc
BertViz interactive attention visualisation: https://github.com/jessevig/bertviz
The Illustrated Transformer (attention section): https://jalammar.github.io/illustrated-transformer/
○CONCEPT
Positional encoding — absolute, relative, RoPE
→Karpathy's Zero to Hero covers positional encoding. Compare how Gemma 3 (uses RoPE) handles long context vs older absolute-position models. Your local stack can test this empirically — 2 hours
- Watch the positional encoding section of Karpathy's 'Let's build GPT' (timestamp ~45 min)
- Read the RoPE paper abstract + method section (15 minutes)
- Test Gemma 3 4B on prompts with key info at positions 100, 1000, and 3000 tokens — does retrieval accuracy degrade?
- Compare with Llama 3.2 3B (also RoPE) at the same positions — do they behave similarly?
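The rotation at the heart of RoPE is easy to write out. A minimal NumPy sketch using one common pairing convention (implementations differ in how they pair dimensions, so treat this as illustrative rather than Gemma's exact layout). The key property: rotation preserves vector norms, and query-key dot products depend only on relative position:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary positional embedding for one vector x at integer position pos.
    Pairs dimension i with dimension i + d/2 (one common convention)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])
```

Checking that `rope(q, 3) · rope(k, 1)` equals `rope(q, 12) · rope(k, 10)` (same relative offset) is a good sanity test before the empirical long-context runs.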
RoPE paper (Su et al., 2021): https://arxiv.org/abs/2104.09864
Karpathy's 'Let's build GPT' positional encoding section: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2700s
Eleuther AI blog on positional encodings: https://blog.eleuther.ai/rotary-embeddings/
○CONCEPT
KV cache — what it is and why it matters for inference cost
→Local LLM Stack — measure KV cache memory consumption on Mac mini across different context lengths (512, 2048, 8192 tokens). Compare Gemma 3 4B vs Qwen 3 8B memory usage to see how model size affects cache — 2 hours
- Read the MLX documentation on KV cache management (10 minutes)
- Write a script that sends prompts of increasing length (512, 1024, 2048, 4096, 8192 tokens) to Gemma 3 via MLX
- Monitor memory usage with Activity Monitor or `memory_pressure` at each context length — record in a table
- Repeat with Qwen 3 8B via Ollama and compare memory growth curves
- Write up: 'Why KV cache matters for my local inference — memory budget planning'
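Before measuring, it helps to predict: per layer, every cached token keeps one key and one value vector per KV head, so cache size grows linearly with context length. A back-of-envelope sketch (the config values in the loop are hypothetical placeholders; read the real layer/head counts from your model's config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (512, 2048, 8192):
    # Placeholder architecture numbers, not Gemma's actual config:
    mb = kv_cache_bytes(n_layers=34, n_kv_heads=8, head_dim=128, seq_len=ctx) / 2**20
    print(f"{ctx:>5} tokens ~ {mb:7.1f} MiB")
```

Compare the predicted curve against what Activity Monitor reports; grouped-query attention (fewer KV heads than query heads) is exactly what shrinks this number.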
MLX LM documentation on caching: https://github.com/ml-explore/mlx-examples/tree/main/llms
Efficient Transformers survey (Tay et al.): https://arxiv.org/abs/2009.06732
○CONCEPT
Mixture of Experts (MoE) — how Mixtral and GPT-4 likely work
→AI Monitor — pull Mixtral 8x7B via Ollama on Mac mini and compare against Gemma 3 4B on your RSS relevance scoring task. MoE models activate fewer parameters per token — measure the speed/quality trade-off — 3 hours
- Read the Mixtral paper abstract + architecture section (20 minutes)
- Pull mixtral:8x7b via Ollama (if RAM allows) or use a smaller MoE model
- Run 20 AI Monitor RSS articles through both Mixtral and Gemma — score relevance, measure latency
- Compare: does MoE's sparse activation give better quality-per-second than a dense model?
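The sparse-activation idea can be sketched as a toy top-2 router for a single token, Mixtral-style (8 experts, 2 active). Random weights, purely illustrative, no relation to real Mixtral parameters:

```python
import numpy as np

def moe_layer(x, router_W, experts, top_k=2):
    """Toy MoE forward for one token: route to the top-k experts only.
    x: (d,), router_W: (d, n_experts), experts: list of (d, d) matrices."""
    logits = x @ router_W
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gate = np.exp(logits[top] - logits[top].max())
    gate = gate / gate.sum()                       # renormalised softmax over winners
    # Only top_k expert matmuls run — that is the sparse-compute saving.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 8
x = rng.normal(size=d)
router_W = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_layer(x, router_W, experts)
```

With 8 experts and top-2 routing, roughly a quarter of the expert parameters do work per token, which is why MoE throughput can beat a dense model of the same total size.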
Mixtral paper: https://arxiv.org/abs/2401.04088
Switch Transformers paper (the MoE foundation): https://arxiv.org/abs/2101.03961
Hugging Face MoE explainer: https://huggingface.co/blog/moe
○CONCEPT
Speculative decoding — how it speeds up inference
→Local LLM Stack — enable MLX speculative decoding using a small draft model (Gemma 2B) to speed up Gemma 3 4B generation. Measure the tokens/second improvement on your actual RAG queries — 2 hours
- Read the speculative decoding paper abstract + method (15 minutes)
- Download Gemma 2B as the draft model via MLX
- Configure MLX speculative decoding: Gemma 2B drafts, Gemma 3 4B verifies
- Benchmark: run 10 RAG queries with and without speculative decoding, measure tokens/second and output quality
- Document the speedup and when it helps most (long outputs vs short answers)
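The draft-and-verify loop can be illustrated with a toy greedy version. The deterministic functions stand in for the draft (Gemma 2B) and target (Gemma 3 4B) models; the key property is that greedy speculative decoding produces exactly the target model's output, just in fewer target calls when the draft agrees:

```python
def speculative_generate(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding. target/draft map a token list
    to the next token; the draft proposes k tokens, the target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Cheap draft model proposes k tokens in a row.
        ctx = list(out)
        proposed = []
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. Target verifies: accept agreeing tokens, substitute its own
        #    token at the first disagreement.
        for t in proposed:
            want = target(out)
            if t != want:
                out.append(want)
                break
            out.append(t)
    return out[len(prompt):len(prompt) + n_tokens]
```

Long, predictable outputs let the draft's proposals survive verification more often, which is why the speedup shows up most on long generations.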
Speculative decoding paper (Leviathan et al.): https://arxiv.org/abs/2211.17192
MLX speculative decoding example: https://github.com/ml-explore/mlx-examples/tree/main/llms
○PRACTICAL
Watch Karpathy's "Let's build GPT" and implement along with it
→Do it on the Mac mini. Train a tiny GPT on your iMessage corpus (already indexed in Chroma) — a character-level model that generates text in your writing style — 1 full afternoon (4 hours)
- Set aside a Saturday afternoon. Watch the full 2-hour video, pausing to code along in a Jupyter notebook
- Export your iMessage text from the Chroma personal_docs collection as training data
- Replace Shakespeare with your iMessage data — train a character-level GPT
- Generate sample outputs and compare with your actual writing style
- Write up: 'What I learned building a GPT from scratch' — key insights about attention, training, and loss
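Swapping Shakespeare for your own corpus (step 3) starts with nanoGPT-style character-level prep: build a vocabulary from the text and map characters to integer ids. A sketch, with a placeholder string standing in for the exported iMessage text:

```python
# Placeholder corpus; replace with the export from the Chroma collection.
text = "hello from my imessage history"

chars = sorted(set(text))                      # character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

vocab_size = len(chars)
data = encode(text)   # the integer sequence the GPT trains on
```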
Karpathy's 'Let's build GPT from scratch': https://www.youtube.com/watch?v=kCc8FmEb1nY
Karpathy's nanoGPT repo: https://github.com/karpathy/nanoGPT
MLX training examples: https://github.com/ml-explore/mlx-examples
○EVAL
MILESTONE: Explain transformer inference pipeline from token to probability
→Write a blog post or Obsidian note walking through exactly what happens when your local Gemma 3 4B processes a query — from raw text to tokenisation to embeddings to attention to output probabilities — 3 hours
- Draft the outline: tokenisation → embedding lookup → positional encoding → N transformer blocks → layer norm → logits → sampling
- For each step, reference the actual Gemma 3 config values (vocab size, embedding dim, num layers, num heads)
- Add diagrams showing tensor shapes at each stage
- Have someone technical read it and ask questions — if you can answer, you understand it
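The final stage of the outline, logits to a sampled token, is worth writing out for the post, since it is where temperature and top-k from earlier in this phase plug in. A minimal NumPy sketch (the logits are arbitrary example values):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, rng=None):
    """Logits -> probabilities -> one sampled token id.
    Temperature rescales logits before softmax; top-k masks everything
    outside the k highest-scoring tokens."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)   # masked tokens get probability 0
    z = z - z.max()                             # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p)
```

Low temperature collapses toward greedy argmax; `top_k=1` is exactly greedy decoding.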
Andrej Karpathy's 'Let's build GPT' as reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated GPT-2 by Jay Alammar: https://jalammar.github.io/illustrated-gpt2/