AI Expert Roadmap

194 total: 114 done, 41 partial, 39 todo
Phase 1 of 6

Foundations

Mental models for how AI systems work

58% complete: 15 done, 7 partial, 10 todo
CONCEPT
AI vs ML vs deep learning vs generative AI — the taxonomy
Write a taxonomy doc comparing your 12+ projects by AI type — classify each (Dosh=GenAI text+vision, ServicePulse=ML classification pipeline, FPL BlackBox=GenAI+edge ML, Local LLM Stack=self-hosted inference)
Time: 2 hours
  1. Create a spreadsheet with columns: project, AI type, model(s) used, input modality, output type
  2. Classify each project: pure GenAI, ML pipeline, hybrid, or rule-based with AI augmentation
  3. Write a 1-page summary explaining where each project sits on the AI taxonomy and why
  4. Identify gaps — which AI types are you not using yet (e.g. reinforcement learning)?
Google's ML Crash Course: https://developers.google.com/machine-learning/crash-course
Stanford CS229 lecture 1 (Andrew Ng) — the taxonomy explained in 20 minutes
CONCEPT
Neural networks: weights, activations, backpropagation
Follow Karpathy's Zero to Hero series, then train a tiny classifier on your Dosh transaction categories using PyTorch on the Mac mini
Time: 1 week of evenings
  1. Watch Karpathy's 'The spelled-out intro to neural networks' (2.5 hours) — code along in a Jupyter notebook
  2. Export 500 labelled Dosh transactions from SQLite with their categories
  3. Build a simple 2-layer neural network in PyTorch that classifies transactions by category
  4. Train it on Mac mini, plot loss curves, and compare accuracy to your Haiku prompt-based categoriser
  5. Write up what backpropagation actually does in plain English — test by explaining it to someone
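Step 3 in miniature — a two-layer network with the backward pass written out by hand, run on random stand-in data rather than the real Dosh export (the feature sizes and labels below are invented for illustration):

```python
import numpy as np

# Toy stand-in for the Dosh export: 4-dim feature vectors, 3 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)  # labels 0..2

# 2-layer network: 4 -> 16 -> 3
W1 = rng.normal(scale=0.1, size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 3)); b2 = np.zeros(3)
lr = 0.5

def forward(X):
    h = np.maximum(0, X @ W1 + b1)              # ReLU activations
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)  # softmax probabilities

losses = []
for step in range(300):
    h, p = forward(X)
    loss = -np.log(p[np.arange(len(y)), y]).mean()   # cross-entropy
    losses.append(loss)
    # Backpropagation: chain rule applied layer by layer, loss -> weights.
    dlogits = p.copy(); dlogits[np.arange(len(y)), y] -= 1; dlogits /= len(y)
    dW2 = h.T @ dlogits; db2 = dlogits.sum(0)
    dh = dlogits @ W2.T; dh[h <= 0] = 0              # gradient is zero where ReLU was off
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

accuracy = (forward(X)[1].argmax(1) == y).mean()
```

PyTorch's autograd does the backward section for you; writing it once by hand is what makes step 5's plain-English explanation easy.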
Karpathy's Neural Networks: Zero to Hero playlist: https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
3Blue1Brown 'Neural Networks' series (4 videos, ~1 hour total): https://www.3blue1brown.com/topics/neural-networks
PyTorch 60-minute blitz: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
CONCEPT
Supervised / unsupervised / reinforcement learning
Your Dosh categoriser is supervised learning (labelled transactions). Add k-means clustering (unsupervised) to automatically group similar transactions and discover spending patterns you haven't labelled yet
Time: 1 afternoon
  1. Export Dosh transactions with embeddings (or generate embeddings with nomic-embed-text on Mac mini)
  2. Run k-means clustering (scikit-learn) with k=10,20,30 and compare cluster quality with silhouette scores
  3. Visualise clusters with UMAP or t-SNE — do they match your existing categories?
  4. Write a comparison: supervised (your Haiku categoriser) vs unsupervised (clusters) — when would you use each?
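A minimal sketch of the sweep in steps 2–3, using synthetic blobs in place of the real nomic-embed-text embeddings (the sample counts, dimensions, and k values below are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for transaction embeddings (the real ones would come
# from nomic-embed-text): 600 points around 8 "spending pattern" centres.
X, _ = make_blobs(n_samples=600, centers=8, n_features=32, random_state=42)

# Sweep k and compare cluster quality via silhouette score (higher = better).
scores = {}
for k in (5, 8, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On real embeddings the silhouette curve is rarely this clean — the interesting part is comparing where it peaks against the number of categories you already use.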
scikit-learn clustering guide: https://scikit-learn.org/stable/modules/clustering.html
Google's ML Crash Course — Clustering module: https://developers.google.com/machine-learning/clustering
CONCEPT
Transformer architecture — why it matters
Watch Karpathy's 'Let's build GPT' while running inference on your Mac mini with MLX — pause at each architecture component and map it to what's happening in your local Gemma 3 4B model
Time: 1 afternoon
  1. Watch Karpathy's 'Let's build GPT from scratch' (2 hours) — take notes on each component
  2. Run Gemma 3 4B on Mac mini via MLX and inspect the model config (layers, heads, dimensions)
  3. Map paper concepts to real values: how many attention heads? What's the embedding dimension? How many parameters per layer?
  4. Write a 1-pager: 'What happens inside Gemma when I send it a prompt' — trace the full path
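For step 3, a back-of-envelope parameter count per transformer block. The config numbers here are placeholders, not Gemma 3 4B's real values (read those from the model's config.json), and the formula ignores grouped-query attention, biases, and norm weights:

```python
# Rough parameter count for one transformer block, using illustrative
# values — substitute the real numbers from the model's config.json.
d_model = 2560        # embedding dimension (hypothetical)
d_ff = 10240          # MLP hidden dimension (hypothetical)

# Attention: Q, K, V, and output projections, each d_model x d_model
# (real Gemma uses grouped-query attention, so K/V are smaller than this).
attn_params = 4 * d_model * d_model
# Gated MLP (as in Gemma/Llama): up, gate, and down projections.
mlp_params = 3 * d_model * d_ff
block_params = attn_params + mlp_params
```

Multiplying by the layer count and adding the embedding table should land within a few percent of the advertised parameter count — a good sanity check that you've read the config correctly.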
Karpathy's 'Let's build GPT': https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated Transformer by Jay Alammar: https://jalammar.github.io/illustrated-transformer/
CONCEPT
Tokens, embeddings, and vector spaces
CONCEPT
Temperature, top-p, top-k — how sampling works
CONCEPT
Context windows: limits, growth, implications
CONCEPT
Training vs inference — economics and trade-offs
Build a cost dashboard aggregating inference spend across all your projects (Anthropic, OpenAI, Azure, Cloudflare). Compare monthly inference cost vs what fine-tuning Gemma 4B on Mac mini would cost in electricity
Time: 2 hours
  1. Pull billing data from Anthropic Console, Azure Portal, and Cloudflare dashboard for the last 3 months
  2. Create a spreadsheet: project, provider, monthly tokens, monthly cost, cost per request
  3. Calculate Mac mini electricity cost for local inference (measure wattage during MLX inference, multiply by hours)
  4. Write a summary: 'My AI economics — where money goes and where local saves it'
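Steps 2–3 reduce to a few lines of arithmetic. Every number below is a placeholder to swap for your own billing data and a measured wattage figure:

```python
# Hosted inference cost vs Mac mini electricity — all values hypothetical.
monthly_tokens = 5_000_000       # tokens/month across projects (placeholder)
price_per_mtok = 1.00            # blended $/million tokens (placeholder)
api_cost = monthly_tokens / 1_000_000 * price_per_mtok

watts = 40                        # measured draw during MLX inference (placeholder)
hours = 50                        # inference hours per month (placeholder)
price_per_kwh = 0.30              # local electricity tariff (placeholder)
local_cost = watts / 1000 * hours * price_per_kwh   # kW * h * $/kWh
```

The comparison only becomes fair once you also account for quality differences and the Mac mini's fixed hardware cost — worth a line in the step 4 summary.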
Anthropic pricing page: https://www.anthropic.com/pricing
Analysis: 'The Inference Cost Equation' — search on AI Monitor RSS feeds
CONCEPT
Fine-tuning vs prompting vs RAG — when to use which
CONCEPT
RLHF — what it is and why it matters
Read the Constitutional AI paper (Anthropic, 2022) — you reference it in your prompt library constantly. Your ha-log-agent's risk-based approval tiers are a practical form of alignment via human feedback
Time: 1 hour
  1. Read the Constitutional AI paper — focus on sections 2 (method) and 4 (results), skip appendix (45 minutes)
  2. Map ha-log-agent's approval tiers to the paper's concepts: where is the 'constitution'? Where is the 'human feedback'?
  3. Write a short note: 'How RLHF connects to what I build' — reference Telegram bot, ha-log-agent, and prompt library
Constitutional AI paper: https://arxiv.org/abs/2212.08073
Anthropic's RLHF blog post: https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant
Chip Huyen's RLHF explainer: https://huyenchip.com/2023/05/02/rlhf.html
CONCEPT
GPT-4o/5, Claude, Gemini, Llama, Mistral, Qwen — capabilities and trade-offs
CONCEPT
Open vs closed source models
CONCEPT
Multimodal models — vision, audio, video
CONCEPT
Model benchmarks: MMLU, HumanEval, HELM, LMSys Arena
Run your FPL BlackBox chat prompt through 3 models (Claude Haiku, GPT-4o-mini, Llama 3.2 3B local), score outputs on accuracy and helpfulness. Compare your scores to public benchmarks to see where they align and diverge
Time: 1 afternoon
  1. Create 10 FPL test prompts covering different complexities (simple lookup, multi-step reasoning, opinion)
  2. Run each through Claude Haiku (FPL BlackBox), GPT-4o-mini (via LiteLLM), and Llama 3.2 3B (Mac mini)
  3. Score each response 1-5 on accuracy, helpfulness, and latency. Create a comparison table
  4. Look up the same models on LMSys Chatbot Arena and MMLU leaderboard — where do your findings agree/disagree?
  5. Write a 1-pager: 'What benchmarks miss — lessons from my own evaluation'
LMSys Chatbot Arena leaderboard: https://chat.lmsys.org/?leaderboard
HELM benchmark: https://crfm.stanford.edu/helm/latest/
Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
CONCEPT
Scaling laws — compute, data, parameters
Use your AI Monitor RSS feeds to flag scaling-laws papers. Read the Chinchilla paper and 2 follow-ups — your feeds already track DeepMind and Anthropic, who publish on this
Time: 3 hours
  1. Search AI Monitor for 'scaling laws' articles from the last 6 months — pick the 3 highest-scored
  2. Read the Chinchilla paper (Hoffmann et al., 2022) — focus on the key finding: optimal compute allocation
  3. Read one follow-up (e.g. 'Scaling Data-Constrained Language Models') to see how the field has evolved
  4. Write a summary: 'What scaling laws mean for my model choices' — why Gemma 4B exists, why Haiku is cheap
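The Chinchilla headline result fits in a few lines, under the common approximations that training compute C ≈ 6ND (N parameters, D tokens) and that compute-optimal training uses roughly D ≈ 20N tokens:

```python
import math

# Chinchilla rule of thumb: C ~ 6*N*D, and compute-optimal D ~ 20*N.
# Substituting gives C = 120*N^2, so N = sqrt(C/120).
def chinchilla_optimal(compute_flops):
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a 1e21 FLOP training budget.
n, d = chinchilla_optimal(1e21)
```

These are rough approximations from the paper's fitted curves, not exact constants — but they explain at a glance why small models like Gemma 4B are trained on far more tokens than older scaling practice suggested.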
Chinchilla paper: https://arxiv.org/abs/2203.15556
Scaling Data-Constrained LMs: https://arxiv.org/abs/2305.16264
Epoch AI scaling laws dashboard: https://epochai.org/data/notable-ai-models
CONCEPT
Reasoning models: o3, DeepSeek-R1, extended thinking
You already use extended thinking in the Telegram bot and run deepseek-r1:14b on the Mac mini. Run a structured comparison: the same 5 complex prompts through Claude extended thinking, DeepSeek-R1 14B local, and standard Claude Sonnet. Measure quality and latency
Time: 2 hours
  1. Design 5 test prompts requiring multi-step reasoning (e.g. 'Plan my FPL transfers considering fixture swings, budget, and bench coverage')
  2. Run each through: Claude Sonnet (no thinking), Claude Sonnet (extended thinking), DeepSeek-R1 14B (Mac mini)
  3. Score outputs on reasoning depth, accuracy, and time-to-response
  4. Document when extended thinking helps vs when it's overkill — create a decision framework
Anthropic extended thinking docs: https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
DeepSeek-R1 paper: https://arxiv.org/abs/2401.12474
CONCEPT
SLMs — Phi, Gemma, Qwen 0.5B — when small beats big
Your auto-router already uses Gemma 3 4B QAT. Benchmark it against llama3.2:3b and qwen3:8b on 20 of your actual RAG queries from the local doc watcher. Document when the smaller model wins on speed without losing quality
Time: 2 hours
  1. Export 20 recent queries from your local LLM stack logs (variety of topics)
  2. Run each through Gemma 3 4B (MLX), Llama 3.2 3B (Ollama), and Qwen 3 8B (Ollama)
  3. Measure: tokens/second, response quality (1-5), and RAM usage for each
  4. Create a decision matrix: for which query types does smaller win? Where does 8B quality matter?
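For the tokens/second column in step 3, Ollama's non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which is all you need. The request shown in the comment is a sketch and is not executed here:

```python
# Tokens/second from the fields Ollama returns in a non-streaming
# /api/generate response: eval_count and eval_duration (nanoseconds).
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# A request would look roughly like (not run here):
# requests.post("http://localhost:11434/api/generate",
#               json={"model": "llama3.2:3b", "prompt": q, "stream": False})
tps = tokens_per_second(eval_count=128, eval_duration_ns=4_000_000_000)
```

MLX reports generation speed differently (mlx_lm prints tokens/sec directly), so normalise both to the same metric before filling in the matrix.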
Microsoft Phi-3 technical report: https://arxiv.org/abs/2404.14219
Google Gemma 3 model card: https://ai.google.dev/gemma/docs
STRATEGIC
How to evaluate and select models for use cases
CONCEPT
Attention mechanism — self-attention, multi-head, cross-attention
Local LLM Stack — add attention weight visualisation to Open WebUI or use BertViz with a small model. MLX exposes attention weights from Gemma 3 during inference
Time: 1 afternoon
  1. Watch 3Blue1Brown's 'Attention in transformers, visually explained' (26 min)
  2. Install BertViz (pip install bertviz) and visualise attention patterns on a small BERT model with your own text
  3. Use MLX's model inspection to extract attention weights from Gemma 3 4B on a sample prompt
  4. Compare attention patterns across layers — which heads attend to syntax vs semantics?
  5. Write a 1-pager explaining self-attention, multi-head, and cross-attention with your own examples
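Before reaching for BertViz, the computation itself fits in a screen of NumPy — multi-head scaled dot-product self-attention on random vectors (untrained weights and invented sizes, purely illustrative):

```python
import numpy as np

# Multi-head self-attention in plain NumPy: the same computation BertViz
# visualises, on random (untrained) weights and toy dimensions.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

def split_heads(t):  # (seq, d_model) -> (heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)             # softmax: rows sum to 1
out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
```

The `weights` tensor is exactly what BertViz draws: one seq×seq attention map per head, each row a probability distribution over which earlier tokens this position attends to.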
3Blue1Brown 'Attention in transformers': https://www.youtube.com/watch?v=eMlx5fFNoYc
BertViz interactive attention visualisation: https://github.com/jessevig/bertviz
The Illustrated Transformer (attention section): https://jalammar.github.io/illustrated-transformer/
CONCEPT
Positional encoding — absolute, relative, RoPE
Karpathy's Zero to Hero covers positional encoding. Compare how Gemma 3 (which uses RoPE) handles long context versus older absolute-position models — your local stack can test this empirically
Time: 2 hours
  1. Watch the positional encoding section of Karpathy's 'Let's build GPT' (timestamp ~45 min)
  2. Read the RoPE paper abstract + method section (15 minutes)
  3. Test Gemma 3 4B on prompts with key info at positions 100, 1000, and 3000 tokens — does retrieval accuracy degrade?
  4. Compare with Llama 3.2 3B (also RoPE) at the same positions — do they behave similarly?
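A sketch of what RoPE actually does to a head's query/key vectors, using the half-split pairing layout (Llama-style) — illustrative, not the exact Gemma implementation:

```python
import numpy as np

# Rotary positional embedding (RoPE): each pair of dimensions is rotated
# by an angle proportional to the token position, so relative offsets
# between tokens show up as relative rotations between their vectors.
def rope(x, base=10000.0):
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # half-split pairing
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

x = np.random.default_rng(0).normal(size=(8, 16))
x_rot = rope(x)
```

Two properties fall out immediately: position 0 is unchanged (angle zero), and rotation preserves each vector's norm — RoPE encodes position without changing magnitudes, unlike added absolute embeddings.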
RoPE paper (Su et al., 2021): https://arxiv.org/abs/2104.09864
Karpathy's 'Let's build GPT' positional encoding section: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2700s
Eleuther AI blog on positional encodings: https://blog.eleuther.ai/rotary-embeddings/
CONCEPT
KV cache — what it is and why it matters for inference cost
Local LLM Stack — measure KV cache memory consumption on Mac mini across different context lengths (512, 2048, 8192 tokens). Compare Gemma 3 4B vs Qwen 3 8B memory usage to see how model size affects cache
Time: 2 hours
  1. Read the MLX documentation on KV cache management (10 minutes)
  2. Write a script that sends prompts of increasing length (512, 1024, 2048, 4096, 8192 tokens) to Gemma 3 via MLX
  3. Monitor memory usage with Activity Monitor or `memory_pressure` at each context length — record in a table
  4. Repeat with Qwen 3 8B via Ollama and compare memory growth curves
  5. Write up: 'Why KV cache matters for my local inference — memory budget planning'
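A rough cache-size estimator to sanity-check the measurements in step 3. The formula is the standard one (two tensors per layer, K and V, at 2 bytes per value for fp16/bf16), but the config values below are illustrative placeholders, not the real Gemma 3 4B numbers:

```python
# KV cache size: K and V tensors per layer, each seq_len x n_kv_heads x
# head_dim, at 2 bytes/value for fp16 or bf16. Config values are
# illustrative -- read the real ones from the model's config.json.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical 4B-class config at an 8192-token context:
size = kv_cache_bytes(n_layers=34, n_kv_heads=4, head_dim=256, seq_len=8192)
size_mb = size / 2**20
```

Note the cache grows linearly with context length but is independent of batch-1 model weights — which is why long contexts, not the model itself, can be what blows the Mac mini's memory budget.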
MLX LM documentation on caching: https://github.com/ml-explore/mlx-examples/tree/main/llms
Efficient Transformers survey (Tay et al.): https://arxiv.org/abs/2009.06732
CONCEPT
Mixture of Experts (MoE) — how Mixtral and GPT-4 likely work
AI Monitor — pull Mixtral 8x7B via Ollama on Mac mini and compare against Gemma 3 4B on your RSS relevance scoring task. MoE models activate fewer parameters per token — measure the speed/quality trade-off
Time: 3 hours
  1. Read the Mixtral paper abstract + architecture section (20 minutes)
  2. Pull mixtral:8x7b via Ollama (if RAM allows) or use a smaller MoE model
  3. Run 20 AI Monitor RSS articles through both Mixtral and Gemma — score relevance, measure latency
  4. Compare: does MoE's sparse activation give better quality-per-second than a dense model?
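The sparse-activation idea in miniature — a toy router that picks the top-k of 8 random "experts" per token, so only k expert matrices do work even though all 8 hold parameters (nothing here is Mixtral's actual code; every size is invented):

```python
import numpy as np

# Sparse Mixture of Experts, heavily simplified: a router scores every
# expert for each token, only the top-k run, and their outputs are mixed
# by softmax gate weights.
rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
router = rng.normal(scale=0.1, size=(d, n_experts))

def moe_forward(x):                      # x: (d,) a single token's vector
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]             # top-k expert indices
    gates = np.exp(logits[chosen]); gates /= gates.sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen)), chosen

y, used = moe_forward(rng.normal(size=d))
```

The trade-off you're measuring in step 3 is visible here: parameter count scales with `n_experts`, but per-token compute scales with `top_k`.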
Mixtral paper: https://arxiv.org/abs/2401.04088
Switch Transformers paper (the MoE foundation): https://arxiv.org/abs/2101.03961
Hugging Face MoE explainer: https://huggingface.co/blog/moe
CONCEPT
Speculative decoding — how it speeds up inference
Local LLM Stack — enable MLX speculative decoding using a small draft model (Gemma 2B) to speed up Gemma 3 4B generation. Measure the tokens/second improvement on your actual RAG queries
Time: 2 hours
  1. Read the speculative decoding paper abstract + method (15 minutes)
  2. Download Gemma 2B as the draft model via MLX
  3. Configure MLX speculative decoding: Gemma 2B drafts, Gemma 3 4B verifies
  4. Benchmark: run 10 RAG queries with and without speculative decoding, measure tokens/second and output quality
  5. Document the speedup and when it helps most (long outputs vs short answers)
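The control flow behind the draft-then-verify loop, sketched with toy integer "models": a greedy variant of the paper's method (the real algorithm accepts draft tokens probabilistically; here acceptance is exact-match, which keeps the structure visible):

```python
# Greedy speculative decoding in miniature: a cheap draft model proposes a
# block of tokens, the target model checks them, the longest agreeing
# prefix is kept, and the target supplies the next token itself.
def draft_next(seq):   # toy draft model: usually right, sometimes wrong
    return (seq[-1] + 1) % 100 if seq[-1] % 7 else (seq[-1] + 2) % 100

def target_next(seq):  # toy target model: the "ground truth" continuation
    return (seq[-1] + 1) % 100

def speculative_decode(seq, n_new, block=4):
    seq = list(seq)
    while len(seq) < n_new + 1:
        draft = []
        for _ in range(block):                  # draft proposes a block
            draft.append(draft_next(seq + draft))
        accepted = []
        for t in draft:                         # target verifies the block
            if t == target_next(seq + accepted):
                accepted.append(t)
            else:
                break                           # first disagreement stops it
        accepted.append(target_next(seq + accepted))  # target's own token
        seq.extend(accepted)
    return seq[:n_new + 1]

out = speculative_decode([1], n_new=12)
```

The speedup comes from the verification being one batched target pass instead of `block` sequential ones — which is why step 5 should find the biggest wins on long outputs where the draft agrees often.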
Speculative decoding paper (Leviathan et al.): https://arxiv.org/abs/2211.17192
MLX speculative decoding example: https://github.com/ml-explore/mlx-examples/tree/main/llms
PRACTICAL
Watch Karpathy's "Let's build GPT" and implement along with it
Do it on the Mac mini. Train a tiny GPT on your iMessage corpus (already indexed in Chroma) — a character-level model that generates text in your writing style
Time: 1 full afternoon (4 hours)
  1. Set aside a Saturday afternoon. Watch the full 2-hour video, pausing to code along in a Jupyter notebook
  2. Export your iMessage text from the Chroma personal_docs collection as training data
  3. Replace Shakespeare with your iMessage data — train a character-level GPT
  4. Generate sample outputs and compare with your actual writing style
  5. Write up: 'What I learned building a GPT from scratch' — key insights about attention, training, and loss
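The data-preparation part of steps 2–3 in miniature — the character-level vocabulary and encode/decode maps nanoGPT builds, shown on a stand-in string rather than your real iMessage export:

```python
# Character-level tokenisation, nanoGPT-style: the vocabulary is just the
# set of characters in the corpus, and encode/decode are table lookups.
corpus = "on my way! see you at five :)"   # stand-in for the iMessage export
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}     # char -> token id
itos = {i: ch for ch, i in stoi.items()}         # token id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("see you")
roundtrip = decode(ids)
```

On a real export the vocabulary will include emoji and punctuation you forgot you use — that vocabulary size becomes the model's output dimension, so it's worth printing before training.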
Karpathy's 'Let's build GPT from scratch': https://www.youtube.com/watch?v=kCc8FmEb1nY
Karpathy's nanoGPT repo: https://github.com/karpathy/nanoGPT
MLX training examples: https://github.com/ml-explore/mlx-examples
EVAL
MILESTONE: Explain transformer inference pipeline from token to probability
Write a blog post or Obsidian note walking through exactly what happens when your local Gemma 3 4B processes a query — from raw text to tokenisation to embeddings to attention to output probabilities
Time: 3 hours
  1. Draft the outline: tokenisation → embedding lookup → positional encoding → N transformer blocks → layer norm → logits → sampling
  2. For each step, reference the actual Gemma 3 config values (vocab size, embedding dim, num layers, num heads)
  3. Add diagrams showing tensor shapes at each stage
  4. Have someone technical read it and ask questions — if you can answer, you understand it
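The final 'logits → sampling' stage of the outline, as runnable code — temperature scaling plus top-k filtering. This is a generic sketch of the technique, not any particular library's sampler:

```python
import numpy as np

# Sampling the next token: temperature rescales the logits, top-k zeroes
# out everything outside the k most likely tokens, then we draw from the
# renormalised distribution.
def sample(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]             # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    p = np.exp(logits - logits.max())                # stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

logits = [2.0, 1.0, 0.5, -1.0]                       # toy vocabulary of 4
token, probs = sample(logits, temperature=0.7, top_k=2)
```

Tracing one real Gemma query ends exactly here: a vocab-sized probability vector, one draw, and the loop starts again with the chosen token appended.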
Andrej Karpathy's 'Let's build GPT' as reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated GPT-2 by Jay Alammar: https://jalammar.github.io/illustrated-gpt2/