AI Expert Roadmap

194 total: 114 done, 41 partial, 39 todo
Phase 1 of 6

Foundations

Mental models for how AI systems work

58% complete: 15 done, 7 partial, 10 todo
CONCEPT
AI vs ML vs deep learning vs generative AI — the taxonomy
Write a taxonomy doc comparing your 12+ projects by AI type — classify each (Dosh=GenAI text+vision, ServicePulse=ML classification pipeline, FPL BlackBox=GenAI+edge ML, Local LLM Stack=self-hosted inference)
Time: 2 hours
  1. Create a spreadsheet with columns: project, AI type, model(s) used, input modality, output type
  2. Classify each project: pure GenAI, ML pipeline, hybrid, or rule-based with AI augmentation
  3. Write a 1-page summary explaining where each project sits on the AI taxonomy and why
  4. Identify gaps — which AI types are you not using yet (e.g. reinforcement learning)?
Google's ML Crash Course: https://developers.google.com/machine-learning/crash-course
Stanford CS229 lecture 1 (Andrew Ng) — the taxonomy explained in 20 minutes
CONCEPT
Neural networks: weights, activations, backpropagation
Follow Karpathy's Zero to Hero series, then train a tiny classifier on your Dosh transaction categories using PyTorch on the Mac mini
Time: 1 week of evenings
  1. Watch Karpathy's 'The spelled-out intro to neural networks' (2.5 hours) — code along in a Jupyter notebook
  2. Export 500 labelled Dosh transactions from SQLite with their categories
  3. Build a simple 2-layer neural network in PyTorch that classifies transactions by category
  4. Train it on Mac mini, plot loss curves, and compare accuracy to your Haiku prompt-based categoriser
  5. Write up what backpropagation actually does in plain English — test by explaining it to someone
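Step 3 in miniature — a two-layer network with the backward pass written out by hand, run on random stand-in data rather than the real Dosh export (the feature sizes and labels below are invented for illustration):

```python
import numpy as np

# Toy stand-in for the Dosh export: 4-dim feature vectors, 3 categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)  # labels 0..2

# 2-layer network: 4 -> 16 -> 3
W1 = rng.normal(scale=0.1, size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 3)); b2 = np.zeros(3)
lr = 0.5

def forward(X):
    h = np.maximum(0, X @ W1 + b1)              # ReLU activations
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)  # softmax probabilities

losses = []
for step in range(300):
    h, p = forward(X)
    loss = -np.log(p[np.arange(len(y)), y]).mean()   # cross-entropy
    losses.append(loss)
    # Backpropagation: chain rule applied layer by layer, loss -> weights.
    dlogits = p.copy(); dlogits[np.arange(len(y)), y] -= 1; dlogits /= len(y)
    dW2 = h.T @ dlogits; db2 = dlogits.sum(0)
    dh = dlogits @ W2.T; dh[h <= 0] = 0              # gradient is zero where ReLU was off
    dW1 = X.T @ dh; db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

accuracy = (forward(X)[1].argmax(1) == y).mean()
```

PyTorch's autograd does the backward section for you; writing it once by hand is what makes step 5's plain-English explanation easy.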
Karpathy's Neural Networks: Zero to Hero playlist: https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
3Blue1Brown 'Neural Networks' series (4 videos, ~1 hour total): https://www.3blue1brown.com/topics/neural-networks
PyTorch 60-minute blitz: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
CONCEPT
Supervised / unsupervised / reinforcement learning
Your Dosh categoriser is supervised learning (labelled transactions). Add k-means clustering (unsupervised) to automatically group similar transactions and discover spending patterns you haven't labelled yet
Time: 1 afternoon
  1. Export Dosh transactions with embeddings (or generate embeddings with nomic-embed-text on Mac mini)
  2. Run k-means clustering (scikit-learn) with k=10,20,30 and compare cluster quality with silhouette scores
  3. Visualise clusters with UMAP or t-SNE — do they match your existing categories?
  4. Write a comparison: supervised (your Haiku categoriser) vs unsupervised (clusters) — when would you use each?
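A minimal sketch of the sweep in steps 2–3, using synthetic blobs in place of the real nomic-embed-text embeddings (the sample counts, dimensions, and k values below are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for transaction embeddings (the real ones would come
# from nomic-embed-text): 600 points around 8 "spending pattern" centres.
X, _ = make_blobs(n_samples=600, centers=8, n_features=32, random_state=42)

# Sweep k and compare cluster quality via silhouette score (higher = better).
scores = {}
for k in (5, 8, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On real embeddings the silhouette curve is rarely this clean — the interesting part is comparing where it peaks against the number of categories you already use.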
scikit-learn clustering guide: https://scikit-learn.org/stable/modules/clustering.html
Google's ML Crash Course — Clustering module: https://developers.google.com/machine-learning/clustering
CONCEPT
Transformer architecture — why it matters
Watch Karpathy's 'Let's build GPT' while running inference on your Mac mini with MLX — pause at each architecture component and map it to what's happening in your local Gemma 3 4B model
Time: 1 afternoon
  1. Watch Karpathy's 'Let's build GPT from scratch' (2 hours) — take notes on each component
  2. Run Gemma 3 4B on Mac mini via MLX and inspect the model config (layers, heads, dimensions)
  3. Map paper concepts to real values: how many attention heads? What's the embedding dimension? How many parameters per layer?
  4. Write a 1-pager: 'What happens inside Gemma when I send it a prompt' — trace the full path
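For step 3, a back-of-envelope parameter count per transformer block. The config numbers here are placeholders, not Gemma 3 4B's real values (read those from the model's config.json), and the formula ignores grouped-query attention, biases, and norm weights:

```python
# Rough parameter count for one transformer block, using illustrative
# values — substitute the real numbers from the model's config.json.
d_model = 2560        # embedding dimension (hypothetical)
d_ff = 10240          # MLP hidden dimension (hypothetical)

# Attention: Q, K, V, and output projections, each d_model x d_model
# (real Gemma uses grouped-query attention, so K/V are smaller than this).
attn_params = 4 * d_model * d_model
# Gated MLP (as in Gemma/Llama): up, gate, and down projections.
mlp_params = 3 * d_model * d_ff
block_params = attn_params + mlp_params
```

Multiplying by the layer count and adding the embedding table should land within a few percent of the advertised parameter count — a good sanity check that you've read the config correctly.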
Karpathy's 'Let's build GPT': https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated Transformer by Jay Alammar: https://jalammar.github.io/illustrated-transformer/
CONCEPT
Tokens, embeddings, and vector spaces
CONCEPT
Temperature, top-p, top-k — how sampling works
CONCEPT
Context windows: limits, growth, implications
CONCEPT
Training vs inference — economics and trade-offs
Build a cost dashboard aggregating inference spend across all your projects (Anthropic, OpenAI, Azure, Cloudflare). Compare monthly inference cost vs what fine-tuning Gemma 4B on Mac mini would cost in electricity
Time: 2 hours
  1. Pull billing data from Anthropic Console, Azure Portal, and Cloudflare dashboard for the last 3 months
  2. Create a spreadsheet: project, provider, monthly tokens, monthly cost, cost per request
  3. Calculate Mac mini electricity cost for local inference (measure wattage during MLX inference, multiply by hours)
  4. Write a summary: 'My AI economics — where money goes and where local saves it'
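Steps 2–3 reduce to a few lines of arithmetic. Every number below is a placeholder to swap for your own billing data and a measured wattage figure:

```python
# Hosted inference cost vs Mac mini electricity — all values hypothetical.
monthly_tokens = 5_000_000       # tokens/month across projects (placeholder)
price_per_mtok = 1.00            # blended $/million tokens (placeholder)
api_cost = monthly_tokens / 1_000_000 * price_per_mtok

watts = 40                        # measured draw during MLX inference (placeholder)
hours = 50                        # inference hours per month (placeholder)
price_per_kwh = 0.30              # local electricity tariff (placeholder)
local_cost = watts / 1000 * hours * price_per_kwh   # kW * h * $/kWh
```

The comparison only becomes fair once you also account for quality differences and the Mac mini's fixed hardware cost — worth a line in the step 4 summary.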
Anthropic pricing page: https://www.anthropic.com/pricing
Analysis: 'The Inference Cost Equation' — search on AI Monitor RSS feeds
CONCEPT
Fine-tuning vs prompting vs RAG — when to use which
CONCEPT
RLHF — what it is and why it matters
Read the Constitutional AI paper (Anthropic, 2022) — you reference it in your prompt library constantly. Your ha-log-agent's risk-based approval tiers are a practical form of alignment via human feedback
Time: 1 hour
  1. Read the Constitutional AI paper — focus on sections 2 (method) and 4 (results), skip appendix (45 minutes)
  2. Map ha-log-agent's approval tiers to the paper's concepts: where is the 'constitution'? Where is the 'human feedback'?
  3. Write a short note: 'How RLHF connects to what I build' — reference Telegram bot, ha-log-agent, and prompt library
Constitutional AI paper: https://arxiv.org/abs/2212.08073
Anthropic's RLHF blog post: https://www.anthropic.com/research/training-a-helpful-and-harmless-assistant
Chip Huyen's RLHF explainer: https://huyenchip.com/2023/05/02/rlhf.html
CONCEPT
GPT-4o/5, Claude, Gemini, Llama, Mistral, Qwen — capabilities and trade-offs
CONCEPT
Open vs closed source models
CONCEPT
Multimodal models — vision, audio, video
CONCEPT
Model benchmarks: MMLU, HumanEval, HELM, LMSys Arena
Run your FPL BlackBox chat prompt through 3 models (Claude Haiku, GPT-4o-mini, Llama 3.2 3B local), score outputs on accuracy and helpfulness. Compare your scores to public benchmarks to see where they align and diverge
Time: 1 afternoon
  1. Create 10 FPL test prompts covering different complexities (simple lookup, multi-step reasoning, opinion)
  2. Run each through Claude Haiku (FPL BlackBox), GPT-4o-mini (via LiteLLM), and Llama 3.2 3B (Mac mini)
  3. Score each response 1-5 on accuracy, helpfulness, and latency. Create a comparison table
  4. Look up the same models on LMSys Chatbot Arena and MMLU leaderboard — where do your findings agree/disagree?
  5. Write a 1-pager: 'What benchmarks miss — lessons from my own evaluation'
LMSys Chatbot Arena leaderboard: https://chat.lmsys.org/?leaderboard
HELM benchmark: https://crfm.stanford.edu/helm/latest/
Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
CONCEPT
Scaling laws — compute, data, parameters
Use your AI Monitor RSS feeds to flag scaling-laws papers. Read the Chinchilla paper and 2 follow-ups — your feeds already track DeepMind and Anthropic, who publish on this
Time: 3 hours
  1. Search AI Monitor for 'scaling laws' articles from the last 6 months — pick the 3 highest-scored
  2. Read the Chinchilla paper (Hoffmann et al., 2022) — focus on the key finding: optimal compute allocation
  3. Read one follow-up (e.g. 'Scaling Data-Constrained Language Models') to see how the field has evolved
  4. Write a summary: 'What scaling laws mean for my model choices' — why Gemma 4B exists, why Haiku is cheap
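The Chinchilla headline result fits in a few lines, under the common approximations that training compute C ≈ 6ND (N parameters, D tokens) and that compute-optimal training uses roughly D ≈ 20N tokens:

```python
import math

# Chinchilla rule of thumb: C ~ 6*N*D, and compute-optimal D ~ 20*N.
# Substituting gives C = 120*N^2, so N = sqrt(C/120).
def chinchilla_optimal(compute_flops):
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a 1e21 FLOP training budget.
n, d = chinchilla_optimal(1e21)
```

These are rough approximations from the paper's fitted curves, not exact constants — but they explain at a glance why small models like Gemma 4B are trained on far more tokens than older scaling practice suggested.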
Chinchilla paper: https://arxiv.org/abs/2203.15556
Scaling Data-Constrained LMs: https://arxiv.org/abs/2305.16264
Epoch AI scaling laws dashboard: https://epochai.org/data/notable-ai-models
CONCEPT
Reasoning models: o3, DeepSeek-R1, extended thinking
You already use extended thinking in the Telegram bot and run deepseek-r1:14b on the Mac mini. Run a structured comparison: the same 5 complex prompts through Claude extended thinking, DeepSeek-R1 14B local, and standard Claude Sonnet. Measure quality and latency
Time: 2 hours
  1. Design 5 test prompts requiring multi-step reasoning (e.g. 'Plan my FPL transfers considering fixture swings, budget, and bench coverage')
  2. Run each through: Claude Sonnet (no thinking), Claude Sonnet (extended thinking), DeepSeek-R1 14B (Mac mini)
  3. Score outputs on reasoning depth, accuracy, and time-to-response
  4. Document when extended thinking helps vs when it's overkill — create a decision framework
Anthropic extended thinking docs: https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
DeepSeek-R1 paper: https://arxiv.org/abs/2401.12474
CONCEPT
SLMs — Phi, Gemma, Qwen 0.5B — when small beats big
Your auto-router already uses Gemma 3 4B QAT. Benchmark it against llama3.2:3b and qwen3:8b on 20 of your actual RAG queries from the local doc watcher. Document when the smaller model wins on speed without losing quality
Time: 2 hours
  1. Export 20 recent queries from your local LLM stack logs (variety of topics)
  2. Run each through Gemma 3 4B (MLX), Llama 3.2 3B (Ollama), and Qwen 3 8B (Ollama)
  3. Measure: tokens/second, response quality (1-5), and RAM usage for each
  4. Create a decision matrix: for which query types does smaller win? Where does 8B quality matter?
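For the tokens/second column in step 3, Ollama's non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which is all you need. The request shown in the comment is a sketch and is not executed here:

```python
# Tokens/second from the fields Ollama returns in a non-streaming
# /api/generate response: eval_count and eval_duration (nanoseconds).
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# A request would look roughly like (not run here):
# requests.post("http://localhost:11434/api/generate",
#               json={"model": "llama3.2:3b", "prompt": q, "stream": False})
tps = tokens_per_second(eval_count=128, eval_duration_ns=4_000_000_000)
```

MLX reports generation speed differently (mlx_lm prints tokens/sec directly), so normalise both to the same metric before filling in the matrix.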
Microsoft Phi-3 technical report: https://arxiv.org/abs/2404.14219
Google Gemma 3 model card: https://ai.google.dev/gemma/docs
STRATEGIC
How to evaluate and select models for use cases
CONCEPT
Attention mechanism — self-attention, multi-head, cross-attention
Local LLM Stack — add attention weight visualisation to Open WebUI or use BertViz with a small model. MLX exposes attention weights from Gemma 3 during inference
Time: 1 afternoon
  1. Watch 3Blue1Brown's 'Attention in transformers, visually explained' (26 min)
  2. Install BertViz (pip install bertviz) and visualise attention patterns on a small BERT model with your own text
  3. Use MLX's model inspection to extract attention weights from Gemma 3 4B on a sample prompt
  4. Compare attention patterns across layers — which heads attend to syntax vs semantics?
  5. Write a 1-pager explaining self-attention, multi-head, and cross-attention with your own examples
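Before reaching for BertViz, the computation itself fits in a screen of NumPy — multi-head scaled dot-product self-attention on random vectors (untrained weights and invented sizes, purely illustrative):

```python
import numpy as np

# Multi-head self-attention in plain NumPy: the same computation BertViz
# visualises, on random (untrained) weights and toy dimensions.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

def split_heads(t):  # (seq, d_model) -> (heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)             # softmax: rows sum to 1
out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
```

The `weights` tensor is exactly what BertViz draws: one seq×seq attention map per head, each row a probability distribution over which earlier tokens this position attends to.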
3Blue1Brown 'Attention in transformers': https://www.youtube.com/watch?v=eMlx5fFNoYc
BertViz interactive attention visualisation: https://github.com/jessevig/bertviz
The Illustrated Transformer (attention section): https://jalammar.github.io/illustrated-transformer/
CONCEPT
Positional encoding — absolute, relative, RoPE
Karpathy's Zero to Hero covers positional encoding. Compare how Gemma 3 (which uses RoPE) handles long context versus older absolute-position models — your local stack can test this empirically
Time: 2 hours
  1. Watch the positional encoding section of Karpathy's 'Let's build GPT' (timestamp ~45 min)
  2. Read the RoPE paper abstract + method section (15 minutes)
  3. Test Gemma 3 4B on prompts with key info at positions 100, 1000, and 3000 tokens — does retrieval accuracy degrade?
  4. Compare with Llama 3.2 3B (also RoPE) at the same positions — do they behave similarly?
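A sketch of what RoPE actually does to a head's query/key vectors, using the half-split pairing layout (Llama-style) — illustrative, not the exact Gemma implementation:

```python
import numpy as np

# Rotary positional embedding (RoPE): each pair of dimensions is rotated
# by an angle proportional to the token position, so relative offsets
# between tokens show up as relative rotations between their vectors.
def rope(x, base=10000.0):
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # half-split pairing
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

x = np.random.default_rng(0).normal(size=(8, 16))
x_rot = rope(x)
```

Two properties fall out immediately: position 0 is unchanged (angle zero), and rotation preserves each vector's norm — RoPE encodes position without changing magnitudes, unlike added absolute embeddings.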
RoPE paper (Su et al., 2021): https://arxiv.org/abs/2104.09864
Karpathy's 'Let's build GPT' positional encoding section: https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2700s
Eleuther AI blog on positional encodings: https://blog.eleuther.ai/rotary-embeddings/
CONCEPT
KV cache — what it is and why it matters for inference cost
Local LLM Stack — measure KV cache memory consumption on Mac mini across different context lengths (512, 2048, 8192 tokens). Compare Gemma 3 4B vs Qwen 3 8B memory usage to see how model size affects cache
Time: 2 hours
  1. Read the MLX documentation on KV cache management (10 minutes)
  2. Write a script that sends prompts of increasing length (512, 1024, 2048, 4096, 8192 tokens) to Gemma 3 via MLX
  3. Monitor memory usage with Activity Monitor or `memory_pressure` at each context length — record in a table
  4. Repeat with Qwen 3 8B via Ollama and compare memory growth curves
  5. Write up: 'Why KV cache matters for my local inference — memory budget planning'
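A rough cache-size estimator to sanity-check the measurements in step 3. The formula is the standard one (two tensors per layer, K and V, at 2 bytes per value for fp16/bf16), but the config values below are illustrative placeholders, not the real Gemma 3 4B numbers:

```python
# KV cache size: K and V tensors per layer, each seq_len x n_kv_heads x
# head_dim, at 2 bytes/value for fp16 or bf16. Config values are
# illustrative -- read the real ones from the model's config.json.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical 4B-class config at an 8192-token context:
size = kv_cache_bytes(n_layers=34, n_kv_heads=4, head_dim=256, seq_len=8192)
size_mb = size / 2**20
```

Note the cache grows linearly with context length but is independent of batch-1 model weights — which is why long contexts, not the model itself, can be what blows the Mac mini's memory budget.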
MLX LM documentation on caching: https://github.com/ml-explore/mlx-examples/tree/main/llms
Efficient Transformers survey (Tay et al.): https://arxiv.org/abs/2009.06732
CONCEPT
Mixture of Experts (MoE) — how Mixtral and GPT-4 likely work
AI Monitor — pull Mixtral 8x7B via Ollama on Mac mini and compare against Gemma 3 4B on your RSS relevance scoring task. MoE models activate fewer parameters per token — measure the speed/quality trade-off
Time: 3 hours
  1. Read the Mixtral paper abstract + architecture section (20 minutes)
  2. Pull mixtral:8x7b via Ollama (if RAM allows) or use a smaller MoE model
  3. Run 20 AI Monitor RSS articles through both Mixtral and Gemma — score relevance, measure latency
  4. Compare: does MoE's sparse activation give better quality-per-second than a dense model?
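The sparse-activation idea in miniature — a toy router that picks the top-k of 8 random "experts" per token, so only k expert matrices do work even though all 8 hold parameters (nothing here is Mixtral's actual code; every size is invented):

```python
import numpy as np

# Sparse Mixture of Experts, heavily simplified: a router scores every
# expert for each token, only the top-k run, and their outputs are mixed
# by softmax gate weights.
rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]
router = rng.normal(scale=0.1, size=(d, n_experts))

def moe_forward(x):                      # x: (d,) a single token's vector
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]             # top-k expert indices
    gates = np.exp(logits[chosen]); gates /= gates.sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen)), chosen

y, used = moe_forward(rng.normal(size=d))
```

The trade-off you're measuring in step 3 is visible here: parameter count scales with `n_experts`, but per-token compute scales with `top_k`.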
Mixtral paper: https://arxiv.org/abs/2401.04088
Switch Transformers paper (the MoE foundation): https://arxiv.org/abs/2101.03961
Hugging Face MoE explainer: https://huggingface.co/blog/moe
CONCEPT
Speculative decoding — how it speeds up inference
Local LLM Stack — enable MLX speculative decoding using a small draft model (Gemma 2B) to speed up Gemma 3 4B generation. Measure the tokens/second improvement on your actual RAG queries
Time: 2 hours
  1. Read the speculative decoding paper abstract + method (15 minutes)
  2. Download Gemma 2B as the draft model via MLX
  3. Configure MLX speculative decoding: Gemma 2B drafts, Gemma 3 4B verifies
  4. Benchmark: run 10 RAG queries with and without speculative decoding, measure tokens/second and output quality
  5. Document the speedup and when it helps most (long outputs vs short answers)
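The control flow behind the draft-then-verify loop, sketched with toy integer "models": a greedy variant of the paper's method (the real algorithm accepts draft tokens probabilistically; here acceptance is exact-match, which keeps the structure visible):

```python
# Greedy speculative decoding in miniature: a cheap draft model proposes a
# block of tokens, the target model checks them, the longest agreeing
# prefix is kept, and the target supplies the next token itself.
def draft_next(seq):   # toy draft model: usually right, sometimes wrong
    return (seq[-1] + 1) % 100 if seq[-1] % 7 else (seq[-1] + 2) % 100

def target_next(seq):  # toy target model: the "ground truth" continuation
    return (seq[-1] + 1) % 100

def speculative_decode(seq, n_new, block=4):
    seq = list(seq)
    while len(seq) < n_new + 1:
        draft = []
        for _ in range(block):                  # draft proposes a block
            draft.append(draft_next(seq + draft))
        accepted = []
        for t in draft:                         # target verifies the block
            if t == target_next(seq + accepted):
                accepted.append(t)
            else:
                break                           # first disagreement stops it
        accepted.append(target_next(seq + accepted))  # target's own token
        seq.extend(accepted)
    return seq[:n_new + 1]

out = speculative_decode([1], n_new=12)
```

The speedup comes from the verification being one batched target pass instead of `block` sequential ones — which is why step 5 should find the biggest wins on long outputs where the draft agrees often.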
Speculative decoding paper (Leviathan et al.): https://arxiv.org/abs/2211.17192
MLX speculative decoding example: https://github.com/ml-explore/mlx-examples/tree/main/llms
PRACTICAL
Watch Karpathy's "Let's build GPT" and implement along with it
Do it on the Mac mini. Train a tiny GPT on your iMessage corpus (already indexed in Chroma) — a character-level model that generates text in your writing style
Time: 1 full afternoon (4 hours)
  1. Set aside a Saturday afternoon. Watch the full 2-hour video, pausing to code along in a Jupyter notebook
  2. Export your iMessage text from the Chroma personal_docs collection as training data
  3. Replace Shakespeare with your iMessage data — train a character-level GPT
  4. Generate sample outputs and compare with your actual writing style
  5. Write up: 'What I learned building a GPT from scratch' — key insights about attention, training, and loss
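The data-preparation part of steps 2–3 in miniature — the character-level vocabulary and encode/decode maps nanoGPT builds, shown on a stand-in string rather than your real iMessage export:

```python
# Character-level tokenisation, nanoGPT-style: the vocabulary is just the
# set of characters in the corpus, and encode/decode are table lookups.
corpus = "on my way! see you at five :)"   # stand-in for the iMessage export
chars = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(chars)}     # char -> token id
itos = {i: ch for ch, i in stoi.items()}         # token id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("see you")
roundtrip = decode(ids)
```

On a real export the vocabulary will include emoji and punctuation you forgot you use — that vocabulary size becomes the model's output dimension, so it's worth printing before training.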
Karpathy's 'Let's build GPT from scratch': https://www.youtube.com/watch?v=kCc8FmEb1nY
Karpathy's nanoGPT repo: https://github.com/karpathy/nanoGPT
MLX training examples: https://github.com/ml-explore/mlx-examples
EVAL
MILESTONE: Explain transformer inference pipeline from token to probability
Write a blog post or Obsidian note walking through exactly what happens when your local Gemma 3 4B processes a query — from raw text to tokenisation to embeddings to attention to output probabilities
Time: 3 hours
  1. Draft the outline: tokenisation → embedding lookup → positional encoding → N transformer blocks → layer norm → logits → sampling
  2. For each step, reference the actual Gemma 3 config values (vocab size, embedding dim, num layers, num heads)
  3. Add diagrams showing tensor shapes at each stage
  4. Have someone technical read it and ask questions — if you can answer, you understand it
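The final 'logits → sampling' stage of the outline, as runnable code — temperature scaling plus top-k filtering. This is a generic sketch of the technique, not any particular library's sampler:

```python
import numpy as np

# Sampling the next token: temperature rescales the logits, top-k zeroes
# out everything outside the k most likely tokens, then we draw from the
# renormalised distribution.
def sample(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]             # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    p = np.exp(logits - logits.max())                # stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

logits = [2.0, 1.0, 0.5, -1.0]                       # toy vocabulary of 4
token, probs = sample(logits, temperature=0.7, top_k=2)
```

Tracing one real Gemma query ends exactly here: a vocab-sized probability vector, one draw, and the loop starts again with the chosen token appended.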
Andrej Karpathy's 'Let's build GPT' as reference: https://www.youtube.com/watch?v=kCc8FmEb1nY
The Illustrated GPT-2 by Jay Alammar: https://jalammar.github.io/illustrated-gpt2/