Build hurdles, in the order I hit them
1. Service account IAM over-permissioning Google Cloud’s service account creation flow nudged me toward granting project-level “Editor” IAM. This gave the key way more power than needed — stolen key = full project access, not just one sheet. Fix: removed the IAM role entirely; sheet access comes from the sheet’s Share settings, not project IAM.
2. Docker volume mount: wrong -v syntax Wrote -v /opt/openwebui/prometheus.json:ro — missing the container-side path. Docker parsed ro as the container path and rejected it. Fix: proper three-part syntax -v HOST:CONTAINER:options.
3. Docker volume mount: silent directory creation Ran the container before the host file existed. Docker “helpfully” created an empty directory at the host path, then mounted that directory onto the container’s file path. Every subsequent run failed with “not a directory” because the phantom directory persisted in the named volume. Fix: clean the bad directory out of the volume with a throwaway Alpine container, ensure the real key file exists on the host before the mount.
4. VM clock drift from pausing Paused the VM multiple times; it came back up 6 days behind real time. Broke TLS everywhere — Docker image pulls failed with “certificate not yet valid”, then Google rejected JWTs for having impossible timestamps. Fix: sudo date -s to manually set the clock, install chrony for proper NTP sync, restart Docker to clear cached TLS state.
5. Time fix didn’t propagate to the running container Host clock corrected, but the Python gspread client instance inside OpenWebUI had cached the bad time in its credentials object. Fix: restart the container to force re-initialization.
6. Tool save failures from eager __init__ First version of the tool did gspread.service_account(...) in Tools.__init__. OpenWebUI instantiates the tool during save validation, so any auth hiccup = can’t save the tool at all. Fix: lazy-load the sheet client inside each method, wrap everything in try/except that returns error strings instead of raising.
7. Forgot to paste the actual sheet ID Left SHEET_ID = "paste-your-sheet-id-here" as the placeholder. Got 404s from Google. Trivial fix, embarrassing that it happened after so many other hurdles had been cleared.
8. Forgot to share the sheet with the service account Even with the right ID, Google returns 404 to service accounts that don’t have read access. Fix: shared the sheet with the service account’s email as Editor, with “Notify people” unchecked.
9. Model selection rabbit hole Discovered Qwen 2.5 7B ran at 2.2 tok/s on the Optiplex — theoretically better for tool calling, practically painful. Gemma variants were faster per-token but verbose. Real lesson: on CPU inference, wall-clock time to finished answer matters more than tok/s benchmarks, and smaller models often win in practice even when they’re “dumber.”
10. Tool call overhead surprise Learned that successful tool calls require TWO full model passes (decide-to-call + summarize-result), each paying full prompt-eval cost. On CPU, this makes the “successful” path slower than the “failed” path. Design implication: bulk operations > many small calls, terse system prompts matter a lot, pre-formatted tool return strings can skip the summarization pass entirely.
Meta-lesson
The actual AI/ML part was maybe 10% of the work. The other 90% was Docker, Linux time sync, Google Cloud IAM, credential management, and debugging through error messages from five different systems that didn’t know about each other. This is the real shape of AI tooling work — glue code and infrastructure, with the model itself being one small component that mostly just needs to not be sabotaged by the plumbing.
Single most valuable habit from this session
dateandtimedatectlas the first commands after resuming any paused VM. Saves hours.