Methods & Trust
How knowledge.hempin.org turns documents into cited answers. This page is a living spec of our pipeline, quality rules, costs, and legal posture.
RAG · Supabase + pgvector · Embeddings: text-embedding-3-small · Model: gpt-4o-mini · Citations on every live answer
Pipeline (today)
- Ingest — PDFs are uploaded to a private Supabase Storage bucket. We record metadata (title, publisher, year, language, jurisdiction, topics) in documents.
- Parse — Server code extracts text (pdf.js), normalizes Unicode, and splits it into overlapping chunks (~4.5k characters with a 400-character overlap). Results go to chunks.
- Embed — One embedding per chunk (1536 dimensions), stored in chunks.embedding. Embeddings are computed once per document version.
- Retrieve — A question is embedded and matched via a cosine-similarity search function (match_chunks) that fetches the top-K snippets.
- Compose — The model must answer only from the retrieved snippets and cite them inline as [#1], [#2]. The UI links each marker to the source document. (A minimal sketch of the retrieve-and-compose path follows this list.)
- FAQ cache — Reviewed answers can be promoted to an FAQ table with its own vector index so repeated queries are free and instant.
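To make the retrieve-and-compose path concrete, here is a minimal sketch in TypeScript. It assumes the Supabase JS client, the OpenAI Node SDK, a match_chunks function that accepts a query_embedding and a match_count (the real signature may differ), and a content column on chunks; treat it as an illustration of the flow, not our production code.

```ts
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";

// Environment variable names and RPC parameter names are assumptions for illustration.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);
const openai = new OpenAI();

async function answerQuestion(question: string, topK = 8) {
  // 1. Embed the question with the same model used for chunk embeddings.
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });

  // 2. Cosine search over chunk embeddings via the match_chunks function.
  const { data: chunks, error } = await supabase.rpc("match_chunks", {
    query_embedding: emb.data[0].embedding,
    match_count: topK,
  });
  if (error || !chunks) throw error ?? new Error("no matches");

  // 3. Compose: number the snippets so the model can cite them as [#n].
  const context = chunks
    .map((c: { content: string }, i: number) => `[#${i + 1}] ${c.content}`)
    .join("\n\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.2, // low temperature, per the quality notes below
    messages: [
      {
        role: "system",
        content:
          "Answer only from the numbered snippets below. Cite each claim inline as [#n]. " +
          "If the snippets are thin or conflicting, say so.\n\n" + context,
      },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0].message.content;
}
```

An FAQ cache hit would short-circuit this flow before step 2 by matching the question against the FAQ table's own vector index instead of chunks.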
Evidence & Citations
- Answers are grounded in retrieved chunks; thin or conflicting evidence is disclosed.
- Citations are chunk-level today. Page spans and quote highlighting are on the roadmap.
- We prefer primary sources (standards, laws, field trials, meta-analyses); grey literature is marked.
Quality & Limitations
- Parsing: scanned PDFs require OCR (planned); those won’t yield chunks yet.
- Coverage: the atlas “knows” only what’s ingested. Missing years/jurisdictions → missing answers.
- Model bias: low temperature & “answer from context” reduce hallucinations, but citations should always be checked.
- Conflicts: when sources disagree, we surface the split and avoid declaring a winner without criteria (date, N, design, jurisdiction).
Privacy & Security
- Uploads live in a private bucket; tables use Row-Level Security; admin routes use a server key.
- Public access is gated (basic auth / invite) while in early access.
- Questions and answers may be logged internally to improve the FAQ. Private uploads aren’t shared with anyone beyond the model providers we use for embeddings and answers.
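For illustration, the split between the public (anon-key) client and the server-only (service-role) client might look like the sketch below; the environment-variable names and the "documents" bucket name are assumptions, not our exact configuration.

```ts
import { createClient } from "@supabase/supabase-js";

// Browser/public client: uses the anon key, so Row-Level Security policies apply.
export const publicClient = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!
);

// Server-only client for admin routes: the service-role key bypasses RLS,
// so it must never reach the browser.
export const adminClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Example: uploading a PDF to the private bucket from a server route.
// The bucket name ("documents") is illustrative.
export async function uploadPdf(path: string, file: Buffer) {
  const { error } = await adminClient.storage
    .from("documents")
    .upload(path, file, { contentType: "application/pdf" });
  if (error) throw error;
}
```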
Cost Transparency
- Parsing is local (no per-token cost).
- Embeddings are a one-time cost per chunk; we re-embed only if the text or the embedding model changes.
- Each new question = 1 query embedding + 1 chat call. FAQ hits are free.
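As a back-of-the-envelope illustration of this cost shape, the estimator below treats the per-million-token rates as placeholders; the "e.g." values should be checked against current provider pricing.

```ts
// Placeholder rates (USD per 1M tokens): check current provider pricing.
const EMBED_RATE = 0.02;   // e.g. text-embedding-3-small
const CHAT_IN_RATE = 0.15; // e.g. gpt-4o-mini input
const CHAT_OUT_RATE = 0.6; // e.g. gpt-4o-mini output

// One-time ingestion cost: every chunk is embedded once per document version.
function ingestCost(totalChunkTokens: number): number {
  return (totalChunkTokens / 1_000_000) * EMBED_RATE;
}

// Per-question cost: one query embedding plus one chat call.
// FAQ cache hits skip both, so they cost nothing.
function questionCost(queryTokens: number, contextTokens: number, answerTokens: number): number {
  const embed = (queryTokens / 1_000_000) * EMBED_RATE;
  const chat =
    ((queryTokens + contextTokens) / 1_000_000) * CHAT_IN_RATE +
    (answerTokens / 1_000_000) * CHAT_OUT_RATE;
  return embed + chat;
}
```

The point is the asymmetry: ingestion is paid once per document version, questions are paid per query, and FAQ hits pay neither.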
Legal & Copyright
- Attribution by design: every live answer cites and links the underlying source.
- Source scope: we ingest public materials or items you upload with rights to use. For private/licensed content, access remains restricted to authorized users.
- Robots & terms: when crawling, we respect robots.txt and site terms. We store text snippets for search, not whole-work republication.
- Fair-use posture: the UI shows the short excerpts necessary for retrieval and citation. We avoid bulk display or redistribution of full PDFs unless they’re openly licensed.
- Takedown / opt-out: rights holders can request removal or stricter access by emailing our team.
- Not legal advice: outputs are informational and may be jurisdiction-specific. For compliance decisions, consult official texts and qualified counsel.
Roadmap
- OCR for scanned PDFs with language detection.
- Page-accurate citations + snippet highlighting.
- Trust/Stability signals (source age, study design, N, jurisdiction currency).
- Law versioning with “valid through” metadata per country/standard.
- Admin review → promote Live answers to FAQ with diff history.
- Graph/nebula view of topics, papers, and legal relationships.
“Why this answer?” (method notes)
We rank chunks by cosine similarity to your question embedding and cap K (default 8). The prompt instructs the model to use only those snippets. Inline markers [#n] map to ranked chunks so you can inspect the raw text in context.
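As an illustration of how the [#n] markers map back to ranked chunks, a hypothetical helper might rewrite each marker into a link; the document_id field and the /documents route are assumptions, not our actual schema.

```ts
type RankedChunk = { document_id: string };

// Rewrites each [#n] marker in the model's answer into a link to the
// source document of the n-th ranked chunk. Marker numbers are 1-based,
// matching the order in which snippets were sent to the model.
function linkCitations(answer: string, rankedChunks: RankedChunk[]): string {
  return answer.replace(/\[#(\d+)\]/g, (marker, n) => {
    const chunk = rankedChunks[Number(n) - 1];
    if (!chunk) return marker; // the model cited a snippet it wasn't given
    return `[#${n}](/documents/${chunk.document_id})`; // assumed UI route
  });
}
```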
For higher-stakes guidance (e.g., agronomy), we can enable a chain-of-critique mode that lists assumptions and evidence quality before recommendations.
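One way such a mode could be prompted (illustrative wording only, not a production prompt):

```ts
// Hypothetical system-prompt addition for chain-of-critique mode.
const CHAIN_OF_CRITIQUE = `
Before making any recommendation:
1. List the assumptions you are relying on.
2. Rate the evidence quality of each cited snippet (study design, sample size, date, jurisdiction).
3. Note where snippets conflict or coverage is thin.
Only then state the recommendation, citing snippets as [#n].
`;
```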