Methods & Trust
How knowledge.hempin.org turns documents into cited answers. This page is a living spec of our pipeline, quality rules, costs, and legal posture.
RAG · Supabase + pgvector · Embeddings: text-embedding-3-small · Model: gpt-4o-mini · Citations on every live answer
Pipeline (today)
- Ingest — PDFs are uploaded to a private Supabase Storage bucket. We record metadata (title, publisher, year, language, jurisdiction, topics) in documents.
- Parse — Server code extracts text (pdf.js), normalizes Unicode, and splits it into overlapping chunks (~4.5k characters with a 400-character overlap). Results go to chunks.
- Embed — One embedding per chunk (1536 dimensions), stored in chunks.embedding. Embeddings are computed once per document version.
- Retrieve — A question is embedded and matched via a cosine-similarity search function (match_chunks) that fetches the top-K snippets.
- Compose — The model must answer only from the retrieved snippets and cite them inline as [#1], [#2]. The UI links each marker to the source document. (A minimal sketch of the retrieve-and-compose path follows this list.)
- FAQ cache — Reviewed answers can be promoted to an FAQ table with its own vector index so repeated queries are free and instant.
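To make the retrieve-and-compose path concrete, here is a minimal sketch in TypeScript. It assumes the Supabase JS client, the OpenAI Node SDK, a match_chunks function that accepts a query_embedding and a match_count (the real signature may differ), and a content column on chunks; treat it as an illustration of the flow, not our production code.

```ts
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";

// Environment variable names and RPC parameter names are assumptions for illustration.
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);
const openai = new OpenAI();

async function answerQuestion(question: string, topK = 8) {
  // 1. Embed the question with the same model used for chunk embeddings.
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });

  // 2. Cosine search over chunk embeddings via the match_chunks function.
  const { data: chunks, error } = await supabase.rpc("match_chunks", {
    query_embedding: emb.data[0].embedding,
    match_count: topK,
  });
  if (error || !chunks) throw error ?? new Error("no matches");

  // 3. Compose: number the snippets so the model can cite them as [#n].
  const context = chunks
    .map((c: { content: string }, i: number) => `[#${i + 1}] ${c.content}`)
    .join("\n\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.2, // low temperature, per the quality notes below
    messages: [
      {
        role: "system",
        content:
          "Answer only from the numbered snippets below. Cite each claim inline as [#n]. " +
          "If the snippets are thin or conflicting, say so.\n\n" + context,
      },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0].message.content;
}
```

An FAQ cache hit would short-circuit this flow before step 2 by matching the question against the FAQ table's own vector index instead of chunks.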
Evidence & Citations
- Answers are grounded in retrieved chunks; thin or conflicting evidence is disclosed.
- Citations are chunk-level today. Page spans and quote highlighting are on the roadmap.
- We prefer primary sources (standards, laws, field trials, meta-analyses); grey literature is marked.
Quality & Limitations
- Parsing: scanned PDFs require OCR (planned); those won’t yield chunks yet.
- Coverage: the atlas “knows” only what’s ingested. Missing years/jurisdictions → missing answers.
- Model bias: low temperature & “answer from context” reduce hallucinations, but citations should always be checked.
- Conflicts: when sources disagree, we surface the split and avoid declaring a winner without criteria (date, N, design, jurisdiction).
Privacy & Security
- Uploads live in a private bucket; tables use Row-Level Security; admin routes use a server key.
- Public access is gated (basic auth / invite) while in early access.
- Questions and answers may be logged internally to improve the FAQ. Private uploads aren’t shared with anyone beyond the model providers we use for embeddings and answers.
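For illustration, the split between the public (anon-key) client and the server-only (service-role) client might look like the sketch below; the environment-variable names and the "documents" bucket name are assumptions, not our exact configuration.

```ts
import { createClient } from "@supabase/supabase-js";

// Browser/public client: uses the anon key, so Row-Level Security policies apply.
export const publicClient = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!
);

// Server-only client for admin routes: the service-role key bypasses RLS,
// so it must never reach the browser.
export const adminClient = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Example: uploading a PDF to the private bucket from a server route.
// The bucket name ("documents") is illustrative.
export async function uploadPdf(path: string, file: Buffer) {
  const { error } = await adminClient.storage
    .from("documents")
    .upload(path, file, { contentType: "application/pdf" });
  if (error) throw error;
}
```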
Cost Transparency
- Parsing is local (no per-token cost).
- Embeddings are a one-time cost per chunk; we re-embed only if the text or the embedding model changes.
- Each new question = 1 query embedding + 1 chat call. FAQ hits are free.
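As a back-of-the-envelope illustration of this cost shape, the estimator below treats the per-million-token rates as placeholders; the "e.g." values should be checked against current provider pricing.

```ts
// Placeholder rates (USD per 1M tokens): check current provider pricing.
const EMBED_RATE = 0.02;   // e.g. text-embedding-3-small
const CHAT_IN_RATE = 0.15; // e.g. gpt-4o-mini input
const CHAT_OUT_RATE = 0.6; // e.g. gpt-4o-mini output

// One-time ingestion cost: every chunk is embedded once per document version.
function ingestCost(totalChunkTokens: number): number {
  return (totalChunkTokens / 1_000_000) * EMBED_RATE;
}

// Per-question cost: one query embedding plus one chat call.
// FAQ cache hits skip both, so they cost nothing.
function questionCost(queryTokens: number, contextTokens: number, answerTokens: number): number {
  const embed = (queryTokens / 1_000_000) * EMBED_RATE;
  const chat =
    ((queryTokens + contextTokens) / 1_000_000) * CHAT_IN_RATE +
    (answerTokens / 1_000_000) * CHAT_OUT_RATE;
  return embed + chat;
}
```

The point is the asymmetry: ingestion is paid once per document version, questions are paid per query, and FAQ hits pay neither.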
Legal & Copyright
- Attribution by design: every live answer cites and links the underlying source.
- Source scope: we ingest public materials or items you upload with rights to use. For private/licensed content, access remains restricted to authorized users.
- Robots & terms: when crawling, we respect robots.txt and site terms. We store text snippets for search, not whole-work republication.
- Fair-use posture: the UI shows the short excerpts necessary for retrieval and citation. We avoid bulk display or redistribution of full PDFs unless they’re openly licensed.
- Takedown / opt-out: rights holders can request removal or stricter access by emailing our team.
- Not legal advice: outputs are informational and may be jurisdiction-specific. For compliance decisions, consult official texts and qualified counsel.
Roadmap
- OCR for scanned PDFs with language detection.
- Page-accurate citations + snippet highlighting.
- Trust/Stability signals (source age, study design, N, jurisdiction currency).
- Law versioning with “valid through” metadata per country/standard.
- Admin review → promote Live answers to FAQ with diff history.
- Graph/nebula view of topics, papers, and legal relationships.
“Why this answer?” (method notes)
We rank chunks by cosine similarity to your question embedding and cap K (default 8). The prompt instructs the model to use only those snippets. Inline markers [#n] map to ranked chunks so you can inspect the raw text in context.
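As an illustration of how the [#n] markers map back to ranked chunks, a hypothetical helper might rewrite each marker into a link; the document_id field and the /documents route are assumptions, not our actual schema.

```ts
type RankedChunk = { document_id: string };

// Rewrites each [#n] marker in the model's answer into a link to the
// source document of the n-th ranked chunk. Marker numbers are 1-based,
// matching the order in which snippets were sent to the model.
function linkCitations(answer: string, rankedChunks: RankedChunk[]): string {
  return answer.replace(/\[#(\d+)\]/g, (marker, n) => {
    const chunk = rankedChunks[Number(n) - 1];
    if (!chunk) return marker; // the model cited a snippet it wasn't given
    return `[#${n}](/documents/${chunk.document_id})`; // assumed UI route
  });
}
```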
For higher-stakes guidance (e.g., agronomy), we can enable a chain-of-critique mode that lists assumptions and evidence quality before recommendations.
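One way such a mode could be prompted (illustrative wording only, not a production prompt):

```ts
// Hypothetical system-prompt addition for chain-of-critique mode.
const CHAIN_OF_CRITIQUE = `
Before making any recommendation:
1. List the assumptions you are relying on.
2. Rate the evidence quality of each cited snippet (study design, sample size, date, jurisdiction).
3. Note where snippets conflict or coverage is thin.
Only then state the recommendation, citing snippets as [#n].
`;
```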