Methods & Trust

How knowledge.hempin.org turns documents into cited answers. This page is a living spec of our pipeline, quality rules, costs, and legal posture.

RAG · Supabase + pgvector · Embeddings: text-embedding-3-small · Model: gpt-4o-mini · Citations on every live answer

Pipeline (today)

  1. Ingest — PDFs are uploaded to a private Supabase Storage bucket. We record metadata (title, publisher, year, language, jurisdiction, topics) in the documents table.
  2. Parse — Server code extracts text (pdf.js), normalizes Unicode, and splits it into overlapping chunks (~4.5k chars, 400-char overlap). Results go to chunks.
  3. Embed — One 1536-dimension embedding per chunk, stored in chunks.embedding and computed once per document version (see the sketch after this list).
  4. Retrieve — A question is embedded and matched via a cosine search function (match_chunks) to fetch the top-K snippets.
  5. Compose — The model must answer only from the retrieved snippets and cite them inline as [#1], [#2]. The UI links each marker to the source document.
  6. FAQ cache — Reviewed answers can be promoted to an FAQ table with its own vector index so repeated queries are free and instant.
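
The parse and embed steps can be pictured with a minimal TypeScript sketch. chunkText and embedChunks are illustrative names, not the production helpers; it assumes the official openai Node client and the chunking parameters listed above.

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const CHUNK_SIZE = 4500; // ~4.5k chars per chunk
const OVERLAP = 400;     // chars shared between neighboring chunks

// Split normalized document text into overlapping windows (step 2).
function chunkText(text: string): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += CHUNK_SIZE - OVERLAP) {
    chunks.push(text.slice(start, start + CHUNK_SIZE));
    if (start + CHUNK_SIZE >= text.length) break;
  }
  return chunks;
}

// One 1536-dim embedding per chunk, computed once per document version (step 3).
async function embedChunks(chunks: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,
  });
  return res.data.map((d) => d.embedding);
}
```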

Evidence & Citations

  • Answers are grounded in retrieved chunks; thin or conflicting evidence is disclosed.
  • Citations are chunk-level today. Page spans and quote highlighting are on the roadmap.
  • We prefer primary sources (standards, laws, field trials, meta-analyses); grey literature is marked.

Quality & Limitations

  • Parsing: scanned PDFs require OCR (planned); until OCR ships they yield no chunks.
  • Coverage: the atlas “knows” only what’s ingested. Missing years/jurisdictions → missing answers.
  • Model bias: a low temperature and the “answer only from context” instruction reduce hallucinations, but citations should always be checked.
  • Conflicts: when sources disagree, we surface the split and avoid declaring a winner without criteria (date, N, design, jurisdiction).

Privacy & Security

  • Uploads live in a private bucket; tables use Row-Level Security; admin routes use a server key.
  • Public access is gated (basic auth / invite) while in early access; a sketch of the gate follows this list.
  • Questions and answers may be logged internally to improve the FAQ. Private uploads aren’t shared with anyone beyond the model providers used for embeddings and answers.
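
A sketch of what the early-access gate can look like. The function name, env vars, and placement are assumptions, not the production code; it checks HTTP Basic credentials before serving a page.

```ts
// Illustrative early-access gate (assumed shape). Returns a 401 challenge
// for unauthenticated requests, or null to let the request through.
export function gate(req: Request): Response | null {
  const header = req.headers.get("authorization") ?? "";
  const [scheme, encoded] = header.split(" ");
  const expected = Buffer.from(
    `${process.env.GATE_USER}:${process.env.GATE_PASS}`
  ).toString("base64");
  if (scheme !== "Basic" || encoded !== expected) {
    return new Response("Authentication required", {
      status: 401,
      headers: { "WWW-Authenticate": 'Basic realm="early-access"' },
    });
  }
  return null; // authorized; continue to the app
}
```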

Cost Transparency

  • Parsing is local (no per-token cost).
  • Embeddings are a one-time cost per chunk; we re-embed only if the text or the embedding model changes.
  • Each new question = 1 query embedding + 1 chat call; FAQ hits are free. A rough cost sketch follows this list.
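
A hedged back-of-envelope model of those costs. The per-token prices are assumptions that go stale; check the provider's current pricing before relying on them.

```ts
// Assumed prices in USD per 1M tokens: verify against current OpenAI pricing.
const EMBED_PER_M = 0.02;    // text-embedding-3-small (assumed)
const CHAT_IN_PER_M = 0.15;  // gpt-4o-mini input (assumed)
const CHAT_OUT_PER_M = 0.6;  // gpt-4o-mini output (assumed)

// One-time cost to embed a document's chunks.
function embedCost(totalTokens: number): number {
  return (totalTokens / 1_000_000) * EMBED_PER_M;
}

// Marginal cost of one non-FAQ question.
function questionCost(
  queryTokens: number,
  promptTokens: number,
  answerTokens: number
): number {
  return (
    (queryTokens / 1_000_000) * EMBED_PER_M +
    (promptTokens / 1_000_000) * CHAT_IN_PER_M +
    (answerTokens / 1_000_000) * CHAT_OUT_PER_M
  );
}
```

Under these assumed prices, embedding a 200k-token document once costs about $0.004, and a question with an 8-snippet prompt (~10k tokens in, ~500 out) lands well under a cent. FAQ hits cost nothing.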

Legal & Copyright

  • Attribution by design: every live answer cites and links the underlying source.
  • Source scope: we ingest public materials, or items you upload with the rights to use them. For private/licensed content, access remains restricted to authorized users.
  • Robots & terms: when crawling, we respect robots.txt and site terms. We store text snippets for search, not whole-work republication.
  • Fair-use posture: the UI shows short excerpts necessary for retrieval + citation. We avoid bulk display or redistribution of full PDFs unless they’re openly licensed.
  • Takedown / opt-out: rights holders can request removal or stricter access by sending an email to our team.
  • Not legal advice: outputs are informational and may be jurisdiction-specific. For compliance decisions, consult official texts and qualified counsel.

Roadmap

  • OCR for scanned PDFs with language detection.
  • Page-accurate citations + snippet highlighting.
  • Trust/Stability signals (source age, study design, N, jurisdiction currency).
  • Law versioning with “valid through” metadata per country/standard.
  • Admin review → promote Live answers to FAQ with diff history.
  • Graph/nebula view of topics, papers, and legal relationships.

“Why this answer?” (method notes)

We rank chunks by cosine similarity to your question embedding and cap K (default 8). The prompt instructs the model to use only those snippets. Inline markers [#n] map to ranked chunks so you can inspect the raw text in context.
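
A sketch of retrieve + compose under the same assumptions; the match_chunks signature and the snippet shape here are illustrative, not the exact production schema.

```ts
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

async function answer(question: string, k = 8): Promise<string> {
  // 1. Embed the question with the same model used for the chunks.
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });

  // 2. Cosine search over chunks.embedding via the match_chunks function.
  const { data, error } = await supabase.rpc("match_chunks", {
    query_embedding: emb.data[0].embedding,
    match_count: k,
  });
  if (error) throw error;

  // 3. Number the snippets so the model can cite them as [#1], [#2], …
  const context = (data ?? [])
    .map((s: { content: string }, i: number) => `[#${i + 1}] ${s.content}`)
    .join("\n\n");

  const chat = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.1,
    messages: [
      {
        role: "system",
        content:
          "Answer ONLY from the numbered snippets. Cite inline as [#n]. " +
          "If the evidence is thin or conflicting, say so.",
      },
      { role: "user", content: `Snippets:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  return chat.choices[0].message.content ?? "";
}
```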

For higher-stakes guidance (e.g., agronomy), we can enable a chain-of-critique mode that lists assumptions and evidence quality before recommendations.
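
Purely illustrative: one possible shape for that mode's system prompt. The feature is not live, and this wording is an assumption.

```ts
// Assumed prompt for a future chain-of-critique mode; not production text.
const CHAIN_OF_CRITIQUE_PROMPT = `
Before recommending anything:
1. List the assumptions you are making.
2. Rate each cited snippet's evidence quality (study design, sample size N,
   recency, jurisdiction).
3. Only then give the recommendation, flagging any step that rests on a
   weak source.
Answer only from the numbered snippets and cite as [#n].
`.trim();
```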