Building a Smart Compliance Assistant with Gemini and Semantic Search
Compliance documentation is often the intellectual equivalent of a root canal. Necessary? Absolutely. Enjoyable? Let’s just say there’s room for improvement. If you’ve ever had to comb through GDPR policies, data protection agreements, or internal risk documents, you’ll know that extracting meaningful insights quickly is like panning for gold in a river of jargon.
I recently completed the 5-day GenAI course run by Google and Kaggle, and my capstone project for the course is the AI Policy Advisor: a nimble, document-savvy assistant powered by Google’s Gemini API, semantic embeddings, and a pinch of automation magic.
This blog post outlines how we built a smart assistant capable of analysing compliance policies, surfacing risks, and offering structured summaries. The goal? Free up legal and compliance teams from soul-crushing document review and let AI shoulder the burden. We’ll explore the motivation behind the project, walk through the technical components, and show you how it all fits together in a fast, intelligent pipeline.
The Problem: Compliance Overload
Regulatory and compliance work is text-heavy and time-consuming. The GDPR alone runs to 88 pages of tightly worded regulation. Multiply that by the various industry-specific frameworks (HIPAA, ISO 27001, FCA regulations) and you have a documentation nightmare. Policy teams often need to:
- Identify clauses related to a specific regulation (e.g., GDPR’s Right to Erasure)
- Summarise implications for the business
- Highlight potential risk areas
This is not only laborious but also prone to oversight. In a world increasingly driven by data, the cost of missing a compliance detail can be astronomical.
So I asked: what if we could offload that initial document triage to a smart AI assistant?
The Ingredients: Gemini, Embeddings, and a Smart Pipeline
To build our assistant, we wanted three key ingredients:
- Semantic understanding of document content
- Structured, human-like responses
- A pipeline to orchestrate retrieval, reasoning, and reporting
Let’s break it down.
Step 1: Loading and Preprocessing Documents
We start with a folder full of compliance policies and load them into memory. Each document is treated as a text blob:
```python
import os
import pandas as pd

def load_documents(folder_path="./corpus"):
    # Read every .txt policy in the folder into a list of records
    docs = []
    for fname in os.listdir(folder_path):
        if fname.endswith(".txt"):
            with open(os.path.join(folder_path, fname), "r", encoding="utf-8") as f:
                docs.append({"filename": fname, "content": f.read()})
    return pd.DataFrame(docs)

corpus_df = load_documents()
```
Each file becomes a row in a DataFrame, complete with filename and content.
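If you want to sanity-check the load step before moving on, a quick inspection of the DataFrame (assuming a ./corpus folder holding a handful of .txt policies) might look like this:

```python
# Quick sanity check on the loaded corpus (assumes ./corpus contains .txt files)
print(corpus_df.shape)                             # (number of documents, 2)
print(corpus_df["filename"].tolist())              # which policies were picked up
print(corpus_df["content"].str.len().describe())   # rough sense of document lengths
```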
Step 2: Semantic Search with Sentence Embeddings
We use the excellent sentence-transformers library to convert each document into a high-dimensional vector using the all-MiniLM-L6-v2 model. This lets us compare documents based on meaning, not keywords.
```python
from sentence_transformers import SentenceTransformer

# Encode every document into a dense vector that captures its meaning
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_df["embedding"] = corpus_df["content"].apply(lambda x: embed_model.encode(x))
```
For a given query like “What are the GDPR risks?”, we compute its embedding and retrieve the top-matching documents:
```python
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, top_k=3):
    # Embed the query and score every document by cosine similarity
    q_emb = embed_model.encode(query).reshape(1, -1)
    similarities = corpus_df["embedding"].apply(
        lambda emb: cosine_similarity(q_emb, emb.reshape(1, -1))[0][0]
    )
    # Keep the top_k most similar documents
    top_matches = corpus_df.loc[similarities.nlargest(top_k).index]
    return top_matches[["filename", "content"]]
```
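As a quick illustration (the query string below is just an example, and the filenames returned depend entirely on what sits in your ./corpus folder), you can check which policies the retriever surfaces before involving Gemini at all:

```python
# Example query; results depend on the documents in ./corpus
matches = semantic_search("right to erasure requests", top_k=2)
print(matches["filename"].tolist())
```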
Step 3: Prompt Engineering and Gemini
Now the fun part: crafting a prompt for Gemini. We tell the model to behave like a compliance assistant and ask it to structure the output.
```python
def prompt_summary(document_text):
    return f"""
You are a compliance assistant. Summarise the following document and identify any potential risks related to GDPR:

{document_text}

Provide structured output in the format:
Summary:
Risks:
Recommended Actions:
"""
```
We pass this to Gemini using Google’s Generative AI client:
```python
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")

def gemini_response(prompt):
    return model.generate_content(prompt).text
```
Gemini returns rich, human-readable output that reads like a professional wrote it (because in a way, one did).
Step 4: The Agent Pipeline
We bundle it all into a single pipeline function:
```python
def agent_pipeline(user_query):
    # Retrieve the most relevant policies, stitch them together, and summarise
    relevant_docs = semantic_search(user_query)
    combined_text = "\n---\n".join(relevant_docs["content"].tolist())
    prompt = prompt_summary(combined_text)
    return gemini_response(prompt)
```
And voilà. With one line, you can extract insights from a haystack of compliance docs:
```python
response = agent_pipeline("What risks are covered in the GDPR policy?")
print(response)
```
Why This Works
This capstone project works because it leverages several key principles of modern AI:
- Vector-based retrieval gives us relevance beyond keywords.
- Prompt-based generation allows for flexibility and expressiveness.
- Structured prompting ensures consistency and downstream usability.
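That last point about downstream usability is easier to see with a small example. Here is a minimal parsing sketch, not part of the capstone itself, that assumes Gemini honours the Summary / Risks / Recommended Actions headings requested by the prompt (worth guarding against the case where it doesn’t):

```python
def parse_sections(response_text):
    # Split the response into the three sections requested by the prompt.
    # Assumes the headings appear verbatim; sections stay empty otherwise.
    sections = {"Summary": "", "Risks": "", "Recommended Actions": ""}
    current = None
    for line in response_text.splitlines():
        stripped = line.strip()
        matched = next((h for h in sections if stripped.startswith(h + ":")), None)
        if matched:
            current = matched
            sections[current] = stripped[len(matched) + 1:].strip()
        elif current:
            sections[current] += ("\n" if sections[current] else "") + line
    return sections

parsed = parse_sections(response)
print(parsed["Risks"])
```

From there, the Risks section can feed a tracker or a report without any manual copy-and-paste.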
The assistant doesn’t just summarise; it interprets. It spots risks, outlines actions, and adapts to your question.
Imagine asking: “What should we do next based on these risks?” and the assistant scheduling a task or drafting an email. We’re almost there.
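That follow-up step isn’t built yet, but a crude sketch of the idea, reusing the gemini_response helper from earlier, might look like this (the draft_follow_up function and its prompt are purely illustrative):

```python
def draft_follow_up(risk_report):
    # Purely illustrative: feed the first report back to Gemini and ask for next steps
    prompt = f"""
You are a compliance assistant. Based on the risk report below, draft a short
email to the data protection officer listing the recommended next actions.

{risk_report}
"""
    return gemini_response(prompt)

email_draft = draft_follow_up(response)
print(email_draft)
```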
Final Remarks
AI isn’t here to replace compliance officers, legal counsel, or policy analysts. But it can absolutely take the grunt work off their plates. With a well-crafted prompt, semantic search, and a generative engine like Gemini, we can build assistants that read, reason, and report at speed.
The AI Policy Advisor capstone project is more than a demo. It’s a proof of concept for how AI can become a colleague—a second set of eyes on dense documentation, a proactive analyst, and a structured summariser that works at the speed of thought.
So the next time you’re staring down a 30-page data policy, maybe let the assistant take the first pass. You’ve earned it.
Here is a video to accompany this post: