kmds.search package

Submodules

kmds.search.semantic_index module

Semantic index for KMDS knowledge bases.

Embeds observation findings from an RDF knowledge graph into a vector database so that natural language queries can retrieve relevant findings.

Technology choices

sentence-transformers (all-MiniLM-L6-v2 by default): fast, open-source embedding model with a large community footprint.
ChromaDB: open-source, Python-native vector database that supports both in-memory (ephemeral) and on-disk (persistent) operation.

Example usage

from kmds.search import SemanticIndex

# Build and persist an index from a KB file
idx = SemanticIndex(persist_dir="./my_index")
idx.build("path/to/kb.xml")

# Query the index
results = idx.search("missing values in intake data", n_results=5)
for r in results:
    print(r["obs_type"], r["finding"])

# Load a previously persisted index and search without rebuilding
idx2 = SemanticIndex(persist_dir="./my_index")
results = idx2.search("imputation strategy")

class kmds.search.semantic_index.SemanticIndex(persist_dir: str | None = None, model_name: str = 'all-MiniLM-L6-v2')

Bases: object

Semantic vector index over a KMDS knowledge base.

Parameters:

persist_dir – Directory on disk where ChromaDB will persist the index. If None, the index lives in memory only and is lost when the object is garbage collected.
model_name – HuggingFace sentence-transformers model used for embedding. Defaults to "all-MiniLM-L6-v2", a fast, lightweight model with good semantic retrieval quality.

DEFAULT_MODEL: str = 'all-MiniLM-L6-v2'

build(kb_path: str) → None

Load a KMDS knowledge-base file and index all observations.

Parameters: kb_path – Path (or URL) to an RDF/OWL knowledge-base file produced by KMDS (*.xml).

build_from_onto(onto: Any) → None

Index all observations from an already-loaded ontology.

Parameters: onto – An owlready2 Ontology object previously loaded with kmds.utils.load_utils.load_kb().

clear() → None: Remove all documents from the index.

count() → int: Return the number of documents currently in the index.

search(query: str, n_results: int = 5) → list[dict]

Retrieve the most semantically similar observations for query.

Parameters:

query – Natural language search string.
n_results – Maximum number of results to return.

Returns:

Each dict contains:

finding: The original observation text.
obs_type: Observation category (e.g. "Data Quality Observation").
workflow_name: Name of the workflow the observation belongs to.
finding_seq: Original sequence number of the finding (-1 for the workflow description entry).
distance: Cosine distance from the query vector (lower = more similar).
intent (optional): Intent tag when present on the observation.

Return type:

list[dict]
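The distance field can be made concrete with a small sketch. It assumes the common convention that cosine distance equals 1 minus cosine similarity; the toy vectors and the helper name cosine_distance are illustrative and not part of KMDS.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (assumed convention for `distance`)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real models produce far more dimensions.
query_vec = [1.0, 0.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Sort ascending: the smallest distance is the closest match.
ranked = sorted(
    candidates,
    key=lambda name: cosine_distance(query_vec, candidates[name]),
)
```

Under this convention the distance depends only on direction, not on vector length, which is why the vector pointing the same way as the query ranks first even though it is twice as long.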

kmds.search.search_orchestrator module

LLM-driven search orchestrator for KMDS knowledge bases.

The orchestrator uses an LLM as a router to map a natural language user query to one of the structured observation-category search templates, executes the matched template, and synthesises the raw results into a concise natural language answer.

If the query cannot be mapped to any specific template, the system falls back to semantic vector search across all indexed observations (ChromaDB + sentence-transformers).

Architecture

Step 1 – Context Injection

A tool description string that lists every available API template and its purpose is injected into the LLM routing prompt so the model always knows its options.

Step 2 – Intent Classification & Entity Extraction (LLM router)

The LLM returns a Pydantic-validated JSON payload identifying:

intent_class – which observation-category template to invoke.
filters – optional parameters extracted from the query text (obs type, keyword, sequence range).
explanation – one-sentence rationale for transparency.

Step 3 – Template Execution

The corresponding KMDS API function is called with the extracted filters.

Step 4 – LLM Synthesis

The LLM converts the raw observation records into a readable answer.

Step 5 – Semantic Fallback (catch-all)

When no template matches, or when the LLM or a template returns no results, the SemanticIndex is queried instead.
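The five steps can be sketched end to end with stand-ins. This is a rough illustration under stated assumptions, not the KMDS implementation: the RECORDS data, the TEMPLATES table, semantic_fallback, and the prompt wording are all hypothetical, and the fallback uses naive word overlap in place of a real vector search.

```python
import json
from typing import Callable

# Hypothetical records keyed by template name (made-up data, not the KMDS API).
RECORDS: dict[str, list[dict]] = {
    "exploratory_search": [{"finding": "Missing values in intake data"}],
    "model_selection_search": [{"finding": "Chose gradient boosting"}],
}

# Each template is a zero-argument callable returning observation records.
TEMPLATES: dict[str, Callable[[], list[dict]]] = {
    name: (lambda name=name: list(RECORDS[name])) for name in RECORDS
}


def semantic_fallback(query: str) -> list[dict]:
    # Placeholder for a vector search: naive word overlap with the finding.
    words = set(query.lower().split())
    every = [r for recs in RECORDS.values() for r in recs]
    return [r for r in every if words & set(r["finding"].lower().split())]


def ask(query: str, llm_fn: Callable[[str], str]) -> dict:
    # Step 1: inject the list of available templates into the routing prompt.
    prompt = f"Templates: {sorted(TEMPLATES)}\nQuery: {query}\nReply with JSON."
    # Step 2: the router returns a JSON payload naming the intent.
    try:
        intent = json.loads(llm_fn(prompt)).get("intent_class", "semantic_search")
    except (json.JSONDecodeError, AttributeError):
        intent = "semantic_search"
    # Step 3: run the matched template, if there is one.
    template = TEMPLATES.get(intent)
    results = template() if template else []
    # Step 5: fall back to semantic search when nothing was found.
    if not results:
        intent, results = "semantic_search", semantic_fallback(query)
    # Step 4: synthesis (a real system would call the LLM again here).
    answer = "; ".join(r["finding"] for r in results) or "No findings."
    return {"intent_class": intent, "answer": answer, "results": results}
```

Passing a lambda that returns '{"intent_class": "model_selection_search"}' as llm_fn exercises the template path, while a backend that returns non-JSON text exercises the fallback path.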

class kmds.search.search_orchestrator.OrchestratorResult(answer: str, intent_class: str, route_explanation: str, results: list[dict[str, Any]])

Bases: object

Result returned by SearchOrchestrator.ask().

answer: Synthesised natural language answer.

intent_class: The search template that was ultimately executed.

route_explanation: The LLM’s own explanation for its routing choice.

results: Raw observation record dicts that informed the answer.

class kmds.search.search_orchestrator.OrchestratorRoute(*, intent_class: Literal['exploratory_search', 'data_representation_search', 'modelling_choice_search', 'model_selection_search', 'all_observations_search', 'semantic_search'], filters: SearchFilters = SearchFilters(obs_type_filter=None, finding_seq_min=None, finding_seq_max=None, keyword=None), explanation: str = '')

Bases: BaseModel

Structured output produced by the LLM router.

The LLM is instructed to return a JSON object that conforms to this schema. Pydantic validates and coerces the payload before execution.

explanation: str: Brief explanation of why this route was chosen (surfaced to the caller).

filters: SearchFilters: Optional query parameters extracted from the query text.

intent_class: IntentClass: Which search template best matches the user query.

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
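To show the shape of a payload the router is expected to return, here is a minimal sketch that checks one by hand. The library itself relies on Pydantic for this; the standard-library stand-in below, including the helper name parse_route and the sample payload, is an assumption made for illustration only.

```python
import json

# Intent names copied from the OrchestratorRoute signature.
INTENT_CLASSES = {
    "exploratory_search",
    "data_representation_search",
    "modelling_choice_search",
    "model_selection_search",
    "all_observations_search",
    "semantic_search",
}
FILTER_FIELDS = ("obs_type_filter", "finding_seq_min", "finding_seq_max", "keyword")


def parse_route(raw: str) -> dict:
    """Parse a router reply, filling defaults and rejecting unknown intents."""
    payload = json.loads(raw)
    intent = payload.get("intent_class")
    if intent not in INTENT_CLASSES:
        raise ValueError(f"unknown intent_class: {intent!r}")
    given = payload.get("filters", {})
    # Missing filter fields default to None, mirroring SearchFilters.
    filters = {name: given.get(name) for name in FILTER_FIELDS}
    return {
        "intent_class": intent,
        "filters": filters,
        "explanation": payload.get("explanation", ""),
    }


# Made-up router reply used only to exercise the helper.
raw = (
    '{"intent_class": "exploratory_search", '
    '"filters": {"keyword": "missing"}, '
    '"explanation": "Illustrative rationale."}'
)
route = parse_route(raw)
```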

class kmds.search.search_orchestrator.SearchFilters(*, obs_type_filter: str | None = None, finding_seq_min: int | None = None, finding_seq_max: int | None = None, keyword: str | None = None)

Bases: BaseModel

Optional parameters the LLM may extract from the user query.

All fields are optional. Any field left as None is ignored during the post-retrieval filtering step.

finding_seq_max: int | None: Include only observations with finding_seq <= this value.

finding_seq_min: int | None: Include only observations with finding_seq >= this value.

keyword: str | None: Additional keyword to filter the finding text (case-insensitive substring).

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

obs_type_filter: str | None: Substring to match against the observation-type label (case-insensitive).
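The post-retrieval filtering described above might look roughly like the following. This is a sketch under assumptions: Filters is a standard-library stand-in that mirrors the SearchFilters fields, apply_filters is a hypothetical helper, and the record keys are borrowed from the dicts returned by SemanticIndex.search().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Filters:
    """Stdlib stand-in mirroring the SearchFilters fields."""
    obs_type_filter: Optional[str] = None
    finding_seq_min: Optional[int] = None
    finding_seq_max: Optional[int] = None
    keyword: Optional[str] = None


def apply_filters(records: list[dict], f: Filters) -> list[dict]:
    """Keep records that satisfy every filter that is not None."""
    kept = []
    for r in records:
        if f.obs_type_filter is not None and f.obs_type_filter.lower() not in r["obs_type"].lower():
            continue
        if f.finding_seq_min is not None and r["finding_seq"] < f.finding_seq_min:
            continue
        if f.finding_seq_max is not None and r["finding_seq"] > f.finding_seq_max:
            continue
        if f.keyword is not None and f.keyword.lower() not in r["finding"].lower():
            continue
        kept.append(r)
    return kept


# Made-up records for illustration.
records = [
    {"obs_type": "Data Quality Observation", "finding_seq": 1, "finding": "Missing values in age"},
    {"obs_type": "Data Quality Observation", "finding_seq": 4, "finding": "Duplicate rows found"},
    {"obs_type": "Modelling Observation", "finding_seq": 2, "finding": "Missing interaction terms"},
]

hits = apply_filters(
    records,
    Filters(obs_type_filter="data quality", keyword="MISSING", finding_seq_max=3),
)
```

A Filters() instance with every field left as None passes all records through unchanged, matching the statement that None fields are ignored.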

class kmds.search.search_orchestrator.SearchOrchestrator(kb_path: str, *, persist_dir: str | None = None, llm_fn: Callable[[str], str] | None = None, model: str = 'gemini-1.5-flash', embedding_model: str = 'all-MiniLM-L6-v2', n_results: int = 5)

Bases: object

LLM-driven search orchestrator for a KMDS knowledge base.

The orchestrator routes natural language queries through an LLM to identify the best search template, executes it against the loaded knowledge base, and synthesises the results into a natural language answer.

Parameters:

kb_path – Path to the KMDS .xml knowledge-base file.
persist_dir – Directory to persist the semantic vector index. None keeps the index in memory (rebuilt on each interpreter session).
llm_fn – Optional callable (prompt: str) -> str for your own LLM backend. If None, the orchestrator uses Google GenAI (requires GOOGLE_API_KEY environment variable).
model – Google GenAI model name (ignored when llm_fn is supplied).
embedding_model – Sentence-transformers model used for the semantic fallback index.
n_results – Default maximum number of observation records returned per query.

Examples

Using Google GenAI (default):

import os
os.environ["GOOGLE_API_KEY"] = "your-key"

from kmds.search import SearchOrchestrator

orc = SearchOrchestrator("my_project.xml", persist_dir="./idx")
result = orc.ask("What data quality issues were found?")
print(result.answer)

Using a custom LLM backend:

def my_llm(prompt: str) -> str:
    # call any LLM here
    return my_model.generate(prompt)

orc = SearchOrchestrator("my_project.xml", llm_fn=my_llm)
result = orc.ask("Which model was selected and why?")
print(result.answer)
print(result.results)      # raw records

ask(query: str) → OrchestratorResult

Route a natural language query and return a synthesised answer.

This is the single public entry point for the orchestrator. It performs all five steps internally (context injection, routing, execution, synthesis, fallback) and returns an OrchestratorResult.

Parameters: query – Free-form natural language question about the knowledge base.
Return type: OrchestratorResult
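Because llm_fn is just a callable from prompt string to reply string, it can be wrapped for offline testing or prompt inspection. The sketch below is one possible approach, not part of KMDS: make_recording_llm and canned_backend are hypothetical names, and the canned reply format is an assumption about what the router accepts.

```python
from typing import Callable


def make_recording_llm(
    inner: Callable[[str], str],
) -> tuple[Callable[[str], str], list[str]]:
    """Wrap an llm_fn-style callable and log every prompt it receives."""
    prompts: list[str] = []

    def llm_fn(prompt: str) -> str:
        prompts.append(prompt)
        return inner(prompt)

    return llm_fn, prompts


def canned_backend(prompt: str) -> str:
    # Fixed reply for offline tests; the JSON shape is an assumption.
    return '{"intent_class": "semantic_search"}'


llm_fn, seen = make_recording_llm(canned_backend)
reply = llm_fn("Which model was selected and why?")
```

The wrapped llm_fn keeps the (prompt: str) -> str shape, so it could be passed as the llm_fn argument of SearchOrchestrator, and seen can be inspected afterwards to check which prompts were sent.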

Module contents

class kmds.search.OrchestratorResult(answer: str, intent_class: str, route_explanation: str, results: list[dict[str, Any]])

Bases: object

Result returned by SearchOrchestrator.ask().

answer: Synthesised natural language answer.

intent_class: The search template that was ultimately executed.

route_explanation: The LLM’s own explanation for its routing choice.

results: Raw observation record dicts that informed the answer.

class kmds.search.OrchestratorRoute(*, intent_class: Literal['exploratory_search', 'data_representation_search', 'modelling_choice_search', 'model_selection_search', 'all_observations_search', 'semantic_search'], filters: SearchFilters = SearchFilters(obs_type_filter=None, finding_seq_min=None, finding_seq_max=None, keyword=None), explanation: str = '')

Bases: BaseModel

Structured output produced by the LLM router.

The LLM is instructed to return a JSON object that conforms to this schema. Pydantic validates and coerces the payload before execution.

explanation: str: Brief explanation of why this route was chosen (surfaced to the caller).

filters: SearchFilters: Optional query parameters extracted from the query text.

intent_class: IntentClass: Which search template best matches the user query.

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kmds.search.SearchFilters(*, obs_type_filter: str | None = None, finding_seq_min: int | None = None, finding_seq_max: int | None = None, keyword: str | None = None)

Bases: BaseModel

Optional parameters the LLM may extract from the user query.

All fields are optional. Any field left as None is ignored during the post-retrieval filtering step.

finding_seq_max: int | None: Include only observations with finding_seq <= this value.

finding_seq_min: int | None: Include only observations with finding_seq >= this value.

keyword: str | None: Additional keyword to filter the finding text (case-insensitive substring).

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

obs_type_filter: str | None: Substring to match against the observation-type label (case-insensitive).

class kmds.search.SearchOrchestrator(kb_path: str, *, persist_dir: str | None = None, llm_fn: Callable[[str], str] | None = None, model: str = 'gemini-1.5-flash', embedding_model: str = 'all-MiniLM-L6-v2', n_results: int = 5)

Bases: object

LLM-driven search orchestrator for a KMDS knowledge base.

The orchestrator routes natural language queries through an LLM to identify the best search template, executes it against the loaded knowledge base, and synthesises the results into a natural language answer.

Parameters:

kb_path – Path to the KMDS .xml knowledge-base file.
persist_dir – Directory to persist the semantic vector index. None keeps the index in memory (rebuilt on each interpreter session).
llm_fn – Optional callable (prompt: str) -> str for your own LLM backend. If None, the orchestrator uses Google GenAI (requires GOOGLE_API_KEY environment variable).
model – Google GenAI model name (ignored when llm_fn is supplied).
embedding_model – Sentence-transformers model used for the semantic fallback index.
n_results – Default maximum number of observation records returned per query.

Examples

Using Google GenAI (default):

import os
os.environ["GOOGLE_API_KEY"] = "your-key"

from kmds.search import SearchOrchestrator

orc = SearchOrchestrator("my_project.xml", persist_dir="./idx")
result = orc.ask("What data quality issues were found?")
print(result.answer)

Using a custom LLM backend:

def my_llm(prompt: str) -> str:
    # call any LLM here
    return my_model.generate(prompt)

orc = SearchOrchestrator("my_project.xml", llm_fn=my_llm)
result = orc.ask("Which model was selected and why?")
print(result.answer)
print(result.results)      # raw records

ask(query: str) → OrchestratorResult

Route a natural language query and return a synthesised answer.

This is the single public entry point for the orchestrator. It performs all five steps internally (context injection, routing, execution, synthesis, fallback) and returns an OrchestratorResult.

Parameters: query – Free-form natural language question about the knowledge base.
Return type: OrchestratorResult

class kmds.search.SemanticIndex(persist_dir: str | None = None, model_name: str = 'all-MiniLM-L6-v2')

Bases: object

Semantic vector index over a KMDS knowledge base.

Parameters:

persist_dir – Directory on disk where ChromaDB will persist the index. If None, the index lives in memory only and is lost when the object is garbage collected.
model_name – HuggingFace sentence-transformers model used for embedding. Defaults to "all-MiniLM-L6-v2", a fast, lightweight model with good semantic retrieval quality.

DEFAULT_MODEL: str = 'all-MiniLM-L6-v2'

build(kb_path: str) → None

Load a KMDS knowledge-base file and index all observations.

Parameters: kb_path – Path (or URL) to an RDF/OWL knowledge-base file produced by KMDS (*.xml).

build_from_onto(onto: Any) → None

Index all observations from an already-loaded ontology.

Parameters: onto – An owlready2 Ontology object previously loaded with kmds.utils.load_utils.load_kb().

clear() → None: Remove all documents from the index.

count() → int: Return the number of documents currently in the index.

search(query: str, n_results: int = 5) → list[dict]

Retrieve the most semantically similar observations for query.

Parameters:

query – Natural language search string.
n_results – Maximum number of results to return.

Returns:

Each dict contains:

finding: The original observation text.
obs_type: Observation category (e.g. "Data Quality Observation").
workflow_name: Name of the workflow the observation belongs to.
finding_seq: Original sequence number of the finding (-1 for the workflow description entry).
distance: Cosine distance from the query vector (lower = more similar).
intent (optional): Intent tag when present on the observation.

Return type:

list[dict]