kmds.search package
Submodules
kmds.search.semantic_index module
Semantic index for KMDS knowledge bases.
Embeds observation findings from an RDF knowledge graph into a vector database so that natural language queries can retrieve relevant findings.
Technology choices
sentence-transformers (all-MiniLM-L6-v2 by default): fast, open-source embedding model with a large community footprint.
ChromaDB: open-source, Python-native vector database that supports both in-memory (ephemeral) and on-disk (persistent) operation.
Example usage
from kmds.search import SemanticIndex
# Build and persist an index from a KB file
idx = SemanticIndex(persist_dir="./my_index")
idx.build("path/to/kb.xml")
# Query the index
results = idx.search("missing values in intake data", n_results=5)
for r in results:
    print(r["obs_type"], r["finding"])
# Load a previously persisted index and search without rebuilding
idx2 = SemanticIndex(persist_dir="./my_index")
results = idx2.search("imputation strategy")
- class kmds.search.semantic_index.SemanticIndex(persist_dir: str | None = None, model_name: str = 'all-MiniLM-L6-v2')
Bases: object
Semantic vector index over a KMDS knowledge base.
- Parameters:
persist_dir – Directory on disk where ChromaDB will persist the index. If None, the index lives in memory only and is lost when the object is garbage collected.
model_name – HuggingFace sentence-transformers model used for embedding. Defaults to "all-MiniLM-L6-v2", a fast, lightweight model with good semantic retrieval quality.
- DEFAULT_MODEL: str = 'all-MiniLM-L6-v2'
- build(kb_path: str) -> None
Load a KMDS knowledge-base file and index all observations.
- Parameters:
kb_path – Path (or URL) to an RDF/OWL knowledge-base file produced by KMDS (*.xml).
- build_from_onto(onto: Any) -> None
Index all observations from an already-loaded ontology.
- Parameters:
onto – An owlready2 Ontology object previously loaded with kmds.utils.load_utils.load_kb().
- clear() -> None
Remove all documents from the index.
- count() -> int
Return the number of documents currently in the index.
- search(query: str, n_results: int = 5) -> list[dict]
Retrieve the most semantically similar observations for query.
- Parameters:
query – Natural language search string.
n_results – Maximum number of results to return.
- Returns:
Each dict contains:
finding – The original observation text.
obs_type – Observation category (e.g. "Data Quality Observation").
workflow_name – Name of the workflow the observation belongs to.
finding_seq – Original sequence number of the finding (-1 for the workflow description entry).
distance – Cosine distance from the query vector (lower = more similar).
intent (optional) – Intent tag when present on the observation.
- Return type:
list[dict]
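The distance field above is described as a cosine distance, with lower values meaning closer matches. The following is a minimal sketch of that ordering on toy vectors. It does not use ChromaDB or sentence-transformers, and the function name cosine_distance and the three-dimensional vectors are illustrative assumptions, not part of the KMDS API.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
query_vec = [1.0, 0.0, 0.0]
docs = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Lower distance = more similar, so sort ascending.
ranked = sorted(docs, key=lambda name: cosine_distance(query_vec, docs[name]))
print(ranked)
```

Because the measure depends only on direction, the vector [2.0, 0.0, 0.0] is at distance 0 from the query even though its length differs.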
kmds.search.search_orchestrator module
LLM-driven search orchestrator for KMDS knowledge bases.
The orchestrator uses an LLM as a router to map a natural language user query to one of the structured observation-category search templates, executes the matched template, and synthesises the raw results into a concise natural language answer.
If the query cannot be mapped to any specific template, the system falls back to semantic vector search across all indexed observations (ChromaDB + sentence-transformers).
Architecture
- Step 1 – Context Injection
A tool description string that lists every available API template and its purpose is injected into the LLM routing prompt so the model always knows its options.
- Step 2 – Intent Classification & Entity Extraction (LLM router)
The LLM returns a Pydantic-validated JSON payload identifying:
intent_class – which observation-category template to invoke.
filters – optional parameters extracted from the query text (obs type, keyword, sequence range).
explanation – one-sentence rationale for transparency.
- Step 3 – Template Execution
The corresponding KMDS API function is called with the extracted filters.
- Step 4 – LLM Synthesis
The LLM converts the raw observation records into a readable answer.
- Step 5 – Semantic Fallback (catch-all)
When no template matches, or when the LLM or a template returns no results, the SemanticIndex is queried instead.
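The five steps above can be sketched as a small control flow. This is an illustration under stated assumptions, not the orchestrator's actual implementation: the TEMPLATES table, semantic_fallback, the prompt wording, the record contents, and the two stub LLM functions are all hypothetical, and the fallback here uses naive word overlap in place of a real vector search.

```python
import json

# Hypothetical stand-in for the KMDS template functions.
TEMPLATES = {
    "all_observations_search": lambda records, filters: list(records),
}

def semantic_fallback(records, query):
    """Placeholder for a vector search: naive word overlap on the finding text."""
    words = query.lower().split()
    return [r for r in records if any(w in r["finding"].lower() for w in words)]

def ask(query, records, llm_fn):
    # Step 1: inject the tool description into the routing prompt.
    prompt = "ROUTE\nTemplates: " + ", ".join(TEMPLATES) + "\nQuery: " + query
    # Step 2: ask the LLM for a JSON route; unparsable output counts as "no match".
    try:
        route = json.loads(llm_fn(prompt))
    except json.JSONDecodeError:
        route = {}
    if not isinstance(route, dict):
        route = {}
    intent = route.get("intent_class", "semantic_search")
    # Step 3: run the matched template, if there is one.
    template = TEMPLATES.get(intent)
    results = template(records, route.get("filters") or {}) if template else []
    # Step 5: fall back to semantic search when nothing matched or came back.
    if not results:
        intent = "semantic_search"
        results = semantic_fallback(records, query)
    # Step 4: let the LLM turn the raw records into an answer.
    answer = llm_fn("SYNTHESISE\n" + json.dumps(results))
    return {"answer": answer, "intent_class": intent, "results": results}

records = [
    {"finding": "Intake data has missing values", "obs_type": "Data Quality Observation"},
    {"finding": "Median imputation was chosen", "obs_type": "Modelling Choice Observation"},
]

def routing_llm(prompt):
    # Stub LLM that always picks the catch-all template when routing.
    if prompt.startswith("ROUTE"):
        return '{"intent_class": "all_observations_search", "filters": {}}'
    return "Two observations were recorded."

def broken_llm(prompt):
    # Stub LLM whose routing output is not valid JSON.
    return "not json"

routed = ask("What was observed?", records, routing_llm)
fallback = ask("missing values", records, broken_llm)
print(routed["intent_class"], fallback["intent_class"])
```

Passing the LLM in as a plain callable mirrors the llm_fn parameter documented for SearchOrchestrator and makes the flow easy to exercise with stubs.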
- class kmds.search.search_orchestrator.OrchestratorResult(answer: str, intent_class: str, route_explanation: str, results: list[dict[str, Any]])
Bases: object
Result returned by SearchOrchestrator.ask().
- answer
Synthesised natural language answer.
- intent_class
The search template that was ultimately executed.
- route_explanation
The LLM’s own explanation for its routing choice.
- results
Raw observation record dicts that informed the answer.
- class kmds.search.search_orchestrator.OrchestratorRoute(*, intent_class: Literal['exploratory_search', 'data_representation_search', 'modelling_choice_search', 'model_selection_search', 'all_observations_search', 'semantic_search'], filters: SearchFilters = SearchFilters(obs_type_filter=None, finding_seq_min=None, finding_seq_max=None, keyword=None), explanation: str = '')
Bases: BaseModel
Structured output produced by the LLM router.
The LLM is instructed to return a JSON object that conforms to this schema. Pydantic validates and coerces the payload before execution.
- explanation: str
Brief explanation of why this route was chosen (surfaced to the caller).
- filters: SearchFilters
Optional query parameters extracted from the query text.
- intent_class: IntentClass
Which search template best matches the user query.
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
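To make the shape of the router payload concrete, here is a standard-library sketch that checks a JSON string against the fields listed in the signature above. The real class relies on Pydantic for validation and coercion; parse_route and its error messages are hypothetical and only approximate that behaviour (for example, unknown filter keys are silently dropped here).

```python
import json

# Intent classes and filter fields as listed in the signatures above.
INTENT_CLASSES = {
    "exploratory_search",
    "data_representation_search",
    "modelling_choice_search",
    "model_selection_search",
    "all_observations_search",
    "semantic_search",
}
FILTER_KEYS = ("obs_type_filter", "finding_seq_min", "finding_seq_max", "keyword")

def parse_route(raw: str) -> dict:
    """Parse a router reply and fill in defaults for omitted fields."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("route payload must be a JSON object")
    intent = payload.get("intent_class")
    if intent not in INTENT_CLASSES:
        raise ValueError(f"unknown intent_class: {intent!r}")
    filters = payload.get("filters") or {}
    if not isinstance(filters, dict):
        raise ValueError("filters must be a JSON object")
    return {
        "intent_class": intent,
        # Every filter defaults to None, matching the documented defaults.
        "filters": {key: filters.get(key) for key in FILTER_KEYS},
        "explanation": str(payload.get("explanation", "")),
    }

route = parse_route(
    '{"intent_class": "model_selection_search", "filters": {"keyword": "xgboost"}}'
)
print(route)
```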
- class kmds.search.search_orchestrator.SearchFilters(*, obs_type_filter: str | None = None, finding_seq_min: int | None = None, finding_seq_max: int | None = None, keyword: str | None = None)
Bases: BaseModel
Optional parameters the LLM may extract from the user query.
All fields are optional. Any field left as None is ignored during the post-retrieval filtering step.
- finding_seq_max: int | None
Include only observations with finding_seq <= this value.
- finding_seq_min: int | None
Include only observations with finding_seq >= this value.
- keyword: str | None
Additional keyword to filter the finding text (case-insensitive substring).
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- obs_type_filter: str | None
Substring to match against the observation-type label (case-insensitive).
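The field descriptions above imply a simple post-retrieval filter: each non-None field narrows the result set, and None fields are skipped. The sketch below shows one way that could look. The Filters dataclass mirrors the documented fields, while apply_filters and the sample records are hypothetical and not taken from the KMDS source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Filters:
    """Stand-in mirroring the documented SearchFilters fields."""
    obs_type_filter: Optional[str] = None
    finding_seq_min: Optional[int] = None
    finding_seq_max: Optional[int] = None
    keyword: Optional[str] = None

def apply_filters(records, f):
    """Keep records that satisfy every filter that is not None."""
    kept = []
    for r in records:
        # Case-insensitive substring match on the observation-type label.
        if f.obs_type_filter is not None and f.obs_type_filter.lower() not in r["obs_type"].lower():
            continue
        # Inclusive sequence-number bounds.
        if f.finding_seq_min is not None and r["finding_seq"] < f.finding_seq_min:
            continue
        if f.finding_seq_max is not None and r["finding_seq"] > f.finding_seq_max:
            continue
        # Case-insensitive substring match on the finding text.
        if f.keyword is not None and f.keyword.lower() not in r["finding"].lower():
            continue
        kept.append(r)
    return kept

records = [
    {"finding": "Missing values in intake data", "obs_type": "Data Quality Observation", "finding_seq": 1},
    {"finding": "Outliers in billing amounts", "obs_type": "Data Quality Observation", "finding_seq": 2},
    {"finding": "Median imputation chosen", "obs_type": "Modelling Choice Observation", "finding_seq": 3},
]

by_type = apply_filters(records, Filters(obs_type_filter="data quality"))
by_range = apply_filters(records, Filters(finding_seq_min=2, finding_seq_max=3))
by_keyword = apply_filters(records, Filters(keyword="MISSING"))
print(len(by_type), len(by_range), len(by_keyword))
```

An empty Filters() leaves the record list unchanged, which matches the statement that fields left as None are ignored.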
- class kmds.search.search_orchestrator.SearchOrchestrator(kb_path: str, *, persist_dir: str | None = None, llm_fn: Callable[[str], str] | None = None, model: str = 'gemini-1.5-flash', embedding_model: str = 'all-MiniLM-L6-v2', n_results: int = 5)
Bases: object
LLM-driven search orchestrator for a KMDS knowledge base.
The orchestrator routes natural language queries through an LLM to identify the best search template, executes it against the loaded knowledge base, and synthesises the results into a natural language answer.
- Parameters:
kb_path – Path to the KMDS .xml knowledge-base file.
persist_dir – Directory in which to persist the semantic vector index. None keeps the index in memory (rebuilt on each interpreter session).
llm_fn – Optional callable (prompt: str) -> str for your own LLM backend. If None, the orchestrator uses Google GenAI (requires the GOOGLE_API_KEY environment variable).
model – Google GenAI model name (ignored when llm_fn is supplied).
embedding_model – Sentence-transformers model used for the semantic fallback index.
n_results – Default maximum number of observation records returned per query.
Examples
Using Google GenAI (default):
import os
os.environ["GOOGLE_API_KEY"] = "your-key"
from kmds.search import SearchOrchestrator
orc = SearchOrchestrator("my_project.xml", persist_dir="./idx")
result = orc.ask("What data quality issues were found?")
print(result.answer)
Using a custom LLM backend:
from kmds.search import SearchOrchestrator
def my_llm(prompt: str) -> str:
    # call any LLM here; my_model is a placeholder for your own client object
    return my_model.generate(prompt)
orc = SearchOrchestrator("my_project.xml", llm_fn=my_llm)
result = orc.ask("Which model was selected and why?")
print(result.answer)
print(result.results)  # raw records
- ask(query: str) -> OrchestratorResult
Route a natural language query and return a synthesised answer.
This is the single public entry point for the orchestrator. It performs all five steps internally (context injection, routing, execution, synthesis, fallback) and returns an OrchestratorResult.
- Parameters:
query – Free-form natural language question about the knowledge base.
- Return type:
OrchestratorResult
Module contents
- class kmds.search.OrchestratorResult(answer: str, intent_class: str, route_explanation: str, results: list[dict[str, Any]])
Bases: object
Result returned by SearchOrchestrator.ask().
- answer
Synthesised natural language answer.
- intent_class
The search template that was ultimately executed.
- route_explanation
The LLM’s own explanation for its routing choice.
- results
Raw observation record dicts that informed the answer.
- class kmds.search.OrchestratorRoute(*, intent_class: Literal['exploratory_search', 'data_representation_search', 'modelling_choice_search', 'model_selection_search', 'all_observations_search', 'semantic_search'], filters: SearchFilters = SearchFilters(obs_type_filter=None, finding_seq_min=None, finding_seq_max=None, keyword=None), explanation: str = '')
Bases: BaseModel
Structured output produced by the LLM router.
The LLM is instructed to return a JSON object that conforms to this schema. Pydantic validates and coerces the payload before execution.
- explanation: str
Brief explanation of why this route was chosen (surfaced to the caller).
- filters: SearchFilters
Optional query parameters extracted from the query text.
- intent_class: IntentClass
Which search template best matches the user query.
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class kmds.search.SearchFilters(*, obs_type_filter: str | None = None, finding_seq_min: int | None = None, finding_seq_max: int | None = None, keyword: str | None = None)
Bases: BaseModel
Optional parameters the LLM may extract from the user query.
All fields are optional. Any field left as None is ignored during the post-retrieval filtering step.
- finding_seq_max: int | None
Include only observations with finding_seq <= this value.
- finding_seq_min: int | None
Include only observations with finding_seq >= this value.
- keyword: str | None
Additional keyword to filter the finding text (case-insensitive substring).
- model_config = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- obs_type_filter: str | None
Substring to match against the observation-type label (case-insensitive).
- class kmds.search.SearchOrchestrator(kb_path: str, *, persist_dir: str | None = None, llm_fn: Callable[[str], str] | None = None, model: str = 'gemini-1.5-flash', embedding_model: str = 'all-MiniLM-L6-v2', n_results: int = 5)
Bases: object
LLM-driven search orchestrator for a KMDS knowledge base.
The orchestrator routes natural language queries through an LLM to identify the best search template, executes it against the loaded knowledge base, and synthesises the results into a natural language answer.
- Parameters:
kb_path – Path to the KMDS .xml knowledge-base file.
persist_dir – Directory in which to persist the semantic vector index. None keeps the index in memory (rebuilt on each interpreter session).
llm_fn – Optional callable (prompt: str) -> str for your own LLM backend. If None, the orchestrator uses Google GenAI (requires the GOOGLE_API_KEY environment variable).
model – Google GenAI model name (ignored when llm_fn is supplied).
embedding_model – Sentence-transformers model used for the semantic fallback index.
n_results – Default maximum number of observation records returned per query.
Examples
Using Google GenAI (default):
import os
os.environ["GOOGLE_API_KEY"] = "your-key"
from kmds.search import SearchOrchestrator
orc = SearchOrchestrator("my_project.xml", persist_dir="./idx")
result = orc.ask("What data quality issues were found?")
print(result.answer)
Using a custom LLM backend:
from kmds.search import SearchOrchestrator
def my_llm(prompt: str) -> str:
    # call any LLM here; my_model is a placeholder for your own client object
    return my_model.generate(prompt)
orc = SearchOrchestrator("my_project.xml", llm_fn=my_llm)
result = orc.ask("Which model was selected and why?")
print(result.answer)
print(result.results)  # raw records
- ask(query: str) -> OrchestratorResult
Route a natural language query and return a synthesised answer.
This is the single public entry point for the orchestrator. It performs all five steps internally (context injection, routing, execution, synthesis, fallback) and returns an OrchestratorResult.
- Parameters:
query – Free-form natural language question about the knowledge base.
- Return type:
OrchestratorResult
- class kmds.search.SemanticIndex(persist_dir: str | None = None, model_name: str = 'all-MiniLM-L6-v2')
Bases: object
Semantic vector index over a KMDS knowledge base.
- Parameters:
persist_dir – Directory on disk where ChromaDB will persist the index. If None, the index lives in memory only and is lost when the object is garbage collected.
model_name – HuggingFace sentence-transformers model used for embedding. Defaults to "all-MiniLM-L6-v2", a fast, lightweight model with good semantic retrieval quality.
- DEFAULT_MODEL: str = 'all-MiniLM-L6-v2'
- build(kb_path: str) -> None
Load a KMDS knowledge-base file and index all observations.
- Parameters:
kb_path – Path (or URL) to an RDF/OWL knowledge-base file produced by KMDS (*.xml).
- build_from_onto(onto: Any) -> None
Index all observations from an already-loaded ontology.
- Parameters:
onto – An owlready2 Ontology object previously loaded with kmds.utils.load_utils.load_kb().
- clear() -> None
Remove all documents from the index.
- count() -> int
Return the number of documents currently in the index.
- search(query: str, n_results: int = 5) -> list[dict]
Retrieve the most semantically similar observations for query.
- Parameters:
query – Natural language search string.
n_results – Maximum number of results to return.
- Returns:
Each dict contains:
finding – The original observation text.
obs_type – Observation category (e.g. "Data Quality Observation").
workflow_name – Name of the workflow the observation belongs to.
finding_seq – Original sequence number of the finding (-1 for the workflow description entry).
distance – Cosine distance from the query vector (lower = more similar).
intent (optional) – Intent tag when present on the observation.
- Return type:
list[dict]