# Evaluate

## BaseModel
Bases: Protocol
Protocol for pydantic `BaseModel`, used to ensure compatibility with record context.
## EvaluationConfig
Configuration options for LLM evaluation.
### `__init__(embedder=None, embedding_targets=None, compute_similarity=False, cluster=False, compute_histograms=False)`
Initialize the EvaluationConfig with optional parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embedder` | `Optional[Embedder]` | Optional `Embedder` instance to use for generating embeddings for similarity-based metrics. If not provided, no embeddings will be generated. | `None` |
| `embedding_targets` | `Optional[List[str]]` | Optional list of context keys to generate embeddings for. If not provided, embeddings will be generated for all string fields in the record context. | `None` |
| `compute_similarity` | `bool` | Whether to compute similarity between embeddings. | `False` |
| `cluster` | `bool` | Whether to perform clustering on the embeddings. | `False` |
| `compute_histograms` | `bool` | Whether to compute histograms for all calculated features (metrics, embeddings, similarities). | `False` |
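A minimal construction sketch. The import path is an assumption based on the source file location, and `embedder` is assumed to be an `Embedder` instance built elsewhere (its construction is not shown):

```python
from scouter.evaluate import EvaluationConfig  # assumed import path

# Default config: metric evaluation only, no embeddings or histograms.
config = EvaluationConfig()

# Also compute histograms over the calculated metric scores.
hist_config = EvaluationConfig(compute_histograms=True)

# Embedding-based options: `embedder` is assumed to be an already-built
# Embedder instance. Embeddings are generated only for the listed keys.
embedding_config = EvaluationConfig(
    embedder=embedder,
    embedding_targets=["input", "response"],
    compute_similarity=True,
    cluster=True,
)
```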
## LLMEvalMetric
Defines an LLM eval metric to use when evaluating LLMs
### `__init__(name, prompt)`
Initialize an LLMEvalMetric to use for evaluating LLMs. This is most commonly used in conjunction with `evaluate_llm`, where LLM inputs and responses can be evaluated against a variety of user-defined metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the metric. | required |
| `prompt` | `Prompt` | Prompt to use for the metric. For example, a user may create an accuracy analysis prompt or a query reformulation analysis prompt. | required |
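A short sketch of defining a metric. The import path is an assumption, and `accuracy_prompt` is assumed to be a `Prompt` you have already built (see the `Prompt` documentation for its constructor):

```python
from scouter.evaluate import LLMEvalMetric  # assumed import path

# `accuracy_prompt` is assumed to be an existing Prompt that asks an LLM
# judge to score a response for factual accuracy.
accuracy_metric = LLMEvalMetric(name="accuracy", prompt=accuracy_prompt)
```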
## LLMEvalRecord
LLM record containing context tied to a Large Language Model interaction that is used to evaluate LLM responses.
Examples:

    >>> record = LLMEvalRecord(
    ...     id="123",
    ...     context={
    ...         "input": "What is the capital of France?",
    ...         "response": "Paris is the capital of France.",
    ...     },
    ... )
    >>> print(record.context["input"])
    What is the capital of France?
### `context` *(property)*
Get the contextual information.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | The context data as a Python object (deserialized from JSON). |
### `__init__(context, id=None)`
Creates a new LLM record to associate with an `LLMDriftProfile`. The record is sent to the Scouter server via the `ScouterQueue` and is then used to inject context into the evaluation prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | Additional context information as a dictionary or a pydantic `BaseModel`. During evaluation, this will be merged with the input and response data and passed to the assigned evaluation prompts, so any bound variables your evaluation prompts expect must be present in this context. | required |
| `id` | `Optional[str]` | Unique identifier for the record. If not provided, a new UUID will be generated. This is helpful when joining evaluation results back to the original request. | `None` |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If `context` is not a dict or a pydantic `BaseModel`. |
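A sketch of building a record from a pydantic model rather than a plain dict. `RAGContext` is a hypothetical model and the import path is an assumption:

```python
from pydantic import BaseModel
from scouter.evaluate import LLMEvalRecord  # assumed import path


class RAGContext(BaseModel):
    """Hypothetical context model for a RAG-style interaction."""

    input: str
    response: str
    retrieved_docs: list[str]


record = LLMEvalRecord(
    context=RAGContext(
        input="What is the capital of France?",
        response="Paris is the capital of France.",
        retrieved_docs=["Paris has been the capital of France since 987."],
    ),
    id="req-123",  # optional; a UUID is generated when omitted
)
```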
## LLMEvalResults
Defines the results of an LLM evaluation run.
### `errored_tasks` *(property)*

Get a list of record IDs that had errors during evaluation.
### `histograms` *(property)*

Get histograms for all calculated features (metrics, embeddings, similarities).
### `__getitem__(key)`

Get the task results for a specific record ID. A `RuntimeError` will be raised if the record ID does not exist.
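A brief access sketch covering the properties and `__getitem__` above, assuming `results` is an `LLMEvalResults` returned by `evaluate_llm` and `"req-123"` is one of the evaluated record IDs:

```python
failed_ids = results.errored_tasks   # record IDs that errored during evaluation
task_result = results["req-123"]     # raises RuntimeError for an unknown record ID
histograms = results.histograms      # populated when compute_histograms=True
```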
### `__str__()`
### `model_dump_json()`
### `model_validate_json(json_string)` *(staticmethod)*

Validate and create an LLMEvalResults instance from a JSON string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `json_string` | `str` | JSON string to validate and create the LLMEvalResults instance from. | required |
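A round-trip sketch, assuming `results` is an existing `LLMEvalResults`; this is useful for persisting results and reloading them later:

```python
json_str = results.model_dump_json()                      # serialize to JSON
restored = LLMEvalResults.model_validate_json(json_str)   # rebuild from JSON
```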
### `to_dataframe(polars=False)`
Convert the results to a Pandas or Polars DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `polars` | `bool` | Whether to return a Polars DataFrame. If False, a Pandas DataFrame will be returned. | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `DataFrame` | `Any` | A Pandas or Polars DataFrame containing the results. |
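A conversion sketch, assuming `results` is an existing `LLMEvalResults` and that pandas (and optionally polars) is installed:

```python
pandas_df = results.to_dataframe()             # Pandas DataFrame (default)
polars_df = results.to_dataframe(polars=True)  # Polars DataFrame
print(pandas_df.head())
```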
## LLMEvalTaskResult
Eval result for a specific evaluation task.
## `evaluate_llm(records, metrics, config=None)`
Evaluate LLM responses using the provided evaluation metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `records` | `List[LLMEvalRecord]` | List of LLM evaluation records to evaluate. | required |
| `metrics` | `List[LLMEvalMetric]` | List of `LLMEvalMetric` instances to use for evaluation. | required |
| `config` | `Optional[EvaluationConfig]` | Optional `EvaluationConfig` instance to configure evaluation options. | `None` |
Returns:

| Type | Description |
|---|---|
| `LLMEvalResults` | The evaluation results. |
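An end-to-end sketch tying the pieces together. The import path and the pre-built `accuracy_prompt` are assumptions; everything else follows the signatures documented above:

```python
from scouter.evaluate import (  # assumed import path
    EvaluationConfig,
    LLMEvalMetric,
    LLMEvalRecord,
    evaluate_llm,
)

records = [
    LLMEvalRecord(
        context={
            "input": "What is the capital of France?",
            "response": "Paris is the capital of France.",
        },
        id="req-123",
    ),
]

# `accuracy_prompt` is assumed to be a Prompt built per the Prompt docs.
metrics = [LLMEvalMetric(name="accuracy", prompt=accuracy_prompt)]

results = evaluate_llm(
    records=records,
    metrics=metrics,
    config=EvaluationConfig(compute_histograms=True),
)

print(results["req-123"])     # per-record task results
print(results.errored_tasks)  # record IDs that failed evaluation
```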