Dialogs can be evaluated with the components available in the `sdialog.evaluation` module. Use the built-in metrics (readability, dialog flow, linguistic features, LLM judges) or easily create your own, then aggregate the results and compare datasets (sets of dialogs) via `DatasetComparator`.
## Evaluation Capabilities
- Rich Built-in Metrics: A wide range of metrics for readability, dialog flow, linguistic features, and more.
- LLM-as-a-Judge: Leverage large language models to provide human-like judgments of dialog quality.
- Custom Metrics: Easily extend the framework by creating and integrating your own evaluation metrics (see the sketch after this list).
- Dataset Comparison: Use `DatasetComparator` to aggregate results and generate plots for easy comparison between different sets of dialogs.
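As a minimal sketch of how a custom metric could plug in (assuming `Dialog` exposes its utterances through a `turns` attribute with a `text` field, and that evaluators such as `MeanEvaluator` accept any callable that scores a single dialog; check the `sdialog.evaluation` API for the exact scorer base class to subclass):

```python
from sdialog import Dialog
from sdialog.evaluation import MeanEvaluator, DatasetComparator

# Hypothetical custom metric: average number of words per turn.
# Assumes each Dialog stores its utterances in a `turns` list whose items
# expose the spoken text via a `text` attribute; adapt to the actual schema.
def mean_turn_length(dialog: Dialog) -> float:
    turns = dialog.turns
    if not turns:
        return 0.0
    return sum(len(turn.text.split()) for turn in turns) / len(turns)

# Assumes aggregators accept a plain callable scorer; the library may instead
# require subclassing a dedicated scorer class.
comparator = DatasetComparator([
    MeanEvaluator(mean_turn_length, name="Mean words per turn"),
])
```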
## Usage Example
```python
from sdialog import Dialog
from sdialog.evaluation import LLMJudgeRealDialog, LinguisticFeatureScore
from sdialog.evaluation import FrequencyEvaluator, MeanEvaluator
from sdialog.evaluation import DatasetComparator

# Assume `reference` and `candidate` are lists of Dialog objects
reference = [Dialog.from_file("ref_dialog_1.json"), ...]
candidate = [Dialog.from_file("cand_dialog_1.json"), ...]

# Initialize the scorers: an LLM judge and a linguistic feature scorer
judge = LLMJudgeRealDialog()
flesch_scorer = LinguisticFeatureScore(feature="flesch-reading-ease")

# Create a comparator with different evaluation strategies
comparator = DatasetComparator([
    FrequencyEvaluator(judge, name="Realistic dialog rate"),
    MeanEvaluator(flesch_scorer, name="Mean Flesch Reading Ease"),
])

# Run the comparison
results = comparator({"reference": reference, "candidate": candidate})
print(results)

# Plot the results of each evaluator for a visual comparison
comparator.plot()
```
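The two wrappers reflect different aggregation strategies: `FrequencyEvaluator` reports how often the wrapped judge accepts a dialog (here, the rate of dialogs judged realistic), while `MeanEvaluator` averages a numeric score over each dataset. To inspect a single dialog, the sketch below assumes scorers and judges are directly callable on a `Dialog` object; this is an assumption about the API rather than documented behavior.

```python
# Hypothetical direct use of the scorer and judge on one dialog (assumes both
# are callable on a single Dialog; verify against the sdialog.evaluation docs).
sample = candidate[0]
print(flesch_scorer(sample))  # e.g., a Flesch Reading Ease value
print(judge(sample))          # e.g., the judge's verdict for this dialog
```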