For Contributors: Adding a New Data Schema

This tutorial explains how to add a new data schema. A data schema defines the structure of a dataset. The running example is the addition of ommx.v1.SampleSet, which was made during the upgrade from v1.0 to v1.1.

Implementing StorageStrategy

When adding a new data schema, you need to implement a StorageStrategy in minto/v1/datastore.py. StorageStrategy is the interface for reading and writing datasets of a given type.

@dataclass
class SampleSetStorage(StorageStrategy[ommx_v1.SampleSet]):
    def save(self, data: ommx_v1.SampleSet, path: pathlib.Path):
        blob = data.to_bytes()
        with open(path, "wb") as f:
            f.write(blob)

    def load(self, path: pathlib.Path) -> ommx_v1.SampleSet:
        with open(path, "rb") as f:
            return ommx_v1.SampleSet.from_bytes(f.read())

    def add_to_artifact_builder(
        self,
        data: ommx_v1.SampleSet,
        builder: ox_art.ArtifactBuilder,
        annotations: dict[str, str],
    ):
        blob = data.to_bytes()
        builder.add_layer(
            "application/org.ommx.v1.sampleset", blob, annotations
        )

    def load_from_layer(
        self, artifact: ox_art.Artifact, layer: ox_art.Descriptor
    ) -> ommx_v1.SampleSet:
        blob = artifact.get_blob(layer)
        return ommx_v1.SampleSet.from_bytes(blob)

    @property
    def extension(self):
        return "sampleset"

SampleSetStorage inherits from StorageStrategy. StorageStrategy requires five members to be implemented: the four methods save, load, add_to_artifact_builder, and load_from_layer, and the extension property.
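
The file-based part of this interface can be sketched without minto or ommx installed. The block below is a simplified stand-in, not minto's actual StorageStrategy: the names FileStorageSketch, BytesStorage, and round_trip are hypothetical, and the two artifact-related members are left out because they depend on the OMMX artifact API.

```python
import pathlib
import tempfile
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")


class FileStorageSketch(ABC, Generic[T]):
    """Hypothetical, reduced interface: file I/O plus a file extension."""

    @abstractmethod
    def save(self, data: T, path: pathlib.Path) -> None: ...

    @abstractmethod
    def load(self, path: pathlib.Path) -> T: ...

    @property
    @abstractmethod
    def extension(self) -> str: ...


class BytesStorage(FileStorageSketch[bytes]):
    """Stores raw bytes, mirroring the to_bytes/from_bytes pattern above."""

    def save(self, data: bytes, path: pathlib.Path) -> None:
        path.write_bytes(data)

    def load(self, path: pathlib.Path) -> bytes:
        return path.read_bytes()

    @property
    def extension(self) -> str:
        return "bin"


def round_trip(storage: FileStorageSketch[T], data: T) -> T:
    """Save data to a temporary file and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / f"example.{storage.extension}"
        storage.save(data, path)
        return storage.load(path)
```

A round trip such as round_trip(BytesStorage(), b"abc") should return the bytes that were saved, which is also a reasonable first property to test for a new storage class.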

Next, add a samplesets attribute to the DataStore class and register SampleSetStorage under the same key in DataStore._storage_mapping.

@dataclasses.dataclass
class DataStore:
    problems: dict[str, jm.Problem] = field(default_factory=dict)
    instances: dict[str, ommx_v1.Instance] = field(default_factory=dict)
    solutions: dict[str, ommx_v1.Solution] = field(default_factory=dict)
    objects: dict[str, dict] = field(default_factory=dict)
    parameters: dict[str, dict[str, typ.Any]] = field(default_factory=dict)
    samplesets: dict[str, ommx_v1.SampleSet] = field(default_factory=dict) # Add this line
    meta_data: dict[str, typ.Any] = field(default_factory=dict)

    _storage_mapping: typ.ClassVar[dict[str, StorageStrategy]] = {
        "problems": ProblemStorage(),
        "instances": InstanceStorage(),
        "solutions": SolutionStorage(),
        "objects": JSONStorage(),
        "parameters": JSONStorage(),
        "samplesets": SampleSetStorage(), # Add this line
        "meta_data": JSONStorage(),
    }

With these two additions, DataStore recognizes the "samplesets" data type and uses SampleSetStorage to read and write it.
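
Why the attribute name and the mapping key must match can be illustrated with a small stand-alone sketch. How DataStore actually consumes _storage_mapping is not shown in this tutorial, so the dispatch loop below is an assumption; MiniStore, JSONStorageSketch, and save_all are hypothetical names.

```python
import json
import pathlib
import tempfile
import typing as typ
from dataclasses import dataclass, field


class JSONStorageSketch:
    """Hypothetical stand-in for a storage strategy that writes JSON files."""

    extension = "json"

    def save(self, data: dict, path: pathlib.Path) -> None:
        path.write_text(json.dumps(data))

    def load(self, path: pathlib.Path) -> dict:
        return json.loads(path.read_text())


@dataclass
class MiniStore:
    parameters: dict[str, dict] = field(default_factory=dict)
    objects: dict[str, dict] = field(default_factory=dict)

    # Keys must equal the attribute names above so getattr() can find the data.
    _storage_mapping: typ.ClassVar[dict[str, JSONStorageSketch]] = {
        "parameters": JSONStorageSketch(),
        "objects": JSONStorageSketch(),
    }

    def save_all(self, root: pathlib.Path) -> list[str]:
        """Write every entry using the strategy registered for its type."""
        written = []
        for type_name, storage in self._storage_mapping.items():
            folder = root / type_name
            folder.mkdir(parents=True, exist_ok=True)
            for name, value in getattr(self, type_name).items():
                path = folder / f"{name}.{storage.extension}"
                storage.save(value, path)
                written.append(path.relative_to(root).as_posix())
        return sorted(written)
```

In a design like this one, a key that has no matching attribute would fail at getattr, and an attribute that has no key would silently never be saved.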

Adding log_sampleset Method to Experiment Class

Now that DataStore can recognize ommx.v1.SampleSet, the next step is to add Experiment.log_sampleset.

    def log_sampleset(
        self,
        name: str | ommx_v1.SampleSet,
        value: typ.Optional[ommx_v1.SampleSet] = None
    ):
        """Log a SampleSet to the experiment or run database."""

        # If name is specified, use it as the name.
        # If name is not specified and a SampleSet is passed as the first argument,
        # use the variable name of the SampleSet as the name.
        datastore = self.get_current_datastore()
        _name, _value = self._get_name_or_default(
            name, value, datastore.samplesets
        )

        self.log_data(_name, _value, "samplesets")

The _get_name_or_default method resolves the name and value. If name is given as a string, it is used as is; if a SampleSet is passed as the first argument instead, the variable name of the SampleSet is used.

The SampleSet is then logged to the database under _name with the value _value. The third argument, "samplesets", tells DataStore which data type is being logged, so that the data is handled as an ommx.v1.SampleSet.
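
The "name or value as first argument" pattern can be sketched as a stand-alone function. This is not minto's _get_name_or_default: the function name resolve_name_and_value is hypothetical, and the fallback name used here (a counter based on the number of existing entries) is a placeholder assumption standing in for the variable-name lookup described above.

```python
from typing import Any


def resolve_name_and_value(name: Any, value: Any, existing: dict) -> tuple[str, Any]:
    """Return (name, value) for both supported call styles.

    Style 1: resolve_name_and_value("best", data, store) -> explicit name.
    Style 2: resolve_name_and_value(data, None, store)   -> generated name.
    """
    if isinstance(name, str):
        if value is None:
            raise ValueError("a value is required when a name is given")
        return name, value
    # The first argument actually holds the data; generate a placeholder name.
    # Assumption: a simple counter, not the variable-name lookup minto uses.
    return str(len(existing)), name
```

Keeping this resolution in one helper lets every log_* method accept both call styles without repeating the branching logic.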

Adding Conversion Method for Table Data

You can view the data of an Experiment as a pandas.DataFrame using methods such as Experiment.get_run_table or Experiment.get_experiment_tables. The conversion of each data type is implemented in minto/table.py. Here, we add a function that converts a SampleSet into a row of table data.

For example, the conversion method for ommx.v1.Solution is implemented as follows:

def _extract_solution_info(solution: ommx_v1.Solution):
    info = {
        "objective": solution.objective,
        "feasible": solution.feasible,
        "optimality": solution.optimality,
        "relaxation": solution.relaxation,
        "start": solution.start,
    }
    return info

Similarly, add a function that extracts summary information from a SampleSet.

def _extract_sampleset_info(sampleset: ommx_v1.SampleSet):
    summary = sampleset.summary
    objective = summary.objective
    return {
        "num_samples": len(summary),
        "obj_mean": objective.mean(),
        "obj_std": objective.std(),
        "obj_min": objective.min(),
        "obj_max": objective.max(),
        "feasible": summary.feasible.sum(),
        "feasible_unrelaxed": summary.feasible_unrelaxed.sum(),
    }
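
To see what these columns represent without an ommx.v1.SampleSet at hand, the same aggregation can be sketched on plain Python lists. This assumes, as the attribute access above suggests, that summary holds one row per sample with an objective value and a feasibility flag; summarize_samples and the sample data are hypothetical. pandas' std() uses the sample standard deviation by default, which statistics.stdev matches.

```python
import statistics


def summarize_samples(objectives: list[float], feasible: list[bool]) -> dict:
    """Aggregate per-sample values into one table row."""
    return {
        "num_samples": len(objectives),
        "obj_mean": statistics.mean(objectives),
        "obj_std": statistics.stdev(objectives),  # sample std (ddof=1)
        "obj_min": min(objectives),
        "obj_max": max(objectives),
        "feasible": sum(feasible),  # True counts as 1, so this is a count
    }


# Four made-up samples, three of them feasible.
info = summarize_samples([3.0, 5.0, 4.0, 8.0], [True, True, False, True])
```

Note that "feasible" is the number of feasible samples, not a boolean as in the Solution case.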

Then call this function from create_table_info.

def create_table_info(datastore: DataStore) -> dict:
    instance_data = {}
    for name, inst in datastore.instances.items():
        instance_data[name] = _extract_instance_info(inst)

    solution_data = {}
    for name, sol in datastore.solutions.items():
        solution_data[name] = _extract_solution_info(sol)

    # Added code -------------------------
    sampleset_data = {}
    for name, sampleset in datastore.samplesets.items():
        sampleset_data[name] = _extract_sampleset_info(sampleset)
    # ------------------------- Added code

    return {
        "instance": instance_data,
        "solution": solution_data,
        "sampleset": sampleset_data,  # Added code
        "parameter": datastore.parameters,
        "metadata": datastore.meta_data,
    }

With this, the SampleSet summary is included when the data of an Experiment is converted to a pandas.DataFrame.

Summary

Write test code under tests/ as you complete each of the steps above. With these changes in place, you can log a SampleSet to an Experiment as follows:

experiment = Experiment()
with experiment.run():
    experiment.log_sampleset("sampleset", sampleset)
experiment.get_run_table()

Contributors can add new data schemas in this way.