For Contributors: Adding a New Data Schema

This tutorial explains how to add a new data schema. A data schema defines the structure of a dataset. The running example is the addition of ommx.v1.SampleSet, which was made during the upgrade from v1.0 to v1.1.

Implementing StorageStrategy

When adding a new data schema, you first need to implement a StorageStrategy subclass in minto.v1.datastore.py. StorageStrategy is the interface for reading and writing datasets, both as local files and as artifact layers.

@dataclass
class SampleSetStorage(StorageStrategy[ommx_v1.SampleSet]):
    def save(self, data: ommx_v1.SampleSet, path: pathlib.Path):
        blob = data.to_bytes()
        with open(path, "wb") as f:
            f.write(blob)

    def load(self, path: pathlib.Path) -> ommx_v1.SampleSet:
        with open(path, "rb") as f:
            return ommx_v1.SampleSet.from_bytes(f.read())

    def add_to_artifact_builder(
        self,
        data: ommx_v1.SampleSet,
        builder: ox_art.ArtifactBuilder,
        annotations: dict[str, str],
    ):
        blob = data.to_bytes()
        builder.add_layer(
            "application/org.ommx.v1.sampleset", blob, annotations
        )

    def load_from_layer(
        self, artifact: ox_art.Artifact, layer: ox_art.Descriptor
    ) -> ommx_v1.SampleSet:
        blob = artifact.get_blob(layer)
        return ommx_v1.SampleSet.from_bytes(blob)

    @property
    def extension(self) -> str:
        # File extension used when the data is saved to disk.
        return "sampleset"

SampleSetStorage inherits from StorageStrategy. StorageStrategy requires four methods (save, load, add_to_artifact_builder, and load_from_layer) and one property (extension).
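
As a rough illustration of the interface described above, the sketch below defines a hypothetical StorageStrategySketch base class with the same four methods and the extension property, plus a toy TextStorage for plain strings. The names StorageStrategySketch, TextStorage, and round_trip are assumptions made for this example; the actual StorageStrategy definition lives in minto and may differ.

```python
import abc
import pathlib
import tempfile
import typing as typ

T = typ.TypeVar("T")


class StorageStrategySketch(abc.ABC, typ.Generic[T]):
    """Hypothetical stand-in for the interface; not minto's actual code."""

    @abc.abstractmethod
    def save(self, data: T, path: pathlib.Path) -> None:
        """Write data to a local file."""

    @abc.abstractmethod
    def load(self, path: pathlib.Path) -> T:
        """Read data back from a local file."""

    @abc.abstractmethod
    def add_to_artifact_builder(
        self, data: T, builder: typ.Any, annotations: typ.Dict[str, str]
    ) -> None:
        """Append data to an artifact as one layer."""

    @abc.abstractmethod
    def load_from_layer(self, artifact: typ.Any, layer: typ.Any) -> T:
        """Rebuild data from one artifact layer."""

    @property
    @abc.abstractmethod
    def extension(self) -> str:
        """File extension used for saved files."""


class TextStorage(StorageStrategySketch[str]):
    """Toy strategy for plain strings; artifact support is left out."""

    def save(self, data: str, path: pathlib.Path) -> None:
        path.write_bytes(data.encode("utf-8"))

    def load(self, path: pathlib.Path) -> str:
        return path.read_bytes().decode("utf-8")

    def add_to_artifact_builder(self, data, builder, annotations) -> None:
        raise NotImplementedError("artifact support is outside this sketch")

    def load_from_layer(self, artifact, layer) -> str:
        raise NotImplementedError("artifact support is outside this sketch")

    @property
    def extension(self) -> str:
        return "txt"


def round_trip(text: str) -> str:
    """Save text to a temporary file and load it back."""
    storage = TextStorage()
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / f"example.{storage.extension}"
        storage.save(text, path)
        return storage.load(path)
```

Because every member is abstract in this sketch, a subclass that forgets one of them cannot be instantiated, which is one way to catch an incomplete strategy early.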

Next, add a samplesets attribute to the DataStore class and register SampleSetStorage under the same key in DataStore._storage_mapping.

@dataclasses.dataclass
class DataStore:
    problems: dict[str, jm.Problem] = field(default_factory=dict)
    instances: dict[str, ommx_v1.Instance] = field(default_factory=dict)
    solutions: dict[str, ommx_v1.Solution] = field(default_factory=dict)
    objects: dict[str, dict] = field(default_factory=dict)
    parameters: dict[str, dict[str, typ.Any]] = field(default_factory=dict)
    samplesets: dict[str, ommx_v1.SampleSet] = field(default_factory=dict) # Add this line
    meta_data: dict[str, typ.Any] = field(default_factory=dict)

    _storage_mapping: typ.ClassVar[dict[str, StorageStrategy]] = {
        "problems": ProblemStorage(),
        "instances": InstanceStorage(),
        "solutions": SolutionStorage(),
        "objects": JSONStorage(),
        "parameters": JSONStorage(),
        "samplesets": SampleSetStorage(), # Add this line
        "meta_data": JSONStorage(),
    }

With these two additions, DataStore recognizes the "samplesets" entry and uses SampleSetStorage to read and write it.
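
To show why the attribute name and the mapping key need to agree, here is a minimal sketch of a mapping-driven store. ToyDataStore, JSONFileStorage, save_all, load_all, and the directory layout are all hypothetical names and choices for this example; how minto's DataStore actually walks _storage_mapping is not shown in this tutorial.

```python
import dataclasses
import json
import pathlib
import tempfile
import typing as typ


class JSONFileStorage:
    """Toy strategy: one JSON file per entry."""

    extension = "json"

    def save(self, data: typ.Any, path: pathlib.Path) -> None:
        path.write_text(json.dumps(data))

    def load(self, path: pathlib.Path) -> typ.Any:
        return json.loads(path.read_text())


@dataclasses.dataclass
class ToyDataStore:
    parameters: dict = dataclasses.field(default_factory=dict)
    samplesets: dict = dataclasses.field(default_factory=dict)

    # Each key must match an attribute name above, because the loops
    # below look the attribute up with getattr(self, key).
    _storage_mapping: typ.ClassVar[dict] = {
        "parameters": JSONFileStorage(),
        "samplesets": JSONFileStorage(),
    }

    def save_all(self, root: pathlib.Path) -> None:
        """Write every entry to <root>/<key>/<name>.<extension>."""
        for key, storage in self._storage_mapping.items():
            key_dir = root / key
            key_dir.mkdir(parents=True, exist_ok=True)
            for name, value in getattr(self, key).items():
                storage.save(value, key_dir / f"{name}.{storage.extension}")

    @classmethod
    def load_all(cls, root: pathlib.Path) -> "ToyDataStore":
        """Rebuild a store from the layout written by save_all."""
        store = cls()
        for key, storage in cls._storage_mapping.items():
            pattern = f"*.{storage.extension}"
            for path in sorted((root / key).glob(pattern)):
                getattr(store, key)[path.stem] = storage.load(path)
        return store


def demo_store() -> ToyDataStore:
    """Save a small store to a temporary directory and load it back."""
    store = ToyDataStore(samplesets={"run0": [1.0, 2.5]})
    with tempfile.TemporaryDirectory() as tmp:
        store.save_all(pathlib.Path(tmp))
        return ToyDataStore.load_all(pathlib.Path(tmp))
```

In a design like this, supporting a new schema only takes one new attribute and one new mapping entry; the save and load loops stay unchanged.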

Adding log_sampleset Method to Experiment Class

Now that DataStore can recognize ommx.v1.SampleSet, the next step is to add Experiment.log_sampleset.

    def log_sampleset(
        self,
        name: str | ommx_v1.SampleSet,
        value: typ.Optional[ommx_v1.SampleSet] = None
    ):
        """Log a SampleSet to the experiment or run database."""

        # If name is specified, use it as the name.
        # If name is not specified and a SampleSet is passed as the first argument,
        # use the variable name of the SampleSet as the name.
        datastore = self.get_current_datastore()
        _name, _value = self._get_name_or_default(
            name, value, datastore.samplesets
        )

        self.log_data(_name, _value, "samplesets")

The _get_name_or_default method determines the name and value to log. If name is given as a string, it is used as is; if a SampleSet is passed as the first argument instead, the variable name of that SampleSet is used.
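
The two calling conventions can be sketched as follows. This is not minto's implementation of _get_name_or_default: the function name resolve_name, the use of inspect to look up the caller's variable name, and the numbered fallback are assumptions chosen to make the behavior concrete.

```python
import inspect
import typing as typ


def resolve_name(
    name: typ.Any,
    value: typ.Any,
    existing: typ.Dict[str, typ.Any],
) -> typ.Tuple[str, typ.Any]:
    """Return (name, value) for either calling convention."""
    # Case 1: an explicit string name was given, so use it as is.
    if isinstance(name, str):
        if value is None:
            raise ValueError("a value is required when a name is given")
        return name, value

    # Case 2: the object itself was passed as the first argument.
    # Look for a local variable in the caller that refers to it.
    obj = name
    frame = inspect.currentframe()
    caller = frame.f_back if frame is not None else None
    if caller is not None:
        for var_name, var_value in caller.f_locals.items():
            if var_value is obj:
                return var_name, obj

    # Fallback (an assumption of this sketch): number the entry.
    return str(len(existing)), obj


def demo_names():
    """Call resolve_name both ways from inside one function."""
    my_samples = ["s0", "s1"]
    explicit = resolve_name("baseline", my_samples, {})
    inferred = resolve_name(my_samples, None, {})
    return explicit, inferred
```

Here demo_names returns the pair ("baseline", my_samples) for the explicit call and ("my_samples", my_samples) for the inferred one. Frame inspection of this kind only sees the immediate caller, so a helper wrapped inside another method would need to look one frame further up.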

self.log_data then stores _value under _name. The third argument, "samplesets", selects the DataStore attribute, and therefore the storage strategy, that was registered in the previous section.
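
A small sketch of how a storage name might select the target attribute is shown below. MiniDataStore and its add method are hypothetical; they only mirror the idea that the third argument of log_data must match a DataStore attribute.

```python
import dataclasses
import typing as typ


@dataclasses.dataclass
class MiniDataStore:
    solutions: typ.Dict[str, typ.Any] = dataclasses.field(default_factory=dict)
    samplesets: typ.Dict[str, typ.Any] = dataclasses.field(default_factory=dict)

    def add(self, name: str, value: typ.Any, storage_name: str) -> None:
        """Store value under name in the attribute called storage_name."""
        valid = {f.name for f in dataclasses.fields(self)}
        if storage_name not in valid:
            raise ValueError(f"unknown storage name: {storage_name!r}")
        getattr(self, storage_name)[name] = value
```

With this toy class, add("run0", data, "samplesets") succeeds, while the singular "sampleset" raises ValueError because no attribute of that name exists.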

Adding Conversion Method for Table Data

You can view the data of an Experiment as a pandas.DataFrame using methods such as Experiment.get_run_table or Experiment.get_experiment_tables. The conversion for each data type is implemented in minto/table.py. Here, we add a function that extracts the table columns for a SampleSet.

For example, the conversion method for ommx.v1.Solution is implemented as follows:

def _extract_solution_info(solution: ommx_v1.Solution):
    info = {
        "objective": solution.objective,
        "feasible": solution.feasible,
        "optimality": solution.optimality,
        "relaxation": solution.relaxation,
        "start": solution.start,
    }
    return info

Similarly, add a function that summarizes a SampleSet.

def _extract_sampleset_info(sampleset: ommx_v1.SampleSet):
    summary = sampleset.summary
    objective = summary.objective
    return {
        "num_samples": len(summary),
        "obj_mean": objective.mean(),
        "obj_std": objective.std(),
        "obj_min": objective.min(),
        "obj_max": objective.max(),
        "feasible": summary.feasible.sum(),
        "feasible_unrelaxed": summary.feasible_unrelaxed.sum(),
    }
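
To make the aggregation concrete without depending on ommx or pandas, the sketch below computes the same kind of summary from a plain list of rows. The function name summarize_rows and the row layout (dictionaries with "objective" and "feasible" keys) are assumptions for this example; the real code reads these columns from sampleset.summary.

```python
import math
import statistics
import typing as typ


def summarize_rows(
    rows: typ.List[typ.Dict[str, typ.Any]],
) -> typ.Dict[str, typ.Any]:
    """Aggregate per-sample rows into one summary dictionary."""
    objectives = [row["objective"] for row in rows]
    # statistics.stdev is the sample standard deviation and needs at
    # least two values, so a single sample yields NaN here.
    obj_std = statistics.stdev(objectives) if len(objectives) > 1 else math.nan
    return {
        "num_samples": len(rows),
        "obj_mean": statistics.mean(objectives),
        "obj_std": obj_std,
        "obj_min": min(objectives),
        "obj_max": max(objectives),
        # Summing booleans counts the feasible samples.
        "feasible": sum(bool(row["feasible"]) for row in rows),
    }


example_rows = [
    {"objective": 1.0, "feasible": True},
    {"objective": 2.0, "feasible": False},
    {"objective": 3.0, "feasible": True},
]
example_summary = summarize_rows(example_rows)
```

Note that "feasible" in the result is a count, not a flag, which matches the use of sum() in the function above. An empty list of rows is not handled in this sketch.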

Then call this function from create_table_info.

def create_table_info(datastore: DataStore) -> dict:
    instance_data = {}
    for name, inst in datastore.instances.items():
        instance_data[name] = _extract_instance_info(inst)

    solution_data = {}
    for name, sol in datastore.solutions.items():
        solution_data[name] = _extract_solution_info(sol)

    # Added code -------------------------
    sampleset_data = {}
    for name, sampleset in datastore.samplesets.items():
        sampleset_data[name] = _extract_sampleset_info(sampleset)
    # ------------------------- Added code

    return {
        "instance": instance_data,
        "solution": solution_data,
        "sampleset": sampleset_data,  # Added code
        "parameter": datastore.parameters,
        "metadata": datastore.meta_data,
    }

With this change, each logged SampleSet is summarized in the table data that is later converted to a pandas.DataFrame.
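
How the nested dictionary becomes a table is not covered in this tutorial. As one possible approach, the sketch below flattens the nested sections into a single row keyed by (section, name, field). The function flatten_table_info is hypothetical, and it only handles sections shaped like "instance", "solution", and "sampleset", where each entry maps a name to a dictionary of fields.

```python
import typing as typ

NestedInfo = typ.Dict[str, typ.Dict[str, typ.Dict[str, typ.Any]]]
FlatRow = typ.Dict[typ.Tuple[str, str, str], typ.Any]


def flatten_table_info(table_info: NestedInfo) -> FlatRow:
    """Flatten {section: {name: {field: value}}} into one flat row."""
    row: FlatRow = {}
    for section, entries in table_info.items():
        for name, info in entries.items():
            for field_name, value in info.items():
                row[(section, name, field_name)] = value
    return row


example_row = flatten_table_info(
    {
        "solution": {"sol0": {"objective": 1.5}},
        "sampleset": {"ss0": {"num_samples": 3, "obj_mean": 2.0}},
    }
)
```

Each key in example_row identifies one table cell, so adding a new section such as "sampleset" adds columns without changing the flattening logic.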

Summary

Write tests under tests/ as you implement each of the steps above. Once the implementation is complete, you can log a SampleSet to an Experiment as follows:


# sampleset is assumed to be an existing ommx.v1.SampleSet.
experiment = Experiment()
with experiment.run():
    experiment.log_sampleset("sampleset", sampleset)
experiment.get_run_table()

Contributors can add new data schemas in this way.
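
The tests mentioned in this summary could start with a round-trip check of the new storage strategy. The sketch below uses a hypothetical BytesStorage stand-in so that it runs without ommx; in a real test under tests/ you would use SampleSetStorage and an actual SampleSet instead. The helper run_without_pytest is also an assumption, included only so the check can be run directly.

```python
import pathlib
import tempfile


class BytesStorage:
    """Stand-in for a storage strategy that saves and loads raw bytes."""

    extension = "bin"

    def save(self, data: bytes, path: pathlib.Path) -> None:
        path.write_bytes(data)

    def load(self, path: pathlib.Path) -> bytes:
        return path.read_bytes()


def test_round_trip(tmp_path: pathlib.Path) -> None:
    # Under pytest, tmp_path is a fresh temporary directory per test.
    storage = BytesStorage()
    path = tmp_path / f"data.{storage.extension}"
    payload = b"\x00\x01payload"
    storage.save(payload, path)
    assert storage.load(path) == payload


def run_without_pytest() -> bool:
    """Run the same check with a manually created temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        test_round_trip(pathlib.Path(tmp))
    return True
```

A similar round-trip test for the artifact path (add_to_artifact_builder followed by load_from_layer) would cover the remaining two methods of the strategy.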