Execute

Once your ReFedEz project is configured and started (as described in Start), you can execute federated learning jobs. This section uses the CIFAR-10 example to demonstrate the execution process.

Prerequisites

Before running the example:

Install Dependencies: Ensure all required packages are installed. From the project directory, run:
```
uv sync
```
Or with pip:
```
pip install -e .
```
Prepare Directories: Create the necessary directories for datasets and models:
```
mkdir -p /ds
mkdir -p /models
```

Download Dataset: Download the CIFAR-10 dataset to the expected location (on each client):

# Optional pre-download with torchvision (the script downloads the dataset automatically, but this confirms it is available)
python -c "import torchvision; torchvision.datasets.CIFAR10(root='/ds/cifar10', train=True, download=True)"
python -c "import torchvision; torchvision.datasets.CIFAR10(root='/ds/cifar10', train=False, download=True)"
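
To confirm on each client that the download finished, a small check like the one below may help. It is a minimal sketch, not part of ReFedEz: the helper name `missing_cifar10_files` is hypothetical, and it assumes the standard layout torchvision produces, where the archive is extracted into a `cifar-10-batches-py` folder under the dataset root.

```python
from pathlib import Path

# Folder and files torchvision leaves behind after extracting CIFAR-10.
CIFAR10_SUBDIR = "cifar-10-batches-py"
CIFAR10_FILES = [f"data_batch_{i}" for i in range(1, 6)] + ["test_batch", "batches.meta"]


def missing_cifar10_files(root):
    """Return the expected CIFAR-10 files that are absent under `root`.

    Hypothetical helper for illustration; it only checks that files exist,
    not that their contents are intact.
    """
    base = Path(root) / CIFAR10_SUBDIR
    return [name for name in CIFAR10_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_cifar10_files("/ds/cifar10")
    if missing:
        print("CIFAR-10 is incomplete; missing:", ", ".join(missing))
    else:
        print("CIFAR-10 is ready in /ds/cifar10")
```

An empty result means every expected file is present; any names returned indicate that the download should be repeated on that client.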

Running the Federated Learning Job

The CIFAR-10 example uses a PyTorch-based federated learning implementation. The model.py file contains a CIFAR10Federated class decorated with @Federated, which automatically handles the distributed training across the configured server and clients.

To start the federated training:

python model.py

This command:

- Initializes the federated learning process using the configuration in refedez.yaml
- Loads the CIFAR-10 dataset from /ds/cifar10
- Trains a CNN model across the distributed clients (site1 and site2) and server (server.localhost)
- Aggregates model updates using federated averaging
- Saves trained models to /models/test.pl on the server
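
The aggregation step listed above can be illustrated in isolation. The sketch below shows the general idea of federated averaging on plain Python lists; it is not ReFedEz's implementation. The function name `federated_average`, the toy parameter name, and the choice to weight each client by its sample count are all assumptions made for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameters, weighting each client by its sample count.

    client_weights: list of dicts mapping a parameter name to a list of floats.
    client_sizes:   number of training samples held by each client.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Weighted mean of each element across all clients.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical clients with toy two-element parameter vectors.
    site1 = {"conv.weight": [1.0, 2.0]}
    site2 = {"conv.weight": [3.0, 4.0]}
    print(federated_average([site1, site2], client_sizes=[100, 300]))
```

Because site2 holds three times as many samples in this toy setup, the averaged values sit closer to its parameters than to those of site1.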

Expected Output

Models: Trained model checkpoints are saved to /models/ on the server.

After completion, you can stop the deployment:

refedez stop

And clean up temporary files:

refedez clean