Execute
Once your ReFedEz project is configured and started (as described in Start), you can execute federated learning jobs. This section uses the CIFAR-10 example to demonstrate the execution process.
Prerequisites
Before running the example:
- Install Dependencies: Ensure all required packages are installed. From the project directory:
  Or with pip:
- Prepare Directories: Create the necessary directories for datasets and models:
- Download Dataset: Download the CIFAR-10 dataset to the expected location (on each client):
    # Using torchvision (will be done automatically by the script, but ensure it's available)
    python -c "import torchvision; torchvision.datasets.CIFAR10(root='/ds/cifar10', train=True, download=True)"
    python -c "import torchvision; torchvision.datasets.CIFAR10(root='/ds/cifar10', train=False, download=True)"
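The directory preparation step could be sketched in Python as follows. This is only an illustration: the helper name `prepare_directories` is hypothetical and not part of ReFedEz, and the paths `/ds/cifar10` and `/models` are assumed from the locations mentioned elsewhere on this page.

```python
from pathlib import Path
from typing import Iterable, List


def prepare_directories(paths: Iterable[str]) -> List[Path]:
    """Create each directory (and any missing parents) if it does not exist.

    Hypothetical helper for illustration; not a ReFedEz API.
    """
    prepared = []
    for raw in paths:
        path = Path(raw)
        # parents=True creates intermediate directories;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        prepared.append(path)
    return prepared
```

Assuming the default locations, calling `prepare_directories(["/ds/cifar10", "/models"])` on each machine would create both directories. Paths directly under `/` may require elevated permissions.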
Running the Federated Learning Job
The CIFAR-10 example uses a PyTorch-based federated learning implementation. The model.py file contains a CIFAR10Federated class decorated with @Federated, which automatically handles the distributed training across the configured server and clients.
To start the federated training:
This command:
- Initializes the federated learning process using the configuration in refedez.yaml
- Loads the CIFAR-10 dataset from /ds/cifar10
- Trains a CNN model across the distributed clients (site1 and site2) and server (server.localhost)
- Aggregates model updates using federated averaging
- Saves trained models to /models/test.pl on the server
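To make the aggregation step in the list above more concrete, here is a minimal, generic sketch of federated averaging in plain Python. It is not the ReFedEz implementation: the function name `federated_average`, the dictionary layout of the weights, and the choice to weight each client by its number of training samples are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Assumed layout: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def federated_average(
    client_weights: Sequence[Weights], num_samples: Sequence[int]
) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights:
        raise ValueError("need at least one client update")
    if len(client_weights) != len(num_samples):
        raise ValueError("one sample count is required per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(len(reference))
        ]
    return averaged


# Toy updates from two clients (values are made up for the example).
site1 = {"conv.weight": [1.0, 2.0], "fc.bias": [0.0]}
site2 = {"conv.weight": [3.0, 6.0], "fc.bias": [4.0]}
global_weights = federated_average([site1, site2], num_samples=[100, 300])
print(global_weights)  # {'conv.weight': [2.5, 5.0], 'fc.bias': [3.0]}
```

Because site2 contributes three times as many samples in this toy example, the averaged values sit closer to its parameters than to those of site1.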
Expected Output
- Models: Trained model checkpoints saved to /models/
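One way to confirm that checkpoints were written is to list the files in the models directory on the server. The sketch below assumes the `/models/` location named above; the helper `list_checkpoints` is hypothetical and not part of ReFedEz.

```python
from pathlib import Path
from typing import List


def list_checkpoints(models_dir: str) -> List[str]:
    """Return the sorted names of regular files directly inside models_dir.

    Hypothetical helper for illustration; returns an empty list if the
    directory does not exist.
    """
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())
```

After a successful run, calling `list_checkpoints("/models")` on the server would be expected to return a non-empty list.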
After completion, you can stop the deployment:
And clean up temporary files: