Knowledge Distillation

  1. Train the Teacher Model: A large, complex model is trained on a large dataset. This model is expected to achieve high accuracy, but it could be computationally expensive and hard to deploy due to its size.
  2. Distillation: The trained teacher model is then used to generate predictions for a dataset (which may be the original training set or a different one). These predictions, which capture the knowledge learned by the teacher, are used as "soft labels" for training the student model. Soft labels are the class probabilities output by the teacher, as opposed to hard labels, which are the ground-truth class labels. The idea is that soft labels carry more information (the teacher's uncertainty or confidence across classes) than hard labels, helping the student model learn better.
  3. Train the Student Model: The smaller student model is then trained to mimic the teacher's outputs on this dataset, typically by minimizing a loss function that measures the difference between the student's predictions and the teacher's soft labels (see the sketch after this list).
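As a concrete illustration, here is a minimal PyTorch-style sketch of a standard distillation loss. The temperature and weighting values are illustrative assumptions, not prescriptions from these notes.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of (a) KL divergence between temperature-softened
    teacher and student distributions (the "soft label" term) and
    (b) ordinary cross-entropy against the ground-truth hard labels."""
    # Soft-label term: teacher probabilities vs. student log-probabilities.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_preds, soft_targets, reduction="batchmean")
    soft_loss = soft_loss * temperature ** 2  # conventional rescaling

    # Hard-label term: usual supervised loss on the true classes.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

During training, the teacher is kept frozen and only the student's parameters are updated against this loss.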

Knowledge distillation can be a good way to create smaller models that perform almost as well as their larger counterparts. In the context of federated learning, it can be used to create smaller models that can be more easily trained and deployed on edge devices, while still benefiting from the knowledge learned by larger models trained on more diverse, distributed data.

For your situation, you can use knowledge distillation as a mechanism to link the performance of the large model (teacher) with the small model (student). If the large model is not well-trained, it will not provide good soft labels for training the student model, resulting in poor performance of the student model as well. This way, companies would be incentivized to train the large model carefully, since it directly impacts the performance of the small model, and thereby their compensation.


Problem description:

n companies, each holding its own private data

According to the Shapley value (SV) formula, in federated learning we need to consider all 2^n possible coalitions of the data in order to evaluate the contribution of each company/dataset

However, this is extremely expensive, especially for large models
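For reference, a minimal sketch of the exact Shapley value computation, which makes the 2^n cost explicit. Here `utility` is a hypothetical placeholder for "train a federated model on this coalition's data and measure its performance"; each call is a full training run, which is what makes the exact computation prohibitive for large models.

```python
from itertools import combinations
from math import factorial

def exact_shapley(companies, utility):
    """Exact Shapley values for a list of companies.

    `utility(coalition)` -> performance (e.g. negative loss) of a model
    trained via federated learning on that coalition's data. Evaluating
    it for every subset requires 2^n training runs."""
    n = len(companies)
    values = {c: 0.0 for c in companies}
    for c in companies:
        others = [x for x in companies if x != c]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = utility(subset + (c,)) - utility(subset)
                values[c] += weight * marginal
    return values
```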

Assumption:

Same data combination (coalition) + same training procedure (federated learning) ➡ a scaling law holds between the small model's loss and the large model's loss ➡ we can estimate the large model's performance from the results of the 2^n coalitions computed on the small model
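Under this assumption, the estimation step might look like the sketch below. It is purely illustrative: it assumes a simple linear relation between small-model and large-model loss, and a handful of hypothetical "anchor" coalitions on which both models were actually trained (whether such anchors exist is exactly what the hurdles below question).

```python
import numpy as np

def fit_loss_scaling(small_losses, large_losses):
    """Fit an assumed linear scaling relation
    large_loss ≈ a * small_loss + b
    from anchor coalitions where both models were trained."""
    a, b = np.polyfit(small_losses, large_losses, deg=1)
    return lambda small_loss: a * small_loss + b

# Usage sketch:
#   1. Train the small model on all 2^n coalitions (cheap).
#   2. Train the large model on a few anchor coalitions only (expensive).
#   3. Extrapolate the large-model loss for every coalition and feed the
#      estimates into the Shapley computation, e.g. exact_shapley above.
# predict = fit_loss_scaling(anchor_small_losses, anchor_large_losses)
# est_large_loss = {S: predict(small_loss[S]) for S in all_coalitions}
```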

Using the large model to predict the small model:

Potential hurdles

| # | Hurdle |
| --- | --- |
| 1 | The large model is trained on only one coalition, (1,1,1,…), i.e. every party's data participates in training, rather than on all 2^n coalitions. |
| 2 | Suppose we did have the large-model loss for each of the 2^n coalitions; we would like a one-to-one mapping that "infers" the small-model result from each large-model result, and then apply the earlier idea (use the small-model training results for the 2^n coalitions to infer the corresponding large-model results, then compute the SV with the formula). |
| 3 | In any case, if knowledge distillation is used for the large → small one-to-one mapping, data is still needed to train the small model (to fit the soft labels), so the small model's training process remains under the companies' own control. |
| 4 | If knowledge distillation is done with self-generated data, does the resulting small model still satisfy the scaling law? (The same data combination (coalition) is no longer used.) |

Possible approaches