Self-training with Noisy Student improves ImageNet classification

In our experiments, we use dropout[63], stochastic depth[29], and data augmentation[14] to noise the student. The performance consistently drops when the noise function is removed. Self-Training With Noisy Student Improves ImageNet Classification. Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Then we finetune the model with a larger resolution for 1.5 epochs on unaugmented labeled images. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. However, during the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that the student generalizes better than the teacher. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. EfficientNet-L1 approximately doubles the training time of EfficientNet-L0. Paper: https://arxiv.org/abs/1911.04252. Our study shows that using unlabeled data improves accuracy and general robustness. Flip probability is the probability that the model changes its top-1 prediction under different perturbations. Amongst other components, Noisy Student implements self-training in the context of semi-supervised learning. We iterate this process by putting back the student as the teacher.
Finally, we iterate the algorithm a few times by treating the student as a teacher to generate new pseudo labels and train a new student. We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. The main use case of knowledge distillation is model compression by making the student model smaller. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. We determine the number of training steps and the learning rate schedule from the batch size for labeled images. Compared to consistency training[45, 5, 74], the self-training / teacher-student framework is better suited for ImageNet because we can train a good teacher on ImageNet using labeled data. Finally, for classes that have fewer than 130K images, we duplicate some images at random so that each class has 130K images. Using self-training with Noisy Student, together with 300M unlabeled images, we improve EfficientNet's[69] ImageNet top-1 accuracy to 87.4%. As a comparison, our method only requires 300M unlabeled images, which are perhaps easier to collect. Train a larger classifier on the combined set, adding noise (noisy student).
Self-training was previously used to improve ResNet-50 from 76.4% to 81.2% top-1 accuracy[76], which is still far from the state-of-the-art accuracy. Related work proposes a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images to improve the performance for a given target architecture, like ResNet-50 or ResNeXt. In particular, we set the survival probability in stochastic depth to 0.8 for the final layer and follow the linear decay rule for other layers. This result is also a new state-of-the-art and 1% better than the previous best method that used an order of magnitude more weakly labeled data[44, 71]. Noisy Student Training is a semi-supervised learning approach. We then train a student model which minimizes the combined cross entropy loss on both labeled images and unlabeled images. We found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. Whether the model benefits from more unlabeled data depends on the capacity of the model: a small model can easily saturate, while a larger model can benefit from more data.
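The stochastic depth setting described above (a survival probability of 0.8 for the final layer, with a linear decay rule for the other layers) can be sketched as follows. This is a minimal illustration of the standard linear decay rule, not the authors' code; the function name and the 4-layer example are hypothetical.

```python
def survival_probabilities(num_layers, final_survival=0.8):
    """Per-layer survival probabilities under a linear decay rule.

    Layer l (1-indexed) is kept with probability
    1 - (l / num_layers) * (1 - final_survival), so early layers are
    almost always kept and the last layer is kept with `final_survival`.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    drop_at_end = 1.0 - final_survival
    return [1.0 - (layer / num_layers) * drop_at_end
            for layer in range(1, num_layers + 1)]


# Hypothetical 4-layer network; real EfficientNets have many more blocks.
probs = survival_probabilities(num_layers=4)
print([round(p, 2) for p in probs])
```

During training, each residual block would be skipped with probability one minus its survival probability; at inference time all blocks are kept.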
Self-training with Noisy Student improves ImageNet classification (CVPR 2020). Code: https://github.com/google-research/noisystudent. Unlabeled images come from the JFT dataset: an EfficientNet-B0 trained on ImageNet labels them, images with confidence above 0.3 are kept, and 130K images are used per class. EfficientNets are the baseline models, and EfficientNet-B7 is scaled up to EfficientNet-L0, L1, and L2. The batch size is 2048 (512, 1024, and 2048 are also listed). Students larger than EfficientNet-B4, including EfficientNet-L0, L1, and L2, are trained for 350 epochs, and smaller students for 700 epochs. Iterative training proceeds from EfficientNet-B7 to EfficientNet-L0, then from EfficientNet-L0 to EfficientNet-L1, and then from EfficientNet-L1 to EfficientNet-L2. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores. Further, Noisy Student outperforms the state-of-the-art accuracy of 86.4% by FixRes ResNeXt-101 WSL[44, 71], which requires 3.5 billion Instagram images labeled with tags. As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher's knowledge. We first improved the accuracy of EfficientNet-B7 using EfficientNet-B7 as both the teacher and the student.
We use the labeled images to train a teacher model using the standard cross entropy loss. Code is available at https://github.com/google-research/noisystudent. The biggest gain is observed on ImageNet-A: our method achieves 3.5x higher accuracy on ImageNet-A, going from 16.6% of the previous state-of-the-art to 74.2% top-1 accuracy. We then use the teacher model to generate pseudo labels on unlabeled images. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible. We use EfficientNets[69] as our baseline models because they provide better capacity for more data. In the top-left image, the model without Noisy Student ignores the sea lions and mistakenly recognizes a buoy as a lighthouse, while the model with Noisy Student can recognize the sea lions. Noisy Student improves adversarial robustness against an FGSM attack even though the model is not optimized for adversarial robustness. For this purpose, we use a much larger corpus of unlabeled images, where some images may not belong to any category in ImageNet. For example, with all noise removed, the accuracy drops from 84.9% to 84.3% in the case with 130M unlabeled images and drops from 83.9% to 83.2% in the case with 1.3M unlabeled images.
For a small student model, using our best model Noisy Student (EfficientNet-L2) as the teacher model leads to more improvements than using the same model as the teacher, which shows that it is helpful to push the performance with our method when small models are needed for deployment. Finally, in the above, we say that the pseudo labels can be soft or hard. Probably due to the same reason, at ε=16, EfficientNet-L2 achieves an accuracy of 1.1% under a stronger attack, PGD with 10 iterations[43], which is far from the SOTA results. In addition to improving state-of-the-art results, we conduct additional experiments to verify if Noisy Student can benefit other EfficientNet models. The top-1 accuracy of prior methods is computed from their reported corruption error on each corruption. Use a model to predict pseudo-labels on the filtered data. This is not an officially supported Google product. Notably, EfficientNet-B7 achieves an accuracy of 86.8%, which is 1.8% better than the supervised model. For ImageNet checkpoints trained by Noisy Student Training, please refer to the EfficientNet GitHub repository. A model trained on ImageNet is run over the JFT dataset to predict a label for each image.
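The distinction between soft and hard pseudo labels mentioned above can be made concrete with a small sketch. It assumes the teacher outputs a probability vector per image; the helper names are hypothetical and the cross entropy shown is the generic definition, not a specific library call.

```python
import math


def hard_pseudo_label(probs):
    """One-hot vector for the teacher's most probable class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


def soft_pseudo_label(probs):
    """The teacher's full distribution, renormalized to sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


teacher_probs = [0.2, 0.7, 0.1]   # hypothetical teacher output
student_probs = [0.3, 0.5, 0.2]   # hypothetical student output
print(cross_entropy(hard_pseudo_label(teacher_probs), student_probs))
print(cross_entropy(soft_pseudo_label(teacher_probs), student_probs))
```

With a hard label only the teacher's top class contributes to the student's loss; with a soft label every class contributes in proportion to the teacher's confidence.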
(Submitted on 11 Nov 2019) We present a simple self-training method that achieves 87.4% top-1 accuracy on ImageNet, which is 1.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. First, we run an EfficientNet-B0 trained on ImageNet[69]. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet. For simplicity, we experiment with using 1/128, 1/64, 1/32, 1/16, and 1/4 of the whole data by uniformly sampling images from the unlabeled set, though taking the images with the highest confidence leads to better results. Due to the large model size, the training time of EfficientNet-L2 is approximately five times the training time of EfficientNet-B7. The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student. As can be seen, our model with Noisy Student makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student flips predictions frequently. This way, we can isolate the influence of noising on unlabeled images from the influence of preventing overfitting for labeled images. We use the same architecture for the teacher and the student and do not perform iterative training.
The top-1 accuracy is simply the average top-1 accuracy for all corruptions and all severity degrees. Please refer to [24] for details about mCE and AlexNet's error rate. In other words, using Noisy Student makes a much larger impact on the accuracy than changing the architecture. We improved it by adding noise to the student so that it learns beyond the teacher's knowledge. To intuitively understand the significant improvements on the three robustness benchmarks, we show several images in Figure 2 where the predictions of the standard model are incorrect and the predictions of the Noisy Student model are correct. Noisy Student Training is a semi-supervised training method which achieves 88.4% top-1 accuracy on ImageNet and surprising gains on robustness and adversarial benchmarks. Note that these adversarial robustness results are not directly comparable to prior works since we use a large input resolution of 800x800 and adversarial vulnerability can scale with the input dimension[17, 20, 19, 61].
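As a rough illustration of the two ImageNet-C metrics mentioned above, the sketch below computes the mean top-1 accuracy over all corruptions and severities, and a mean corruption error (mCE) in which each corruption's summed error is divided by AlexNet's summed error before averaging. The error values are made up for the example, and the dictionary layout is an assumption; see [24] for the authoritative definition.

```python
def mean_top1_accuracy(errors_by_corruption):
    """Average top-1 accuracy over every corruption and severity level."""
    accuracies = [1.0 - err
                  for errors in errors_by_corruption.values()
                  for err in errors]
    return sum(accuracies) / len(accuracies)


def corruption_error(model_errors, alexnet_errors):
    """Error summed over severities, normalized by AlexNet's summed error."""
    return sum(model_errors) / sum(alexnet_errors)


def mean_corruption_error(model, alexnet):
    """mCE in percent: normalized corruption errors averaged over corruptions."""
    ces = [corruption_error(model[name], alexnet[name]) for name in model]
    return 100.0 * sum(ces) / len(ces)


# Made-up error rates for two corruptions at two severity levels each.
model_errors = {"noise": [0.2, 0.4], "blur": [0.1, 0.3]}
alexnet_errors = {"noise": [0.5, 0.7], "blur": [0.4, 0.4]}
print(mean_top1_accuracy(model_errors))
print(mean_corruption_error(model_errors, alexnet_errors))
```

Dividing by AlexNet's error keeps an easy corruption and a hard corruption on a comparable scale, so neither dominates the average.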
Scaling width and resolution by c leads to c^2 times training time and scaling depth by c leads to c times training time. The abundance of data on the internet is vast. The main difference between Data Distillation and our method is that we use the noise to weaken the student, which is the opposite of their approach of strengthening the teacher by ensembling. Specifically, we train the student model for 350 epochs for models larger than EfficientNet-B4, including EfficientNet-L0, L1 and L2, and train the student model for 700 epochs for smaller models. Self-training achieved the state-of-the-art in ImageNet classification within the framework of Noisy Student [1]. Since a teacher model's confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider the high-confidence images as in-domain images and the low-confidence images as out-of-domain images. Noisy Student can still improve the accuracy to 1.6%. Lastly, we will show the results of benchmarking our model on robustness datasets such as ImageNet-A, C and P and adversarial robustness. Here we use unlabeled images to improve the state-of-the-art ImageNet accuracy and show that the accuracy gain has an outsized impact on robustness. The main difference between our method and knowledge distillation is that knowledge distillation does not consider unlabeled data and does not aim to improve the student model. To noise the student, we use dropout[63], data augmentation[14] and stochastic depth[29] during its training.
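The confidence-based split between in-domain and out-of-domain images described above can be sketched in a few lines. The tuple layout, the function name, and the default threshold are assumptions chosen for illustration, not the authors' code.

```python
def split_by_confidence(predictions, threshold=0.3):
    """Split teacher predictions into in-domain and out-of-domain images.

    `predictions` is a list of (image_id, predicted_class, confidence)
    tuples. Images whose confidence exceeds `threshold` are treated as
    in-domain; the rest are treated as out-of-domain.
    """
    in_domain = [p for p in predictions if p[2] > threshold]
    out_of_domain = [p for p in predictions if p[2] <= threshold]
    return in_domain, out_of_domain


# Hypothetical teacher outputs for three unlabeled images.
predictions = [("img_a", 3, 0.90), ("img_b", 7, 0.30), ("img_c", 1, 0.05)]
kept, dropped = split_by_confidence(predictions)
print([p[0] for p in kept], [p[0] for p in dropped])
```

A hard threshold is the simplest choice here; ranking by confidence and keeping the top images per class is another option.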
We use the standard augmentation instead of RandAugment in this experiment. Noisy Student Training is based on the self-training framework and is trained in 4 simple steps, the first of which is to train a classifier on labeled data (the teacher). This way, the pseudo labels are as good as possible, and the noised student is forced to learn harder from the pseudo labels. Then, by using the improved B7 model as the teacher, we trained an EfficientNet-L0 student model. Then, that teacher is used to label the unlabeled data. This accuracy is 1.0% better than the previous state-of-the-art ImageNet accuracy, which requires 3.5B weakly labeled Instagram images. As shown in Figure 1, Noisy Student leads to a consistent improvement of around 0.8% for all model sizes. We call the method self-training with Noisy Student to emphasize the role that noise plays in the method and results. Lastly, we follow the idea of compound scaling[69] and scale all dimensions to obtain EfficientNet-L2.
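The overall loop (train a teacher, pseudo-label the unlabeled data without noise, train a noised student on the combined set, then promote the student to teacher) can be illustrated with a toy example. Everything here is a stand-in: a 1-D nearest-centroid classifier replaces EfficientNet, Gaussian input jitter replaces dropout, stochastic depth, and RandAugment, and the student is not larger than the teacher. It is a sketch of the control flow only.

```python
import random


def train_centroids(points, labels, noise_std=0.0, rng=None):
    """Fit a 1-D nearest-centroid classifier, optionally jittering inputs."""
    rng = rng or random.Random(0)
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        if noise_std > 0:
            x = x + rng.gauss(0.0, noise_std)  # input noise for the student
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def noisy_student(labeled_x, labeled_y, unlabeled_x,
                  iterations=3, noise_std=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on labeled data only, without noise.
    teacher = train_centroids(labeled_x, labeled_y)
    for _ in range(iterations):
        # Step 2: the un-noised teacher generates pseudo labels.
        pseudo_y = [predict(teacher, x) for x in unlabeled_x]
        # Step 3: train a noised student on labeled plus pseudo-labeled data.
        student = train_centroids(labeled_x + unlabeled_x,
                                  labeled_y + pseudo_y,
                                  noise_std=noise_std, rng=rng)
        # Step 4: the student becomes the teacher for the next round.
        teacher = student
    return teacher


model = noisy_student([0.0, 1.0], [0, 1], [-0.2, 0.1, 0.9, 1.2])
print(predict(model, -0.5), predict(model, 1.5))
```

The key asymmetry the sketch preserves is that noise is applied only while the student trains, never while the teacher produces pseudo labels.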
[50] used knowledge distillation on unlabeled data to teach a small student model for speech recognition. Specifically, as all classes in ImageNet have a similar number of labeled images, we also need to balance the number of unlabeled images for each class. Also related to our work is Data Distillation[52], which ensembled predictions for an image with different transformations to teach a student network. The accuracy is improved by about 10% in most settings. In this work, we showed that it is possible to use unlabeled images to significantly advance both accuracy and robustness of state-of-the-art ImageNet models. Our experiments showed that our model significantly improves accuracy on ImageNet-A, C and P without the need for deliberate data augmentation. It also reduces ImageNet-C mean corruption error from 45.7 to 31.2. In both cases, we gradually remove augmentation, stochastic depth and dropout for unlabeled images, while keeping them for labeled images. We do not tune these hyperparameters extensively since our method is highly robust to them. In other words, small changes in the input image can cause large changes to the predictions. The mapping from the 200 classes to the original ImageNet classes is available online: https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py. Hence, whether soft pseudo labels or hard pseudo labels work better might need to be determined on a case-by-case basis. The architectures for the student and teacher models can be the same or different.
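The class balancing of unlabeled images described above can be sketched as follows. The helper name and the tiny target count are hypothetical; the sketch also assumes that each class's list is already sorted by teacher confidence, so that truncating keeps the most confident images.

```python
import random


def balance_classes(images_by_class, target, seed=0):
    """Give every class exactly `target` images.

    Classes with too many images are truncated (assuming the list is
    sorted by confidence, highest first). Classes with too few images
    are padded by duplicating randomly chosen images from that class.
    """
    rng = random.Random(seed)
    balanced = {}
    for label, images in images_by_class.items():
        if not images:
            balanced[label] = []  # nothing to duplicate
        elif len(images) >= target:
            balanced[label] = list(images[:target])
        else:
            extra = [rng.choice(images) for _ in range(target - len(images))]
            balanced[label] = list(images) + extra
    return balanced


# Toy example with a target of 3 images per class instead of 130K.
pool = {"lion": ["l1"], "buoy": ["b1", "b2", "b3", "b4"]}
print(balance_classes(pool, target=3))
```

Duplicating images changes how often a class is sampled without adding new information, which is why it acts as a balancing step rather than as extra data.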
Our main results are shown in Table 1. It is expensive and must be done with great care. Self-training with Noisy Student improves ImageNet classification, by Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le (Google Research, Brain Team; Carnegie Mellon University). This shows that it is helpful to train a large model with high accuracy using Noisy Student when small models are needed for deployment. With Noisy Student, the model correctly predicts dragonfly for the image. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. This approach not only surpasses the top-1 ImageNet accuracy of SOTA models by 1%, it also improves the robustness of the model.
The architecture specifications of EfficientNet-L0, L1 and L2 are listed in Table 7. ImageNet-P mean flip rate is reduced from 27.8 to 16.1. We hypothesize that the improvement can be attributed to SGD, which introduces stochasticity into the training process. The ImageNet-A test set[25] consists of difficult images that cause significant drops in accuracy for state-of-the-art models. Noisy Student leads to significant improvements across all model sizes for EfficientNet. The performance drops when we further reduce it. The paper appears on pp. 10687-10698 of the CVPR 2020 proceedings. Please refer to [24] for details about mFR and AlexNet's flip probability.
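To make the flip probability and mean flip rate (mFR) concrete, the sketch below counts how often the top-1 prediction changes between consecutive frames of a perturbation sequence, then normalizes by AlexNet's flip probability for each perturbation type. It is a simplified reading of the metric with made-up inputs; [24] gives the full definition, including the special handling of noise sequences that is omitted here.

```python
def flip_probability(prediction_sequences):
    """Fraction of consecutive frame pairs whose top-1 prediction differs."""
    flips = 0
    pairs = 0
    for sequence in prediction_sequences:
        for previous, current in zip(sequence, sequence[1:]):
            flips += previous != current
            pairs += 1
    return flips / pairs if pairs else 0.0


def mean_flip_rate(model_fp, alexnet_fp):
    """mFR in percent: per-perturbation flip probability over AlexNet's."""
    rates = [model_fp[name] / alexnet_fp[name] for name in model_fp]
    return 100.0 * sum(rates) / len(rates)


# Hypothetical top-1 predictions over two short perturbation sequences.
sequences = [["cat", "cat", "dog", "dog"], ["car", "car", "car", "car"]]
print(flip_probability(sequences))

# Made-up flip probabilities for two perturbation types.
print(mean_flip_rate({"tilt": 0.1, "zoom": 0.2}, {"tilt": 0.2, "zoom": 0.4}))
```

A lower flip rate means the prediction stays stable as the image is perturbed, which is the behavior attributed to the Noisy Student model in this document.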
The score is normalized by AlexNet's error rate so that corruptions with different difficulties lead to scores of a similar scale. For more information about the large architectures, please refer to Table 7 in Appendix A.1. Finally, the training time of EfficientNet-L2 is around 2.72 times the training time of EfficientNet-L1. In the above experiments, iterative training was used to optimize the accuracy of EfficientNet-L2, but here we skip it as it is difficult to use iterative training for many experiments. Our experiments show that an important element for this simple method to work well at scale is that the student model should be noised during its training while the teacher should not be noised during the generation of pseudo labels. We used the version from [47], which filtered the validation set of ImageNet. Next, a larger student model is trained on the combination of all data and achieves better performance than the teacher by itself. We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant.
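Training-time ratios such as the one quoted above follow from how cost grows with model dimensions: scaling width or resolution by c multiplies training time by c^2, and scaling depth by c multiplies it by c. The sketch below combines these factors, assuming they multiply when several dimensions are scaled at once (the text states each factor only in isolation). The scale values in the example are hypothetical and are not the EfficientNet-L0, L1, or L2 specifications.

```python
def relative_training_time(width_scale=1.0, depth_scale=1.0,
                           resolution_scale=1.0):
    """Estimated training-time multiplier for a scaled model.

    Width and resolution contribute quadratically, depth linearly.
    Multiplying the three factors together is an assumption.
    """
    return (width_scale ** 2) * depth_scale * (resolution_scale ** 2)


# Hypothetical scalings: a 1.4x wider model costs roughly twice as much.
print(relative_training_time(width_scale=1.4))
print(relative_training_time(width_scale=2.0, depth_scale=3.0))
```

This kind of estimate ignores memory limits and hardware utilization, so measured ratios can differ from the computed ones.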
Our work is based on self-training (e.g.,[59, 79, 56]). For smaller models, we set the batch size of unlabeled images to be the same as the batch size of labeled images. Although noise may appear to be limited and uninteresting, when it is applied to unlabeled data, it has a compound benefit of enforcing local smoothness in the decision function on both labeled and unlabeled data. We apply dropout to the final classification layer with a dropout rate of 0.5. For instance, on ImageNet-A, Noisy Student achieves 74.2% top-1 accuracy, which is approximately 57% more accurate than the previous state-of-the-art model. We first report the validation set accuracy on the ImageNet 2012 ILSVRC challenge prediction task as commonly done in the literature[35, 66, 23, 69] (see also [55]).
During this process, we kept increasing the size of the student model to improve the performance.