My team spoke very highly of this blog post (and they’re also wondering if self-supervised learning could eliminate the need for labeling entirely), so I gave it a read. It was a very well-written, thorough overview of self-supervised learning. What stood out the most was that it was written by Dr. LeCun, one of the people I respect most in this field. I imagine his schedule must be brutal, but I appreciate that he still finds time to write. Imagine how big a loss it would be if Yann couldn’t spend time on actual research and writing because of management duties.
My knowledge and understanding of this topic are not comparable to Dr. LeCun’s, but here is my take:
1. Self-supervised learning is a rebranded term, rather than a new method.
At SOCML 2017, in a small room of people including Bill Dally and @goodfellow_ian, I raised a question:
“Is word2vec supervised, or unsupervised? It is supervised in the sense that we punish wrong predictions during training, but the corpus is actually not labeled. So it’s unsupervised as well.”
No one in the room answered my question with “it’s called self-supervised learning!” (fair enough, LeCun wasn’t there). Self-supervised learning wasn’t a popular term three years ago. I believe LeCun used to call it predictive unsupervised learning. In retrospect, word2vec, BERT, XLM, etc. all fall under the umbrella of self-supervised methods, but when they were published, none of the authors advertised them as such.
[For people who aren’t familiar with the term ‘self-supervised learning’]
It’s ‘self’ in the sense that you use your own unlabeled training data for supervision. For example, in language models, you predict the word that comes next given a sentence prefix and compare that prediction with the actual next word in the corpus.
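To make that concrete, here is a minimal sketch of the setup in PyTorch (the toy vocabulary and the TinyLM model are hypothetical, purely for illustration). The key point is that the “labels” are nothing but the same corpus shifted by one position, so no human annotation is involved:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; in practice this comes from a tokenizer.
vocab = ["the", "cat", "sat", "on", "mat"]
stoi = {w: i for i, w in enumerate(vocab)}

# Unlabeled text: the "labels" are just the next words in the corpus itself.
tokens = torch.tensor([[stoi[w] for w in ["the", "cat", "sat", "on", "mat"]]])
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position

class TinyLM(nn.Module):
    # A deliberately tiny next-word predictor: embedding -> LSTM -> vocab logits.
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = TinyLM(len(vocab))
logits = model(inputs)  # shape: (batch, seq_len, vocab_size)

# "Punish wrong predictions": cross-entropy against the shifted corpus.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), targets.reshape(-1)
)
loss.backward()
```

BERT’s masked-word prediction is the same idea with masked positions instead of next positions: the supervision signal is still manufactured from the raw text itself.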
2. The blog admires NLP’s discreteness for making problems tractable (as there are only a finite number of possible predictions) and for being well suited to predictive architectures.
However, from my experience, discreteness is a double-edged sword. It’s also the very reason NLP problems (especially generation) “don’t work well” or are “hard to control” compared to CV. For example, NLG systems tend to output obviously wrong tokens under slight perturbations because of this discrete nature; they are very sensitive. On the other hand, small perturbations…