I think his argument is that the models initially needed lots of data to verify and validate that the approach worked at all. Subsequent advances may have allowed those models to be created cleanly, but those advances relied on tainted data, thus making the advances themselves tainted.
I’m not sure I agree with that argument. It’s like saying that if you invented a cure for cancer that relied on morally bankrupt means, you shouldn’t use that cure. I’d say there should be a legal process against the person who committed the illegal acts, but once you have discovered something it stands on its own two feet. Perhaps, however, there should be some kind of reparations given to the people who were abused in that process.
I don’t think that’s what they are saying. It’s not that you can’t now, it’s that initially people did need to use a lot of data. Then they found tricks to improve training on less, but those tricks came about only after people saw what was possible. Their argument goes: since the original models needed such data, and we wouldn’t have been able to improve the techniques without knowing that huge neural nets trained on lots of data were effective, subsequent models are tainted by the original sin of requiring all this data.
As I said above, I don’t think subsequent models are necessarily tainted, but I find it hard to argue with the fact that the original models used data they shouldn’t have, and that without it we wouldn’t be where we are today, which seems unfair to the uncompensated humans who produced the dataset.