The code that makes up scikit-learn has been committed to by 8529+ people:
- python/cpython: 2640
- numpy/numpy: 1581
- scipy/scipy: 1447
- joblib/threadpoolctl: 10
- scikit-learn/scikit-learn: 2837
It is possible that some of these people contributed only typo fixes, or that the code they wrote has since been rewritten by someone else. But the repositories were affected by them all the same. Their accounts were trusted to contribute to the history of a notable library in the ML training community, and the more valuable ML becomes, the greater the desire to influence its capabilities. Management of this code is split between five different repos across five organizations.
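Counts like these can be reproduced from each repository's history. Here is a minimal sketch, assuming local clones of each repo; the directory names and the `contributor_count` helper are illustrative, and git's author count will differ slightly from GitHub's contributor page, which deduplicates by account:

```python
import subprocess

def contributor_count(repo_path: str) -> int:
    """Count distinct commit authors in a local clone.

    `git shortlog -sn HEAD` prints one line per author,
    so the number of lines approximates the contributor count.
    """
    result = subprocess.run(
        ["git", "shortlog", "-sn", "HEAD"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return len(result.stdout.splitlines())

# Illustrative paths; clone each repo first, e.g.
#   git clone https://github.com/scikit-learn/scikit-learn
for repo in ["cpython", "numpy", "scipy", "threadpoolctl", "scikit-learn"]:
    print(repo, contributor_count(repo))
```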
And this is only one training library; the ML community relies on an ever-growing number of them.
What about the drivers that perform the GPU operations on the data? The operating system? The base model or training data pulled from Hugging Face? The other applications running on the beefy laptop where the data scientist trains the model? Who has access to each of these, and what code could they sneak in? What if one of their accounts were hacked and someone else made the changes?
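There is no single answer, but one basic mitigation applies to several links in this chain: pin and verify the digest of every artifact before using it. A minimal sketch; the filename and pinned digest are placeholders:

```python
import hashlib

# Illustrative mitigation: verify downloaded model weights or a
# dataset archive against a digest pinned when the artifact was vetted.
PINNED_SHA256 = "0" * 64  # placeholder; record the real digest at vetting time

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

digest = sha256_of("model.safetensors")  # placeholder filename
if digest != PINNED_SHA256:
    raise RuntimeError(f"artifact digest mismatch: {digest}")
```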
We are cresting a significant surge of value from advancements in ML. Influencing how AI performs could mean changing how significant decisions are made. I have illustrated just one threat vector in an expanding attack surface.
Security recommendations for ML pipelines are still in their early days. Most AI security work focuses on adversarial attacks, often overlooking the process of building a model. Securing that process is a cross-disciplinary effort between data scientists, software engineers, and infrastructure teams.
For such a significant achievement in human advancement, I think it is essential to map out this problem space further and to focus on securing this infrastructure.