Artificial intelligence (AI) companies are extensively collecting publicly available internet data, a process known as data scraping, to train their models. The collected material spans a wide range of user-generated content: social media posts, photos, comments, location data, and click patterns. This data is used to build capabilities such as facial recognition, language models that mimic human writing styles, movement-prediction algorithms, and detailed psychological profiles.

Recent research highlights the significant scale of this practice, with studies indicating that nearly all leading AI models have been trained using scraped data from the open web and social media users. Despite the widespread use of their content, a substantial majority of social media users, estimated at over 75%, remain unaware that their public posts contribute to AI model training.

The implications of this data collection extend to personal privacy and digital identity. Experts suggest that AI scraping fundamentally alters digital privacy by eliminating "privacy by obscurity," transforming indexed posts into permanent data points for behavioral prediction or likeness recreation. This can contribute to "Identity Syntheticism," where scraped voices and faces might be used to create deepfakes and facilitate social engineering. Platforms' terms of service often grant them licenses to use user content for purposes including AI model development, as seen with companies like Meta and X (formerly Twitter). One analysis from 2025 found that nearly all major social media platforms were using user content for AI training by default.

While major platforms collect extensive user data, they also offer mechanisms for users to limit its use. Users are increasingly encouraged to strengthen their digital hygiene to manage their online footprint.

Key actions individuals can take include:

* **Adjusting Privacy Settings:** Setting all social media accounts to private, disabling facial recognition features, and refraining from posting real-time location data.

* **Managing App Permissions:** Regularly reviewing and revoking camera, microphone, and contact access for unused applications.

* **Controlling Online Footprint:** Utilizing tools to remove personal information from search results, using alias email addresses to prevent cross-platform identity linking, and declining analytics cookies.

* **Platform-Specific Measures:** Users can submit objection forms on platforms like Facebook to limit AI training data use, protect their accounts on X to prevent new data from feeding AI, and pause activity tracking on Google services. LinkedIn also offers full opt-out options for generative AI improvement.

* **Operating System Settings:** On iOS, users can disable cross-app tracking, limit location services, restrict photo access, and turn off personalized ads and analytics sharing. Android users can delete advertising IDs, pause web and app activity, and disable Gemini AI activity and location history.

* **Advanced Protection:** Some experts recommend "data poisoning" tools like Glaze or Nightshade, which add invisible noise to images to confuse AI models without affecting human viewing.
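Glaze and Nightshade compute carefully targeted adversarial perturbations, which is far more sophisticated than anything shown here. As a toy illustration of the underlying idea only (small, bounded pixel changes that a human viewer would not notice), the following sketch adds clamped random noise to a flat list of grayscale pixel values; the function name and parameters are illustrative, not part of either tool:

```python
import random

def add_imperceptible_noise(pixels, max_delta=3, seed=42):
    """Perturb each 0-255 pixel value by at most max_delta.

    Toy illustration only: real tools like Glaze or Nightshade use
    targeted adversarial optimization, not random noise.
    """
    rng = random.Random(seed)
    return [
        min(255, max(0, p + rng.randint(-max_delta, max_delta)))
        for p in pixels
    ]

# A tiny fake grayscale "image" as a flat pixel list.
image = [0, 64, 128, 192, 255] * 4
poisoned = add_imperceptible_noise(image)

# Every pixel stays within max_delta of the original and in valid range.
assert all(abs(a - b) <= 3 for a, b in zip(image, poisoned))
assert all(0 <= p <= 255 for p in poisoned)
```

Random noise like this would not actually fool a model; the point is only that the perturbation budget is small enough to be invisible to people while still altering the data an AI system ingests.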
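The alias email tactic from the list above is commonly implemented with "plus-addressing," which many mail providers support: mail sent to `user+site@example.com` is delivered to `user@example.com`, but each service sees a distinct address that cannot be trivially joined with the others. A minimal sketch (the address and function name are hypothetical examples):

```python
def alias_for(site, base="user@example.com"):
    """Derive a per-service plus-address from a base mailbox.

    Each service gets a unique address, which hinders cross-platform
    identity linking and reveals which service leaked or sold the address.
    """
    local, domain = base.split("@")
    return f"{local}+{site}@{domain}"

# One distinct address per platform, all delivered to the same inbox.
assert alias_for("socialapp") == "user+socialapp@example.com"
assert alias_for("photosite") == "user+photosite@example.com"
```

Note that plus-addressing is easy for a determined data broker to strip; dedicated alias services that issue fully independent addresses offer stronger unlinkability.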

Experts emphasize that keeping social media accounts private is a crucial first step, particularly for platforms like Facebook and Instagram. They also advise caution regarding "sharenting," the practice of parents sharing images of their children online, due to the potential for a massive, unchosen digital footprint that could lead to impersonation or abuse in the future. The widespread nature of AI data scraping underscores the evolving landscape of digital privacy and the need for greater user awareness and proactive management of online data.