Saturday, April 6, 2024

g-f(2)2191 The AI Gold Rush: Tech Giants' Ruthless Pursuit of Data

 


genioux Fact post by Fernando Machuca and Claude



Introduction:


As the race to develop advanced artificial intelligence (AI) systems intensifies, tech giants like OpenAI, Google, and Meta are pushing the boundaries of data collection and usage. The New York Times article "How Tech Giants Cut Corners to Harvest Data for A.I." sheds light on the questionable practices these companies have engaged in to acquire the massive amounts of data needed to train their AI models, including ignoring corporate policies, altering their own rules, and discussing ways to skirt copyright law.



genioux GK Nugget:


"The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology, leading tech companies to cut corners, ignore corporate policies, and debate bending the law." — The New York Times, April 6, 2024



genioux Foundational Fact:


The rapid advancement of AI technology has led to an insatiable demand for data to train increasingly powerful models. As the available high-quality data on the internet dwindles, tech giants are resorting to controversial methods to acquire the necessary information, such as transcribing YouTube videos without permission, considering the purchase of publishing houses for their content, and debating the use of copyrighted material without proper licensing.



The 10 most relevant genioux Facts:


  1. OpenAI transcribed more than one million hours of YouTube videos to train its GPT-4 model, potentially violating YouTube's rules and copyright law.
  2. Google also transcribed YouTube videos to harvest text for its AI models, potentially infringing on creators' copyrights.
  3. Meta executives discussed buying the publishing house Simon & Schuster to procure long works for AI training and debated using copyrighted material without permission.
  4. Tech companies could run through the high-quality data on the internet as soon as 2026, according to Epoch, a research institute.
  5. The volume of data used to train AI models has increased exponentially since the publication of a groundbreaking paper by Jared Kaplan in 2020, which emphasized the importance of data scale.
  6. Google broadened its terms of service to allow the use of publicly available Google Docs, restaurant reviews on Google Maps, and other online material for its AI products.
  7. The growing use of creative works by AI companies has prompted lawsuits over copyright and licensing, with creators arguing that their content is being used without permission or payment.
  8. Tech companies are exploring the use of "synthetic" data, which is generated by AI models themselves, to reduce their dependence on copyrighted material.
  9. The ethical concerns surrounding the use of intellectual property without fair compensation for authors and artists have been raised but are often overlooked by tech executives.
  10. The race to acquire data for AI training has led to a debate about how the fair use doctrine applies to AI development.



Conclusion:


The revelations in "How Tech Giants Cut Corners to Harvest Data for A.I." underscore the urgent need for a more comprehensive and ethical approach to data collection and usage in the development of AI technologies. As the demand for data continues to grow, it is crucial that tech companies, policymakers, and creators work together to establish clear guidelines and regulations that protect intellectual property rights while fostering innovation. The current practices of ignoring corporate policies, altering rules, and skirting copyright law are unsustainable and risk undermining public trust in AI and the companies developing these powerful technologies. As we move forward, it is essential that the AI industry prioritizes transparency, accountability, and respect for creators' rights to ensure that the benefits of AI are realized in a fair and equitable manner.



REFERENCE

The g-f GK Article




Classical Summary:


In "How Tech Giants Cut Corners to Harvest Data for A.I.," The New York Times exposes the questionable practices employed by leading technology companies such as OpenAI, Google, and Meta in their relentless pursuit of data to train advanced artificial intelligence (AI) systems. As the race to develop cutting-edge AI intensifies, these companies have resorted to ignoring corporate policies, altering their own rules, and even considering ways to circumvent copyright law to acquire the massive amounts of data needed to fuel their AI ambitions.


The article reveals that OpenAI transcribed more than one million hours of YouTube videos without permission to train its GPT-4 model, potentially violating both YouTube's rules and copyright law. Similarly, Google engaged in transcribing YouTube videos to harvest text for its AI models, which may infringe on creators' copyrights. Meanwhile, Meta executives discussed the possibility of purchasing the publishing house Simon & Schuster to gain access to long works for AI training and debated using copyrighted material without proper licensing.


The increasing demand for data is driven by the exponential growth in the volume of information required to train AI models, as highlighted by Jared Kaplan's groundbreaking paper in 2020. With tech companies projected to exhaust the available high-quality data on the internet as early as 2026, they are turning to controversial methods to secure the necessary information.


Google, for example, broadened its terms of service to allow the use of publicly available Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. This growing use of creative works by AI companies has led to lawsuits over copyright and licensing, with creators arguing that their content is being used without permission or fair compensation.


To reduce their reliance on copyrighted material, tech companies are exploring the use of "synthetic" data generated by AI models themselves. However, the ethical concerns surrounding the use of intellectual property without proper attribution or compensation have often been overlooked by tech executives in their pursuit of AI dominance.


The article underscores the urgent need for a more ethical and regulated approach to data collection and usage in AI development. As the AI industry continues to grow, it is crucial that tech companies, policymakers, and creators collaborate to establish clear guidelines that protect intellectual property rights while fostering innovation. The current practices of ignoring corporate policies, altering rules, and skirting copyright law are unsustainable and risk eroding public trust in AI and the companies behind these powerful technologies. Moving forward, the AI industry must prioritize transparency, accountability, and respect for creators' rights to ensure that the benefits of AI are realized in a fair and equitable manner.



Cade Metz


Cade Metz is a technology correspondent for The New York Times, based in its San Francisco bureau⁵. He specializes in covering artificial intelligence, driverless cars, robotics, virtual reality, and other emerging technologies¹⁴.


Metz has been covering technology for over 30 years at various publications, including The Times and Wired magazine¹². He majored in English literature in college and also studied math and computer science¹². His father was a computer programmer¹².


He is the author of “Genius Makers: The Mavericks Who Brought A.I. to Google, Facebook, and the World,” a book that tells the story of the people, ideas, and companies behind the rapid rise of artificial intelligence¹².


Metz says his goal as a journalist is to tell the truth and help people understand the world, and that he works to ensure each sentence is true and each story gives the full picture of what is happening¹².


Source: Conversation with Bing, 4/7/2024

(1) Cade Metz, Duke English Alum, Technology Correspondent. https://english.duke.edu/news/cade-metz-duke-english-alum-technology-correspondent.

(2) Cade Metz - The New York Times. https://www.nytimes.com/by/cade-metz.

(3) Artificial Intelligence: Changing Our World - nytimes.com. https://timesevents.nytimes.com/AI.

(4) Cade Metz - Page 2 - The New York Times. https://www.nytimes.com/by/cade-metz?page=2.

(5) Cade Metz - Journalist Profile - Intelligent Relations. https://intelligentrelations.com/journalist/cade-metz/.



Cecilia Kang


Cecilia Kang is a national technology correspondent for The New York Times, based in Washington¹². She covers issues at the intersection of technology, policy, and politics¹². Her areas of focus include the regulation of artificial intelligence, federal action against tech giants for antitrust and consumer abuses, and the tech war between the U.S. and China¹².


Kang has been writing about technology for about two decades¹². Prior to joining The Times, she was the senior technology reporter at The Washington Post and also covered technology for the San Jose Mercury News¹².


She co-authored “An Ugly Truth: Inside Facebook’s Battle For Domination,” published in 2021, with her colleague, Sheera Frenkel¹². The book provides an inside look into Facebook's strategies and challenges.


Kang's work has been recognized with several awards, including the George Polk and Loeb awards¹². She adheres to the standards of integrity outlined in The Times’s Ethical Journalism handbook¹².


Source: Conversation with Bing, 4/7/2024

(1) Cecilia Kang - The New York Times. https://www.nytimes.com/by/cecilia-kang.

(2) Cecilia Kang - Page 2 - The New York Times. https://www.nytimes.com/by/cecilia-kang?page=2.

(3) Cecilia Kang | Meridian International Center. https://www.meridian.org/profile/cecilia-kang/.

(4) Cecilia Kang - The Aspen Institute. https://www.aspeninstitute.org/people/cecilia-kang/.



Sheera Frenkel


Sheera Frenkel is a reporter for The New York Times, based in the San Francisco Bay Area¹. She covers social media companies, including Facebook, Instagram, Twitter, TikTok, YouTube, Telegram, and WhatsApp¹. Her work focuses on the dynamics within these companies, how their executives make decisions, and how these decisions impact billions of people worldwide¹.


Frenkel has covered technology for almost a decade, reporting on topics ranging from the first documented cases of cyberwarfare in the Middle East to the social media manipulations that are rampant in today's internet landscape¹. 


In 2021, she co-authored “An Ugly Truth: Inside Facebook’s Battle for Domination” with her Times colleague, Cecilia Kang¹. The book, a New York Times and international bestseller, was based on reporting she did for The Times on how Facebook failed to protect people’s data and understand the manipulations happening on its platform¹.


Before joining The Times, Frenkel covered cybersecurity for BuzzFeed and was a foreign correspondent based in the Middle East from 2005 to 2015¹³. Her work has been recognized with several awards, including being a finalist for the Pulitzer Prize for national reporting¹.


Source: Conversation with Bing, 4/7/2024

(1) Sheera Frenkel - The New York Times. https://www.nytimes.com/by/sheera-frenkel.

(2) Oakland author of Facebook exposé cut her teeth in the Mideast - J.. https://jweekly.com/2021/08/02/oakland-reporter-and-author-of-facebook-expose-cut-her-teeth-in-the-mideast/.

(3) Sheera Frenkel - Page 7 - The New York Times. https://www.nytimes.com/by/sheera-frenkel?page=7.



Stuart A. Thompson


Stuart A. Thompson is a reporter for The New York Times, where he covers how false and misleading information spreads online and how it affects people around the world¹. He focuses on misinformation, disinformation, and other misleading content¹⁵. 


Thompson has written for newspapers his entire career, starting in Canada at The Globe and Mail¹. He has a background in visual journalism, and many of his stories have interactive or visual components¹. He was previously the graphics director for The Wall Street Journal, where he managed a team of 30 journalists focused on print and online visual stories¹. 


He joined The New York Times in 2014 to create and run the Opinion department’s first visual journalism team¹³. He then joined The Times’s newsroom in 2021 to report on the spread of false and misleading information¹³. 


Thompson was part of a team that won a Pulitzer Prize at The Wall Street Journal for “Medicare Unmasked”¹³. He was a Pulitzer finalist in 2018 for an editorial package in the Opinion department about domestic violence¹. He was also a Livingston Award finalist in 2020 for his series on privacy called “One Nation, Tracked”¹.


Source: Conversation with Bing, 4/7/2024

(1) Stuart A. Thompson - The New York Times. https://www.nytimes.com/by/stuart-a-thompson.

(2) Stuart A. Thompson - Page 2 - The New York Times. https://www.nytimes.com/by/stuart-a-thompson?page=2.

(3) Stuart A. Thompson - The New York Times. https://bing.com/search?q=Stuart+A.+Thompson+NYT+summary.

(4) Stuart Thompson Joins Business | The New York Times Company. https://www.nytco.com/press/stuart-thompson-joins-business/.

(5) Stuart A. Thompson. https://stuartathompson.com/.



Nico Grant


Nico Grant is a technology reporter for The New York Times who covers Google from San Francisco¹. He has a background in reporting on cloud computing and hardware companies¹. Before joining The New York Times, he spent five years at Bloomberg News¹. He is based in Oakland, California, and grew up in New York City¹. He attended the Craig Newmark Graduate School of Journalism¹.


Source: Conversation with Bing, 4/7/2024

(1) Nico Grant - The New York Times. https://www.nytimes.com/by/nico-grant.

(2) Nico Grant - Page 3 - The New York Times. https://www.nytimes.com/by/nico-grant?page=3.

(3) Nico Grant - Technology Reporter @ The New York Times - Crunchbase. https://www.crunchbase.com/person/nico-grant.

(4) Nico Grant - Page 2 - The New York Times. https://www.nytimes.com/by/nico-grant?page=2.



The categorization and citation of the genioux Fact post


Categorization


This genioux Fact post is classified as Breaking Knowledge, a category for insights that help readers comprehend the forces molding our world and make sense of news and trends.



Type: Breaking Knowledge, Free Speech



g-f Lighthouse of the Big Picture of the Digital Age [g-f(2)1813, g-f(2)1814]





g-f(2)2191: The Juice of Golden Knowledge




GK Juices or Golden Knowledge Elixirs


References


"genioux facts": The online program on "MASTERING THE BIG PICTURE OF THE DIGITAL AGE", g-f(2)2191, Fernando Machuca and Claude, April 6, 2024, Genioux.com Corporation.


The genioux facts program has established a robust foundation of over 2190 Big Picture of the Digital Age posts [g-f(2)1 - g-f(2)2190].



