Behind AI large model training, a data industry chain is forming

Author: Guo Xiaojing, Tencent Technology

Image source: Generated by Unbounded AI

"Brute force works miracles" and "the aesthetics of brute force": these two phrases have accompanied the discussion of ChatGPT from the start. The "force" in question is not only enormous computing power but also massive amounts of data. Marc Andreessen, co-founder of a16z, also pointed out at the Data+AI conference that the massive data accumulated on the Internet over the past two decades is an important reason for the rise of this new wave of AI, because it supplies the data that can be used for training.

According to OpenAI, GPT-3.5 was trained on a text corpus of up to 45 TB, roughly equivalent to 4.72 million sets of China's Four Great Classical Novels, while GPT-4 adds multimodal data to the training sets of GPT-3 and GPT-3.5. On July 18, Meta, the parent company of Facebook, released Llama 2, its first open-source large language model available for commercial use, pre-trained on 2 trillion tokens.

The ability to obtain massive amounts of high-quality data is regarded as a core competitive strength of future large-model companies, and it is a must-have in the tech giants' AI arms race. Data is also seen as a key factor of production that will determine future development. According to the "Digital China Development Report (2022)", the potential that data elements can release for the digital economy is enormous. China's data output reached 8.1 ZB in 2022, accounting for 10.5% of the global total and ranking second in the world, and China's digital economy is at the forefront of development.

However, data, as a brand-new factor of production, also raises a series of problems that urgently need solving. How should data be understood? How can data rights be confirmed? How can the value of data be mined? Can data really be traded and circulated? Can it really be included in an enterprise's financial statements as an asset? How is its security managed? To this end, we talked with Professor Zeng Xueyun, Deputy Dean of the Institute of Science and Technology of Beijing University of Posts and Telecommunications, and asked her to answer these questions in depth.

The following is the transcript of the conversation:

**Tencent Technology: Ordinary people may wonder where the data for training large models comes from. Is my personal data being used, and are there rights issues with this data?**

**Professor Zeng Xueyun:** The data that large models compute on includes personal data. Unlike corporate data, personal data raises an ownership issue. **In principle, I am the master of my own data.** Take the data generated on social software: in principle, the company that owns the software cannot use my personal data. Although these companies have in practice gained control of the data through default authorization, how specific data may be used must be regulated by the "Personal Information Protection Law".

So if the data is to be used for large-model computation, how should that be done? Technically, it must be anonymized. Operationally, a market entity is needed: that is, **a particular company must be given a legal right to operate the data; in other words, the data must be assigned a market entity.** Once that market entity obtains the data, it has to invest manpower, time, intelligence, and capital to produce data, all of which we can call labor input. Through this labor input, the data and information belonging to individuals is turned into the company's derived data, or secondary data. The secondary data then yields process data, and in turn data products and data services. At that point, the original data, whose owners were individuals, has been transformed into the enterprise's data products and services. This is a process of productization.

**Tencent Technology: Can we understand it this way: Internet companies obtain personal data through authorization, and after they process it, it can become a kind of data asset of the company?**

**Professor Zeng Xueyun:** You can understand it that way. The large amount of data we generate on the Internet as individuals is like the natural resources found in nature. Land, for example, can grow many flowers and trees and yield many resources. Such resources are public resources: they can be developed and utilized, but they cannot be bought or sold directly. What is produced after utilization and processing is an asset of the enterprise. This is allowed, and we should encourage data to develop as a factor of production in this way.

**Tencent Technology: From an individual's point of view, how can we protect our personal data and let it flow only in the ways we want?**

**Professor Zeng Xueyun:** In the era of artificial intelligence, people's privacy is becoming harder and harder to protect. Everything people do is being recorded: changes in location, life, work, diet, and daily routine. Once recorded, information that originally belonged to us can no longer be controlled by the person concerned. The risk of privacy leakage is therefore very high, and the task of data protection is both heavy and difficult.

How do people protect their data rights? Various countries have developed commercial approaches. The first, used in Japan, is the data bank: everyone can store data in a data bank just as they deposit money in a bank. The data bank acts as a custodian of the data. It can also act as an initial developer of the data's value, and individuals can receive a share of the benefits. In other words, people who are willing to disclose their data and allow its use to some extent have a business model that addresses data protection on terms they choose themselves. Building lawful models for data circulation and for data development and utilization is one part of the answer.

**The other part covers people who are unwilling and therefore do not grant authorization.** Where there is no authorization, the state must strengthen data protection. Anyone who wants to develop this part of the data illegally must be punished and placed under legal supervision. Blockchain technology can be used to track such behavior, for example to determine whether our data has been leaked and where it has gone, by following the data flow. Data lineage can also be tracked and analyzed, and data lineage technology now exists. Roughly speaking, **where does the data come from, and where does it go? Data lineage analysis is in effect a kind of data correlation analysis and data traceability.** The word "lineage" vividly describes the ins and outs of data. Everything is being recorded, so the act of recording other people's data can itself be recorded by technology, made public, and seen through.
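As a rough illustration of the lineage idea described above (where data comes from and where it goes), the following Python sketch records derivation steps in a small directed graph and walks it in both directions. The class name `LineageGraph`, its methods, and the dataset names are hypothetical and chosen only for illustration; real lineage tooling is considerably more involved.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: an edge source -> derived means 'derived was produced from source'."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> datasets derived from it
        self._upstream = defaultdict(set)    # derived -> datasets it was built from

    def record(self, source, derived):
        """Record one derivation step."""
        self._downstream[source].add(derived)
        self._upstream[derived].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def origins(self, dataset):
        """Where does the data come from? All upstream ancestors."""
        return self._walk(dataset, self._upstream)

    def destinations(self, dataset):
        """Where does the data go? All downstream descendants."""
        return self._walk(dataset, self._downstream)


# Hypothetical dataset names, for illustration only.
graph = LineageGraph()
graph.record("user_activity_log", "anonymized_log")
graph.record("anonymized_log", "training_corpus")
graph.record("anonymized_log", "usage_report")

print(sorted(graph.origins("training_corpus")))
print(sorted(graph.destinations("user_activity_log")))
```

Walking upstream from a derived dataset answers the "where from" question, and walking downstream from a source answers "where to", which is the kind of query one would run when investigating a suspected leak.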

China's "Civil Code" makes special provisions for the protection of personal information in its chapter on personality rights. Article 127 of the "Civil Code" places data alongside network virtual property, highlighting the property attribute of data. In local legislation, Article 12 of the "Shanghai Municipal Data Regulations" directly reflects a rights allocation model that separates personality rights from property rights. The article stipulates that the city protects, in accordance with the law, the personality rights and interests of natural persons in their personal information, as well as the lawful property rights and interests obtained through relevant data innovation activities in the development of the digital economy.

On August 20, 2021, the 30th meeting of the Standing Committee of the Thirteenth National People's Congress voted to pass the "Personal Information Protection Law of the People's Republic of China", which came into force on November 1, 2021. Details can be found online. The legal nature of personal information under the "Personal Information Protection Law" is likewise the protection of personality rights and interests; the law hardly touches on property rights and interests in personal information.

**Tencent Technology: What kind of high-quality data matters for training large models?**

**Professor Zeng Xueyun:** Data should be the full record of human economic, social, production, management, commercial, and even military activities. Such records are produced in every industry, field, and area. Raw data varies in quality. For example, **the financial statements and financial data of listed companies are high-quality data, and they are structured data.** These statements and this information have been audited by certified public accountants, and the China Securities Regulatory Commission supervises their disclosure, so they are high-quality data. As another example, **the paper data in CNKI is also high-quality data.** Data generated on the Internet, by contrast, is unstructured and non-standardized. It is raw, messy, and unregulated, and it needs fine-grained cleaning before it can be used in computation. **High-quality data therefore usually goes through processing that turns unstructured data into structured data.**

**Tencent Technology: If high-quality data can be produced continuously, why do people say that "high-quality data is almost used up"?**

**Professor Zeng Xueyun:** I think the capacity to produce and process data cannot keep up with the demand for data, and the productivity of the whole supply chain and value chain for data production and processing is still relatively weak. We know that data keeps exploding, yet high-quality data is running out. That simply means that in getting from data to high-quality data, we lack productive capacity and the ability to integrate. This is where data providers are needed. Many of today's data providers only make direct use of data; when it comes to producing and processing data, and to producing high-quality data, their capabilities and business model designs are still not adequate.

In fact, OpenAI's GPT-4 was trained on a large amount of data produced by the previous-generation model, GPT-3.5. The founder of OpenAI also said in a recent interview: "Synthetic data is an effective way to solve the shortage of large model data. The key is that there is a whole system to distinguish which AI-generated data is usable and which is not, and to keep giving feedback based on the performance of the trained model." This company's strength is not simply that it can raise money and command a great deal of computing power; its product and technology capability in data is also one of its core competitive strengths.

**Tencent Technology: To raise the productivity of high-quality data, what links are essential in industrial design?**

**Professor Zeng Xueyun:** To answer this, we must first understand what data is, what data we have, and what we want to do with it. Having production capacity does not by itself yield high-quality data, and neither does a willingness to produce it. Data has to be understood from the source. Which problems in society should be solved with data? Where is the market's demand for data? And between the raw data and the demand side, how should production in the middle be organized? This series of questions calls for industrial design, and overall thinking on it is currently insufficient.

**Tencent Technology: The immaturity of the industry is one aspect. Does it also mean that the industry is still a blue ocean?**

**Professor Zeng Xueyun:** A very early-stage blue ocean. In the early days there were cases of data being traded directly and illegally. Later, national legislation prohibited buying and selling data itself, so raw data can no longer be traded. What is traded should be the result of one's own production investment. Saying "I have some data, so I will sell it directly" is not allowed.

In December 2022, the "Twenty Articles of Data" was promulgated. It proposed separating the rights attached to data into ownership, management rights, and beneficial rights, and called for data to be managed by category and level. This is the top-level design of data governance, an overall blueprint, and it can also be called the starting point for the standardized development of the future data industry. People now realize that data is not a single undivided whole and that they need to understand what rights and interests attach to it. This also marks a shift from research grounded in law to research grounded in economics. **To establish a data market, we must recognize that a market is economic behavior, and such behavior calls for many economic tools and theories. So today, from research in data science and state governance of data to academic research on data and industry's control and use of data, everything is a blue ocean, still at the very beginning.**

**Tencent Technology: From this point of view, data can exist as an asset of an enterprise. What kind of asset is it?**

**Professor Zeng Xueyun:** Data classification is a very hot topic in academia. Most people regard data as something that cannot be seen or touched and therefore call it an intangible asset. But under the ITU's classification, data is closer to an inventory asset, because data also goes through production and processing. Data itself is also an electronic tangible asset. Why? Data occupies physical space, and much data has a physical form, namely its form on the network: a picture you can see, a sound you can hear, a portrait you can look at. So **data is a digital tangible asset.**

We know that data assets are a very special asset class. Some suggest that data could be treated like intangible assets and amortized, or like fixed assets and depreciated. In fact, data must first be classified by category and level to see where it belongs. Certain types of data can also grow and be fused. For example, if all of China Unicom's call data could be integrated with individuals' bank deposit and investment data, a profile of a person could be generated that carries more information, from investment and financing to communications and career. The fusion of data with data then produces a cumulative effect on data value; this is what it means for data to be fusible and able to grow. Other data is indeed time-sensitive, and its value decays over time. So we need to analyze the characteristics of the data itself more specifically to determine its accounting value. Accounting for data value involves more variability and uncertainty than accounting for fixed assets: the value of a fixed asset is certain when the asset is formed and gradually decreases over time, whereas the value of data does not necessarily decrease with time. Data has a more complex asset form.
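To make the contrast above concrete, here is a minimal Python sketch that compares a fixed asset under straight-line depreciation with a purely hypothetical data asset whose value decays each year but jumps when it is fused with another dataset. The function names, the decay rate, and the fusion gain are illustrative assumptions, not an accounting standard or a method proposed in the interview.

```python
def straight_line_value(cost, useful_life_years, year):
    """Book value of a fixed asset under straight-line depreciation (no salvage value)."""
    remaining = max(useful_life_years - year, 0)
    return cost * remaining / useful_life_years


def data_value(base_value, year, annual_decay, fusion_gain_by_year=None):
    """Hypothetical data value: decays every year, plus value added in years when
    the data is fused with other data. Illustrative assumption only."""
    fusion_gain_by_year = fusion_gain_by_year or {}
    value = base_value
    for y in range(1, year + 1):
        value *= (1 - annual_decay)               # time-sensitive decay
        value += fusion_gain_by_year.get(y, 0.0)  # gain from fusion in year y, if any
    return value


# Assumed figures: both assets start at 100; the data is fused with another dataset in year 2.
for year in range(4):
    fixed = straight_line_value(100, 5, year)
    data = data_value(100, year, annual_decay=0.5, fusion_gain_by_year={2: 80})
    print(year, fixed, data)
```

Under these assumed numbers the fixed asset's book value falls by the same amount every year, while the data asset's value first drops, then rises above its starting point after the fusion, then decays again. That non-monotonic path is the kind of variability the passage describes.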

**Tencent Technology: Will data be one of the core competitive strengths of AI companies in the future? Can data assets be quantified and reflected in a company's valuation?**

**Professor Zeng Xueyun:** For an artificial intelligence company, **data is its core competitive strength.** The product experience determines the company's business value, and its data capabilities determine the product experience. For a country, data is key to future competitiveness and is the gold of the future: just as oil was the gold of the industrial age, **data is the gold of the Internet economy era.**

At present, however, countries around the world are running into difficulties in data governance, and none has yet taken the lead in achieving a breakthrough on how to balance data security, data governance, and data development and utilization.

In this regard, China has been keenly aware of the importance of data. Every country also recognizes that data is a new productive force, but using data requires market players, intelligent technology, and national regulation. It is therefore not a problem with a simple solution; it is a problem of systemic complexity.

China's national governance is relatively centralized, running from the central government down to local governments, so we have a natural advantage in integrating big data across the country. That advantage has not yet shown itself, because **there are problems with the valuation and pricing of data, and the problem of entering data into accounting statements has not been resolved.** No country in the world has a good solution to this problem.

**If data can be moved from off-balance-sheet assets to on-balance-sheet assets, the value accounting of data governance and the management of data value can be handled well, and data transactions will have an objective basis.** At present, corporate data is basically an off-balance-sheet asset: it is not valued, and it is not measured or reported on the balance sheet. As a result, it is unclear how much data a company holds, and it is hard to compile statistics on the economic value of data. If data is not on the books, transactions in it lack a reasonable basis, **so bringing data onto the balance sheet is a key issue.** Counting data volume, accounting for data prices, and pricing data transactions all require data to enter the balance sheet and the income statement. Recognition in the financial statements is underlying infrastructure, and this infrastructure is not yet in place.

**Tencent Technology: What international precedents are there for data property rights legislation?**

**Professor Zeng Xueyun:** On data property rights legislation: major countries around the world now have basic laws on data protection, and these increasingly focus on protecting the personality rights aspect of data property rights. Laws and regulations on data utilization, however, are basically missing. Japan is somewhat ahead in this respect. China places considerable emphasis on promoting the circulation of data elements, but it lacks the support, regulation, and guidance of laws and regulations and relies mainly on administrative documents, so many legislative gaps remain. There is an urgent need to speed up rules on data property rights and the circulation of data elements, and to lead innovatively the new direction of global legal development in this area. The situation at home and abroad is as follows:

International aspects: The General Data Protection Regulation (GDPR), adopted by the European Union in 2016, is currently the most comprehensive and influential data privacy law. The Regulation works in two directions: strengthening the rights of data subjects so that they keep control over the use of their personal data, and balancing data security with the free flow of data. Building on and improving individuals' existing rights, the GDPR provides for the right to erasure (Article 17) and the right to data portability (Article 20), among others, so that data subjects can control their personal data more effectively. However, it does not clarify the transfer of ownership of personal data or the allocation of property rights.

Although the United States began exploring the system and theory of legal protection for data ownership relatively early, most of the relevant rules are scattered across various bills. State laws are not mutually compatible, but they cover a wide range of areas and allow some flexibility in resolving actual disputes, which encourages data utilization. For example, the "California Consumer Privacy Act of 2018" and the "California Privacy Rights Act of 2020" have added to the definition of data rights, covering consumers' personal privacy rights such as the right to access, the right to delete, and the right to know, and strengthening the protection of data subjects' rights and interests during data transfers. Indirectly, this also reflects that the United States permits the economic value of data to be used. In 2017, Japan formulated the "Guidelines for Data Usage Rights Contracts". The guidelines take full account of factors such as the contracting parties' contribution to data creation and the cost burden of storage and management, and they standardize data transaction contracts to promote data transactions. This is major progress, but there is still no clear definition of data property rights.

In Europe, the EU Charter of Fundamental Rights and the General Data Protection Regulation treat the right to the protection of personal data as a special right enjoyed by data subjects, one that does not include any property rights. Although EU laws such as the General Data Protection Regulation do not expressly give data controllers property rights with data as their object, controllers' property interests in data can be protected through database protection, copyright law, trade secret protection, contract law, competition law, and similar means. In addition, the document "Building a European Data Economy", issued by the European Commission, seeks to introduce "data producer rights", which would give data controllers general property rights over non-personal data and anonymized personal data, enabling them to use such data exclusively, including the right to license others to use it. In the United States, although some legal scholars believe that individuals should be given property rights in personal information, courts usually do not recognize such rights. In some cases, U.S. courts have held that companies have property rights in the data they hold. Legal experience with data property at home and abroad shows that separating personality rights from property rights should become the core theoretical proposition for building China's data property rights system.
