Internet data collection, privacy, and user tracking have become hot topics, generating plenty of press and concern among consumers, enterprises and regulators alike.
But there’s a glaring hole in the coverage: What’s really going on behind the scenes, beyond the anecdotes of transgressions of technology companies or conflicting signals from industry conferences and government hearings?
Simply put, is the Internet trending towards a Wild West or a civilized society regarding how data collectors, consumers, and digital enterprises interact on websites and mobile apps?
To fill this gap (and celebrate Data Privacy Day), Mezzobit has created the Data Transparency Index: the technology industry’s first scorecard to use comprehensive data to benchmark these important questions on a global basis:
- How much data is being collected?
- How much user tracking is occurring?
- How secure is data collection?
- What else do Internet companies do when their technology is used on websites?
Mezzobit’s business involves monitoring billions of visitor sessions across thousands of websites each month, helping our corporate clients better understand and control data activities to optimize their digital properties. This puts us in a unique position to see what’s happening, something that few understand, judging by our conversations with thousands of corporate executives, regulators, and consumer advocates. Using our proprietary data and algorithms, we created a master composite index that rolls up five underlying scores that represent the current state of Internet data.
We chose the name Data Transparency Index as a contrast to the opaque and fast-moving nature of the Internet. We hope that greater transparency and accountability can lead to a more informed debate on the balance among consumer privacy rights, the need for digital enterprises to control how their customers are treated, and the use of data by Internet companies to power innovation and economic growth.
The details of December’s scores
Description: This is the master index, which is an average of all scores into a single number.
Why it’s important:
- This provides an overall view on how Internet companies interact with consumers on websites. Like a stock market index, the value of the number is less important than where it travels month over month. Increases may come from greater levels of data collection and tracking, but these may be offset by more widespread use of security technology or by website operators ejecting business partners as they prune down the number of tags.
- A higher value in future months means that Internet-wide data collection, tracking and other activities are on the rise.
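
As a rough illustration of how a composite index of this kind can be read, the sketch below averages five component scores and reports the month-over-month change. It assumes a simple unweighted mean; the component names and values are hypothetical placeholders, not published figures.

```python
from statistics import mean

# Hypothetical component scores for one month (illustrative values only).
component_scores = {
    "security": 60.0,
    "data_collection": 40.0,
    "tracking": 50.0,
    "downstream_calls": 30.0,
    "ui_changes": 20.0,
}


def master_index(scores: dict[str, float]) -> float:
    """Return the unweighted average of the component scores (an assumption)."""
    return mean(scores.values())


def month_over_month_change(previous: float, current: float) -> float:
    """Return the percentage change from the previous month's index value."""
    return (current - previous) / previous * 100.0


if __name__ == "__main__":
    this_month = master_index(component_scores)
    print(f"master index: {this_month:.1f}")
    print(f"change vs. 38.0 last month: {month_over_month_change(38.0, this_month):+.1f}%")
```

As with a stock index, the direction of the change is the figure to watch, not the absolute level.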
Description: Whether data is transmitted via secure or insecure protocols, as well as whether the payload is encrypted.
Why it’s important:
- Websites can use two transmission protocols to load content and send data, which you typically see at the beginning of a URL. Hypertext Transfer Protocol (HTTP) sends all data in plain text, and HTTP Secure (HTTPS) uses an encrypted pipeline between the server and the visitor. HTTP is easily monitored, while HTTPS is much more difficult to crack. Websites rely on one of these protocols, and the tags within them also can choose which protocol to use.
- When tags send their data, they also can encrypt the contents in addition to using an encrypted pipeline. This is particularly important when personally identifiable information (PII) such as names, email addresses, and credit card numbers is involved.
- A higher value in future months means that data is more frequently being transmitted using insecure methods.
- The relatively high score of this index is driven by the fact that most websites still use insecure HTTP, although most sites that engage in e-commerce or have a login rely on HTTPS. Currently, 8% of the top million websites use HTTPS, according to our research. Google has begun favoring HTTPS websites in its search rankings as part of the “HTTPS everywhere” campaign to make the web more secure.
- An Internet best practice is for tags to use a secure protocol when operating on sites that use HTTPS, but 7.8% of tags did not follow this guideline. In total, 72.4% of data transmissions were over insecure HTTP.
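
The protocol check described above can be sketched in a few lines. This is a minimal illustration, not the method used for the index: the function names and example URLs are hypothetical, and it only inspects the URL scheme, so it says nothing about whether the payload itself is encrypted.

```python
from urllib.parse import urljoin, urlsplit


def is_secure(url: str) -> bool:
    """Return True when the URL uses the HTTPS scheme."""
    return urlsplit(url).scheme.lower() == "https"


def insecure_tags_on_secure_page(page_url: str, tag_urls: list[str]) -> list[str]:
    """List tags that load over plain HTTP on a page served over HTTPS."""
    if not is_secure(page_url):
        return []
    flagged = []
    for tag_url in tag_urls:
        # Protocol-relative URLs ("//host/path") inherit the page's scheme.
        resolved = urljoin(page_url, tag_url)
        if not is_secure(resolved):
            flagged.append(tag_url)
    return flagged


def insecure_share(urls: list[str]) -> float:
    """Return the percentage of URLs that use an insecure scheme."""
    if not urls:
        return 0.0
    insecure = sum(1 for url in urls if not is_secure(url))
    return insecure / len(urls) * 100


if __name__ == "__main__":
    page = "https://shop.example/checkout"
    tags = [
        "https://a.example/t.js",
        "http://b.example/p.gif",
        "//c.example/w.js",
    ]
    print(insecure_tags_on_secure_page(page, tags))
```

Running it against the three example tags flags only the one that was requested over plain HTTP.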
Description: Whether the tag collects and transmits data to a remote server, as well as the type of data that’s collected and sent.
Why it’s important:
- Even image files, which aren’t programs, can transmit data to remote servers by embedding data elements in their URLs. So even those cat pictures can be watching you.
- A vast majority of data collection targets anonymous information that does not include PII, as handling personal information incurs more legal overhead for Internet companies. However, there is no technical limitation keeping PII from being collected; it’s mainly self-imposed by Internet companies. Sometimes, PII can be accidentally collected when it’s used in page URLs.
- If enough anonymous data is collected, it can make the leap back to PII when combined in the right manner to triangulate an individual. A 2000 study found that using anonymous census data elements of ZIP code, birth date, and gender could positively identify 87% of U.S. residents.
- A higher value in future months means that websites and tag providers are collecting more data on average and collecting more specific data.
- The average tag collected and transmitted 5.2 data elements, comprising 143 bytes of data.
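
To make the image-URL point concrete, the sketch below unpacks the query string of a tracking-pixel request and counts the data elements and bytes it carries. The URL, parameter names, and the email pattern are hypothetical; the PII check is a crude heuristic for illustration, and this is not necessarily how the figures above were measured.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Rough pattern for spotting email-like values (illustrative heuristic only).
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def extract_payload(pixel_url: str) -> dict[str, str]:
    """Return the decoded key/value pairs carried in the URL's query string."""
    query = urlsplit(pixel_url).query
    return dict(parse_qsl(query, keep_blank_values=True))


def payload_stats(pixel_url: str) -> tuple[int, int]:
    """Return (number of data elements, size of the raw query string in bytes)."""
    query = urlsplit(pixel_url).query
    elements = parse_qsl(query, keep_blank_values=True)
    return len(elements), len(query.encode("utf-8"))


def possible_pii_keys(pixel_url: str) -> list[str]:
    """Return parameter names whose values look like email addresses."""
    payload = extract_payload(pixel_url)
    return [key for key, value in payload.items() if EMAIL_PATTERN.fullmatch(value)]


if __name__ == "__main__":
    url = "https://tracker.example/pixel.gif?uid=abc123&page=%2Fcats&res=1920x1080"
    print(extract_payload(url))
    print(payload_stats(url))
```

A request for a one-pixel image is enough to deliver every one of these values to the remote server, which is why the image itself never needs to contain code.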
Description: Whether a tag tracks the user from one page or site to the next, and the type of tracking method used.
Why it’s important:
- When a tag executes in a visitor’s browser, it can deposit a small file on the local computer. This file, called a cookie, often contains a unique ID (but doesn’t contain executable code like a virus). When the visitor browses to another site on the Internet, a tag from that same company can read the cookie and identify that same user. This permits the company to tie together data collected on multiple sites, as well as construct a browsing history for the user.
- Website operators want to know which tags employ cookies, because this may result in Internet companies siphoning away audience value. For instance, publishers may lose out on opportunities to sell advertising if their visitors can be easily identified and reached on other websites, a practice called retargeting.
- While cookies can be detected by consumers and blocked if they wish, there are other tracking technologies that are harder to subvert. One is a technique called browser fingerprinting, which creates a profile of the visitor’s browser and computer on the server, out of reach of both the consumer and website operator. Other techniques include using Flash to store data in places that aren’t easily monitored by the browser.
- A higher value in future months means that more tracking is being done by tags, particularly harder-to-detect tracking such as browser fingerprinting and Flash cookies.
- 77.8% of all tags dropped cookies on visitors.
- 11.5% of all sites analyzed contained tags that engaged in browser fingerprinting.
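
The two tracking mechanisms described above can be simulated in a few lines. This is a simplified sketch under stated assumptions: the `ThirdPartyTracker` class, the cookie name, and the fingerprint attributes are all hypothetical, the cookie jar is modeled as a plain dictionary, and real fingerprinting draws on many more browser signals than shown here.

```python
import hashlib
import uuid
from collections import defaultdict


class ThirdPartyTracker:
    """Models one company's tag running on many unrelated websites."""

    def __init__(self) -> None:
        # Server-side browsing history, keyed by the ID stored in the cookie.
        self.history: dict[str, list[str]] = defaultdict(list)

    def on_tag_load(self, cookie_jar: dict[str, str], site: str) -> str:
        """Read the visitor ID from the cookie, or set one on the first visit."""
        visitor_id = cookie_jar.get("visitor_id")
        if visitor_id is None:
            visitor_id = uuid.uuid4().hex
            cookie_jar["visitor_id"] = visitor_id
        self.history[visitor_id].append(site)
        return visitor_id


def fingerprint(attributes: dict[str, str]) -> str:
    """Derive a stable ID from browser attributes without storing anything locally."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tracker = ThirdPartyTracker()
    browser_cookies: dict[str, str] = {}
    visitor = tracker.on_tag_load(browser_cookies, "news.example")
    tracker.on_tag_load(browser_cookies, "shop.example")
    print(tracker.history[visitor])
    print(fingerprint({"user_agent": "ExampleBrowser/1.0", "screen": "1920x1080"}))
```

Clearing the cookie jar resets the first identifier but not the second, which is why fingerprinting is harder for a consumer to detect or block.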
Description: Whether the tag loads other code into the page from third-party sources. It’s like throwing a party where the guests invite their friends, who invite their friends, and so on.
Why it’s important:
- All of these downstream tags are very difficult to monitor, so the higher the number, the greater the chance that the tags will engage in activities undesirable to the website operator or consumer, such as malware or adware.
- Page load times typically increase as the number of tags rises, leading to a poor user experience.
- A higher value in future months means that the average tag is loading more downstream tags and that there is a greater diversity of companies in those downstream calls (that is, a tag is calling different companies and not just its own technology).
- The average site has 32 tags.
- The average tag loaded 1.3 other tags, with the maximum number called from a single tag being 219.
- The longest tag chain observed (one tag calling another calling another) was 13 generations deep.
- Here is a link to our favorite snapshots of tag storms we saw while wandering the Internet (anonymized to avoid embarrassment).
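
Statistics such as chain depth and the average number of downstream tags can be computed once the calls on a page are recorded as a graph. The sketch below assumes a hypothetical mapping from each tag to the tags it loads; the tag names are made up, and this is an illustration of the idea, not the measurement pipeline behind the numbers above.

```python
def longest_chain(tag: str, calls: dict[str, list[str]], _path: tuple[str, ...] = ()) -> int:
    """Return the number of generations in the longest chain starting at `tag`."""
    if tag in _path:
        # A tag that reappears in its own chain is a cycle; stop counting.
        return 0
    path = _path + (tag,)
    children = calls.get(tag, [])
    return 1 + max((longest_chain(child, calls, path) for child in children), default=0)


def average_fanout(calls: dict[str, list[str]], all_tags: list[str]) -> float:
    """Return the average number of other tags loaded directly by each tag."""
    if not all_tags:
        return 0.0
    return sum(len(calls.get(tag, [])) for tag in all_tags) / len(all_tags)


if __name__ == "__main__":
    # Hypothetical page: an ad tag loads an exchange and a pixel,
    # and the exchange in turn loads a bidder.
    calls = {
        "ad_tag": ["exchange", "pixel"],
        "exchange": ["bidder"],
    }
    tags = ["ad_tag", "exchange", "pixel", "bidder"]
    print(longest_chain("ad_tag", calls))
    print(average_fanout(calls, tags))
```

In this toy page, the website operator placed only `ad_tag`, yet three additional tags end up running in the visitor’s browser.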
Description: Whether the tag makes any visual changes to the website, ranging from inserting small tracking pixels to large images and video players.
Why it’s important:
- Visitors are mainly indifferent to how web pages are constructed, but website operators aren’t. Certain tags are expected to make changes, such as those that serve up ads or widgets. However, many tags should leave the page relatively untouched, such as analytics tags.
- There are some visual calls that can break other parts of the page, which is undesirable for both the website operator and the visitor.
- Excess UI changes also can slow down site performance, causing visitors to abandon pages more quickly. A delay of just one second can result in a 7% reduction in conversions, such as signing up for an email newsletter or checking out your shopping cart.
- A higher value in future months means that the average tag is making a higher quantity of visual changes to the host website as well as more impactful alterations.
- 66.1% of all third-party elements analyzed were images of some sort.
How we did it
Our study specifically looks at code and images embedded in websites, which are called tags, trackers or beacons. Website operators use tags provided by third parties to power common functions, such as advertising, social sharing, and analytics. Here’s a short article with more information about how they work, but once a visitor’s browser loads a tag, the tag has mostly free rein to collect and transmit data, track the user, and change the website’s content, oftentimes without the knowledge of the originating website operator. And tags can breed like rabbits, with one calling another calling another until hundreds may be on a single page.
Using our proprietary data and algorithms, we created a master composite index that rolls up five underlying scores that represent the current state of Internet data. All scores are uncapped: as the Internet changes in future months, they can rise without limit. We also calculate these scores for hundreds of thousands of websites and tags, although their identities are not shared here.
But high scores aren’t necessarily bad, nor are low scores good. It’s like saying 55 mph is too fast or 20 mph is too slow, when each may be just fine for either a highway or school zone. More important is the value that digital enterprises and consumers receive from the Internet by virtue of these activities, as well as how reality differs from their expectations. Consumers are tiring of slow site performance, coupled with uncertain threats to their privacy, so ad blocking is on the rise. Publishers seek wider sources of revenue, so they invite a greater number of data partners onto their websites.
The scores also are relative to each other, with a higher score representing a greater level of a certain activity and a lower score meaning the opposite. We plan to publish updates every month to track how the Internet is changing.
In future months, we plan not only to track the progress of each of these indices, but also to provide deeper insight into behind-the-scenes statistics as well as segmentation for different types of sites and tags. We are also working on new components to the Data Transparency Index that will track other aspects of the relationship among data collectors, enterprises and consumers. If you have any ideas or if there are questions you’d like us to answer with this project, please drop us a line.