Crystal Lee (she/她)

Ethan Zuckerman and the "quotidian web"

I've always been a big fan of live-blogging (see, for example, my livetweets of protests at MIT as a grad student), but all of that really comes from a huge intellectual hero of mine, Ethan Zuckerman. (I fondly remember and cherish his 2007 blog post, for example, where he details just how he live-blogs -- a kind of "what's in my bag" video, but for both equipment and process.) I admire Ethan for so many reasons, but one of them is the way that he's able to be so generous in both how he learns and thinks. I've tried to do something similar here on this blog by sharing snippets of talks I've been to -- but today I'll try to do something really daunting to me: to liveblog an OG liveblogger.


Ethan Zuckerman is an associate professor of public policy at the University of Massachusetts-Amherst, and he's here today to talk about the quotidian internet -- what we might learn from mundane, rather than extraordinary, phenomena.

He opens by telling a story about the Goldberger department store in downtown Budapest: it hasn't been a department store for many decades, and is in fact a large archive of materials maintained by the Open Society Foundations to document late-stage communism and its downfall. Throughout its collections, you'll find everyday photographs and postcards taken from about 1983-1988, during the very end of socialist Hungary, right before the Berlin Wall came down. This wealth of photos is really difficult for archivists to put in contemporary context: they're not brilliant photos or postcards, but images of workers goofing off, or of the weird kitsch of 1980s Hungarian interior decor. Many of the photos, in fact, are very lousy -- they're artifacts of a printing error by a film producer. These photos were not intended for public consumption -- and precisely because of that, they provide a very valuable mundane, daily view of life in Budapest.

Archivists don't really have any ownership of these photos, as there are big questions of personal data and privacy associated with them. This is also a notoriously difficult archive to work with, as there's no metadata associated with any of the photos (and often, very little context). However, what looks like a "junk / found amateur photo collection" to some is actually an extraordinary window into a moment in Hungary's history -- an archive like this can be incredibly valuable just for capturing a unique snapshot in time. No one is really making editorial decisions about what not to keep -- you're just capturing a bunch of data.

We can see a similar example in Pompeii -- where the first scholars went looking for treasure, later scholars started looking at graffiti. This graffiti is really valuable because we start to get a picture of what Latin sounded like as it was spoken, rather than how it was handed down from on high in writing. We don't get the vernacular when we just read scholarly texts in Latin; we want to know how people actually spoke. That's what we get out of something like graffiti: the everyday, rather than the extraordinary.

What can scholars learn about the quotidian? What if we could do just that, but for the internet?

In his research now, his team has been able to estimate just how large YouTube actually is: about 14 billion videos. They've done that by essentially using the "drunk dialing" method: searching for videos by randomly guessing the digits and letters that make up a unique YouTube video identifier. In other words, his team estimates the size of the YouTube corpus by "dialing for videos" -- that is, by calculating what percentage of the numbers you dial at random actually pick up. Once you have that batch of hits, you can then look at the metadata of the videos that exist. You can find out more in the Journal of Quantitative Description paper.
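To make the "dialing for videos" idea concrete for myself, here is a minimal sketch of the estimate in Python. It is not the team's actual code: the ID space, the number of videos, and the exists lookup are all made up for a simulated platform, and the real identifier space is vastly larger. The only idea it borrows from the talk is that the fraction of random guesses that "pick up," multiplied by the size of the ID space, estimates how many videos exist.

```python
import random


def estimate_population(id_space_size, exists, num_guesses, rng):
    """Guess IDs uniformly at random and scale the hit rate up to the full ID space.

    `exists` is a hypothetical lookup that says whether an ID is taken;
    on a real platform this would be some kind of request, not a set lookup.
    """
    hits = sum(1 for _ in range(num_guesses) if exists(rng.randrange(id_space_size)))
    hit_rate = hits / num_guesses
    return hits, hit_rate * id_space_size


# Simulated platform (all numbers are invented for illustration).
rng = random.Random(0)
ID_SPACE = 1_000_000   # how many identifiers are possible
TRUE_COUNT = 20_000    # how many videos "really" exist in the simulation
existing = set(rng.sample(range(ID_SPACE), TRUE_COUNT))

hits, estimate = estimate_population(ID_SPACE, existing.__contains__, 200_000, rng)
print(f"{hits} hits, estimated {estimate:,.0f} videos (true count {TRUE_COUNT:,})")
```

The guesses that do land are also a uniform random sample of everything that exists, which is why the hits can then be used to study metadata for the platform as a whole.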

What is the typical view count for a YouTube video? It's easy to think of YouTube as MrBeast and the realm of influencers, but in reality, there's a peak somewhere around 28, the mean is 42, and the relative frequency is a Pareto distribution. You can compare four types of videos: Rick Astley's "Never Gonna Give You Up," which is at the very tippy-top of views and likes; Ethan's unsuccessful TED talk (where he wanted to go viral, but was instead a failed influencer), which is still in the 98th percentile of all videos; an Amherst regional school committee meeting (120 views, no likes); and a random video of "falling snow #4," for example, which has a median of 2 likes and 39 views. If the Amherst regional school committee meeting became viral, then something has gone badly wrong! But of these four types of videos, he's really interested in the ones that have few likes or views -- to understand what the vast majority of the internet really looks like. In creating this corpus of quotidian videos, his team has been able to do language detection using OpenAI Whisper, and then subdivide the data by language to understand the growth of non-English language content on YouTube. A great deal of this research can be found on TubeStats, which provides users with up-to-date information on YouTube based on random sampling. However, this data is mostly historical, and we can see how different parts of YouTube have changed over time.

His team is now developing a TubeStats for TikTok (TokStats), which is an easier sampling problem (they have discovered how to get every video, but don't yet have the capacity to fully do it). He thinks that TikTok is HUGE -- where they estimate YouTube to be at 14 billion videos, TikTok might be adding 2-4 billion videos per month. This is especially important because TikTok is a network of the global majority, particularly in countries like Pakistan, Bangladesh, and Indonesia. How much content does TikTok actually take down vs. keep up? These are all broader questions that still have to be answered.

So why study quotidian YouTube? This approach gives us the creator point of view rather than the audience or influencer POV. This also helps us solve the "denominator problem" -- it gives us context for other videos. (We can say that a piece of misinformation has a billion impressions, but a billion impressions out of how many?) This also allows us to do studies of cultural evolution and comparison, and metadata lets us know "where to dig." There are also broader questions about archiving.

There are a few new research directions that this work points to -- how are different groups using online video differently? A postdoc in Ethan's lab, Jane Pyo, is studying political content on Korean YouTube. The lab is also comparing language from influencers (Filmot) to random users by studying differences in language, imagery, and production. This is especially interesting when you think of YouTube as a social network, not as a broadcast service -- there are lots of short videos from India, where it seems like people are using it as a direct result of the TikTok ban or because of the way that WhatsApp is set up on their phones.

Folks likely know about the problem with WEIRD research -- it's hard to generalize from behavioral psychology research across populations. Are we introducing another bias -- prominence or influencer bias -- from the way that we currently do social media and internet research?

This is all especially important now as it is increasingly difficult to do social media research. Companies like Meta, X, etc. have made it legally and technically quite difficult to do this research through things like lawsuits and API restrictions. There is a real danger in losing a significant toolbox for investigating how social media works just as the U.S. enters the presidential election, for example -- it's almost impossible for researchers to gain permissioned access to data like the corpus that Meta released to researchers years ago. As he's argued elsewhere, the internet is becoming increasingly unknowable. He ends with an invitation: there is still a lot of work to be done, and he and his team do not have all of the skills (linguistic and technological) to pull it off. Understanding the wider landscape of the internet -- the mundane and quotidian -- requires a large collaborative approach.