
YouTube Restricted Mode Data Scraping

watcherintheweeds, Apr 3, 2017, 9:59:43 PM

TL;DR posed an interesting question about how YouTube labels videos as restricted. Specifically, he proposed that YouTube uses the title, tags, description, and caption text to classify videos. To test this hypothesis we can use the YouTube Data API to scrape information from the videos. So far I have found an easy way to grab the title, tag, and description data. Unfortunately, access to captions appears to be limited: the API indicates that only the owner of a video can automatically download its caption text unless the owner has enabled third-party contributions. As such, I cannot easily get the caption data.
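For reference, this is roughly what pulling those three fields looks like with the Java Data API client. It is only a sketch, and it assumes a youtube client that has already been authorized (the api-samples project provides a helper for that):

import com.google.api.services.youtube.YouTube;
import com.google.api.services.youtube.model.Video;
import com.google.api.services.youtube.model.VideoListResponse;
import java.io.IOException;
import java.util.List;

public class SnippetFetcher {
    // youtube: an already-authorized Data API client
    public static void printSnippet(YouTube youtube, String videoId) throws IOException {
        VideoListResponse response = youtube.videos()
                .list("snippet")        // the snippet part carries title, tags and description
                .setId(videoId)
                .execute();
        List<Video> items = response.getItems();
        if (items.isEmpty()) return;    // unknown or deleted video id
        Video video = items.get(0);
        System.out.println("Title:       " + video.getSnippet().getTitle());
        System.out.println("Tags:        " + video.getSnippet().getTags()); // may be null if the video has no tags
        System.out.println("Description: " + video.getSnippet().getDescription());
    }
}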

Furthermore, I have yet to find a way to automatically and simply scrape which videos are restricted and which are not. Instead, I have chosen to scrape data from channels that have either most videos restricted or most videos unrestricted. The scraper is honestly a piece of junk, as I only spent two hours or so getting the most basic functionality working, and it is basically spliced together from the Data API examples.
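For what it's worth, the general pattern for walking a whole channel is to look up its "uploads" playlist and page through it. This is just a sketch of that pattern, not the actual Scraper.java code, and it again assumes an authorized youtube client:

import com.google.api.services.youtube.YouTube;
import com.google.api.services.youtube.model.Channel;
import com.google.api.services.youtube.model.PlaylistItem;
import com.google.api.services.youtube.model.PlaylistItemListResponse;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChannelVideos {
    // Collect every video id uploaded by a channel by walking its uploads playlist.
    public static List<String> listUploads(YouTube youtube, String channelId) throws IOException {
        Channel channel = youtube.channels()
                .list("contentDetails")
                .setId(channelId)
                .execute()
                .getItems().get(0);
        String uploadsPlaylist = channel.getContentDetails().getRelatedPlaylists().getUploads();

        List<String> videoIds = new ArrayList<>();
        String pageToken = null;
        do {
            PlaylistItemListResponse page = youtube.playlistItems()
                    .list("snippet")
                    .setPlaylistId(uploadsPlaylist)
                    .setMaxResults(50L)      // API maximum per page
                    .setPageToken(pageToken)
                    .execute();
            for (PlaylistItem item : page.getItems()) {
                videoIds.add(item.getSnippet().getResourceId().getVideoId());
            }
            pageToken = page.getNextPageToken();
        } while (pageToken != null);
        return videoIds;
    }
}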

If you want to run the code you can grab it here: https://drive.google.com/file/d/0ByWtPh-r8p9KMWRkb04tc05wWUE/view?usp=sharing and https://drive.google.com/file/d/0ByWtPh-r8p9KdF94d0NBZ2NTeDA/view?usp=sharing. It relies on the example projects provided by YouTube, available here: https://github.com/youtube/api-samples. You can get the Data API set up fairly quickly by following the tutorial here: https://www.youtube.com/watch?v=pb_t5_ShQOM&feature=youtu.be. I recommend the Maven setup as it is considerably simpler.

You will need to set up a dev project here: https://console.developers.google.com/apis/library under your account and give permission for OAuth2. Once you have a project set up, you can get the information needed to create the client-secrets.json file so you can connect to the Data API under your account. Once you have client-secrets.json filled in, you can simply run Scraper.java. It will open a link and request permission for the specific instance. Scraper.java requires two command-line arguments: first the channel ID, and second a string naming the file to output the data into. It does no checking, so if you don't provide the right info it will simply crash on you. Example:

java Scraper UCL_f53ZEJxp8TtlOkHwMV9Q jbp

This will run the Scraper on Jordan Peterson's channel and save the output into jbp.csv. It also saves a serialized DataOfInterest for testing, if you don't want to keep pulling the data down. I do stress that this code is pure junk, and I would not recommend using it as an example for much of anything other than the specifics needed to grab the YouTube data.
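If you want something slightly less fragile than the current behaviour, a hypothetical argument check at the top of main would look something like this (Scraper.java does not actually do this, so wrong arguments just crash it):

public class ScraperArgsCheck {
    public static void main(String[] args) {
        // Fail cleanly instead of crashing when the two required arguments are missing
        if (args.length != 2) {
            System.err.println("Usage: java Scraper <channelId> <outputBaseName>");
            System.exit(1);
        }
        String channelId = args[0];           // e.g. UCL_f53ZEJxp8TtlOkHwMV9Q
        String outputFile = args[1] + ".csv"; // e.g. "jbp" becomes jbp.csv
        System.out.println("Scraping " + channelId + " into " + outputFile);
    }
}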

------------------

More interestingly, here is the data for the Jordan B Peterson, Bearing, TL;DR, and Kraut and Tea channels.

https://docs.google.com/spreadsheets/d/1ywwhnIzEUPdmA8WyRW9ZPpGII2pwovP5fNq6IuFLk0A/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1aqn5P7cDc0H9bhZWo_c0PgVHQenR0fZgRwOKHLGE66E/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1hxNavtXK-8sDIQWIf98WcD5Bj99eMqMlCqS3w5oxs6o/edit?usp=sharing

https://docs.google.com/spreadsheets/d/1pPiUqkaXbYzNspIpaaG5LBcK7n6Z6krU8whilKBiK0I/edit?usp=sharing

I ran this data through some basic Information Retrieval (IR) algorithms: I stripped out all punctuation, turned the documents (videos) into bags of words, and did a basic term frequency-inverse document frequency (tf-idf) analysis to generate a nice big vector representation of each video's text.
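For anyone who wants to redo the preprocessing, this is roughly what that step looks like in Java. It is a sketch of the general tf-idf idea, not the exact code I ran:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // docs: one bag-of-words string per video (title + tags + description)
    public static List<Map<String, Double>> vectorize(List<String> docs) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            // strip punctuation, lowercase, split on whitespace
            for (String term : doc.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").split("\\s+")) {
                if (term.isEmpty()) continue;
                counts.merge(term, 1, Integer::sum);
            }
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum); // number of documents containing each term
            }
            termCounts.add(counts);
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : termCounts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue();
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vec.put(e.getKey(), tf * idf); // weight of this term in this video's vector
            }
            vectors.add(vec);
        }
        return vectors;
    }
}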

I then took that data and ran it through a basic decision tree classifier. The results are here: 

https://drive.google.com/file/d/0ByWtPh-r8p9KbjZYT25tUHNUYUE/view?usp=sharing
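If you want to rerun the classification yourself, one hypothetical way to do it is Weka's J48 decision tree, assuming the tf-idf vectors and restricted/unrestricted labels have been exported to an ARFF file (here called videos.arff). I am not claiming this is the exact setup behind the results above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeExperiment {
    public static void main(String[] args) throws Exception {
        // Load tf-idf features plus the restricted/unrestricted label from an ARFF file
        Instances data = new DataSource("videos.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

        J48 tree = new J48(); // C4.5-style decision tree

        // 10-fold cross-validation so the tree is not just memorizing the training set
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}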

Unfortunately, as you can see from the results, the classification is very bad. The tree is full of special cases for each class, which means the classifier could not find any real pattern for which videos are restricted and which are not. This means one of the following:

1) There is not enough data.

2) YouTube does not use the name, tags, or description for classification.

3) The decision tree is a bad model.

4) I screwed up something.

Honestly, I suspect YouTube uses the caption data for classification. However, I would need help from content creators like TL;DR to pull that data down reasonably.