Movie Dataset: the 23 Best Data Sets Related to Cinema and TV

Data on Entertainment

Hollywood may seem glittery from the outside, but on the inside it’s an industry like any other. And the currency of modern business is, more often than not, data.

With this post I’ve tried to sum up the main sources of publicly accessible data on entertainment. I’ve favored sources that are also open, meaning you can use the data for your own purposes under a stated license.

Why so few open data sources?

Of the data that allows the movie and TV industries to function, very little is open. In the table below I’ve selected all the sources that feature more than 1,000 items. Some even have a proper open data license. Most of them don’t.

If you see any source missing please let me know in the comments below and I will add it.

Open Data Entertainment Sources

| Name | Description | Format | License | Notes | Last Update |
| --- | --- | --- | --- | --- | --- |
| IMDb datasets | The trove of troves of entertainment data | TSV | Non-Commercial Licensing | Data is refreshed daily | Current |
| Wikidata | Wikidata is a free and open knowledge base that also contains structured info around movies and TV. | SPARQL | Creative Commons CC0 License | DB is kept up to date with most current movies and TV shows. For example, with this query you can pull up all movies released in 2017 | Current |
| The Movie DB API | The Movie Database (TMDb) is a community-built movie and TV database. | JSON | Must properly attribute TMDb as the source | DB is kept up to date with most current movies and TV shows | Current |
| OMDB API | Crowdsourced movie information | JSON | CC BY-NC 4.0 | DB is kept up to date with most current movies | |
| LinkedMDB | Open semantic web database for movies, including a large number of interlinks to several datasets | SPARQL | Uncertain | DB seems to contain mostly historical movies | |
| The TV DB | An open database for television fans | XML API | Creative Commons Attribution-NonCommercial 4.0 International License | | Current |
| Movielens datasets | Movie rating data sets from the MovieLens web site. 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users | CSV | University of Minnesota retains certain rights. No commercial use without permission | Movielens is kept up to date, but datasets are updated periodically | 8/2017 |
| Cornell – Movie Review Data | Sentiment analysis of movie reviews | TXT | Uncertain | | 2004 |
| UC Irvine – Movie Data Set | A dataset of 10,000 movies with info on actors, casts, directors, producers, studios. | HTML | Copyright held by Gio Wiederhold, 1990-1999. This data may not be used for commercial resale. | Unfriendly format | 1999 |
| WikiPlots | Rather than a dataset, a Python tool that allows you to create an up-to-date dataset of movie and book plots from Wikipedia. | Python | Uncertain but seems open | Better than a dataset | Current |
| Cornell – Movie Dialogs Corpus | A large metadata-rich collection of fictional conversations extracted from 617 raw movie scripts | TXT | Uncertain | | 2011 |
| TMDB 5000 Movie Dataset | Metadata on around 5,000 movies from TMDb | JSON | Uncertain | | October 2017 |
| The Movies Dataset | Metadata on over 45,000 movies. 26 million ratings from over 270,000 users derived from different sources. | CSV | Inherits multiple licenses: IMDb, TMDb, Movielens | | December 2017 |
| 32K movies with subtitles and metadata | Useful for linguistic analysis | XLSX | Uncertain | | July 2017 |
| French National Cinema Center datasets | Various datasets related to French cinema including box office data | API/CSV | Open License | | |
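
The Wikidata entry above mentions a query that pulls up all movies released in 2017. The original query is not reproduced in this post, so what follows is only a minimal sketch of what such a query might look like when sent from Python. It assumes the public endpoint at https://query.wikidata.org/sparql, the Wikidata identifiers Q11424 (film), P31 (instance of) and P577 (publication date), and a placeholder User-Agent string that you should replace with your own. The function names are mine, not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumed to be the one the table refers to).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_film_query(year: int, limit: int = 100) -> str:
    """Return a SPARQL query for films with a publication date in `year`."""
    return f"""
SELECT DISTINCT ?film ?filmLabel WHERE {{
  ?film wdt:P31 wd:Q11424 ;   # instance of: film
        wdt:P577 ?date .      # publication date
  FILTER(YEAR(?date) = {int(year)})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def extract_titles(response: dict) -> list[str]:
    """Pull the film labels out of a SPARQL JSON results document."""
    rows = response["results"]["bindings"]
    return [row["filmLabel"]["value"] for row in rows if "filmLabel" in row]


def fetch_films(year: int, limit: int = 100) -> list[str]:
    """Run the query against the live endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_film_query(year, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_ENDPOINT}?{params}",
        # Hypothetical User-Agent; replace with your own tool name and contact.
        headers={"User-Agent": "movie-dataset-example/0.1 (hypothetical contact)"},
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return extract_titles(json.load(reply))
```

Calling `fetch_films(2017)` should return up to 100 film labels, though the exact results depend on the current state of Wikidata. Query building and response parsing are kept separate from the network call so that both can be checked offline.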

While looking for public data sets I’ve found countless closed services. Some of them are worth mentioning here.

Recommendation engines

Good Films is a data service and a social network based on a movie recommendation engine. It aims to help you discover movies worth watching by searching the catalogs available on specific services like Netflix and iTunes.

Movielens is a closed-source movie recommendation engine based on current data. The aim of this site is to suggest movies you will want to watch. It does not provide links to platforms where you can actually watch the newfound movies or buy tickets. The site is a research project run by GroupLens Research at the University of Minnesota.

Box office data

Box Office Mojo is the go-to site for looking up movie performance since the dawn of time. It’s now part of the Amazon/IMDb family.

The Numbers: “Where Data and the Movie Business Meet” is how the site defines itself. It features closed-source box office data, a movie star bankability index, and analysis of the theatrical market, as well as home video data.

JP Box Office is similar to Box Office Mojo, but made in France. This website aggregates a ton of business data around movie releases, covering the US, China, France, Germany and Italy.

Oddball collections

In the midst of my research I’ve also found data sources that have no immediate business use, but that are nonetheless worth mentioning.

Cornell Movie-Dialogs Corpus is a large metadata-rich collection of fictional conversations extracted from raw movie scripts. Historical and not updated.
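
If you want to work with the Cornell corpus, here is a rough sketch of how its line records might be parsed in Python. It rests on a few assumptions that are not stated in this post and that you should check against the README shipped with the corpus: that fields are separated by the string " +++$+++ ", that each line record has five fields (line ID, character ID, movie ID, character name, text), and that the files are encoded as ISO-8859-1. The names `MovieLine`, `parse_movie_line` and `load_movie_lines` are hypothetical helpers made up for this example.

```python
from typing import NamedTuple

# Field separator assumed from the corpus README; verify before relying on it.
SEPARATOR = " +++$+++ "


class MovieLine(NamedTuple):
    line_id: str
    character_id: str
    movie_id: str
    character_name: str
    text: str


def parse_movie_line(raw: str) -> MovieLine:
    """Split one raw record into its five assumed fields."""
    # maxsplit=4 keeps any separator-like text inside the utterance intact.
    parts = raw.rstrip("\n").split(SEPARATOR, 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {raw!r}")
    return MovieLine(*parts)


def load_movie_lines(path: str) -> list[MovieLine]:
    """Read every non-blank record from a corpus file at `path`."""
    # The encoding is an assumption; the corpus is not known to be UTF-8.
    with open(path, encoding="iso-8859-1") as handle:
        return [parse_movie_line(line) for line in handle if line.strip()]
```

With a made-up record such as `"X1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there."`, `parse_movie_line` returns a `MovieLine` whose `character_name` is `"ALICE"` and whose `text` is `"Hello there."`. Records that don't match the assumed layout raise a `ValueError` rather than being silently skipped, so you notice early if the format differs.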

The Harry Ransom Center’s Movie Posters Collection is a fun find and totally free to browse and enjoy on the web.

Scripts from all Seinfeld episodes. That’s an odd and beautiful dataset.

If there’s anything I’ve missed please let me know in the comments. And if you liked what you read here, please subscribe to my almost monthly newsletter where I tackle marketing, productivity, entertainment and innovation.
