Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain disjoint sets of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included and there is an
equal number of reviews with ratings > 5 and <= 5.

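The labeling rule above can be written as a small function. This is a
minimal sketch in Python; the function name and the "pos"/"neg" return
values are illustrative choices, not part of the dataset distribution.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to the binary label used in train/test.

    Returns "neg" for ratings <= 4, "pos" for ratings >= 7, and None for
    the neutral ratings (5 and 6) that the labeled sets leave out.
    """
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None
```

For example, label_from_rating(8) gives "pos", while a rating of 5 or 6
gives None, matching the exclusion of neutral reviews.
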
Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt] where [id] is a unique id and [rating] is
the star rating for that review on a 1-10 scale. For example, the file
[test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.

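As an illustration of the directory layout and file naming convention,
the following Python sketch walks one split and recovers the id and
rating from each file name. The function names are hypothetical, and
UTF-8 is an assumed text encoding that this document does not specify.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_review_filename(name: str) -> Tuple[int, int]:
    """Split a name of the form '[id]_[rating].txt' into (id, rating)."""
    review_id, rating = Path(name).stem.split("_")
    return int(review_id), int(rating)


def iter_reviews(root: str, split: str) -> Iterator[Tuple[int, int, str, str]]:
    """Yield (id, rating, label, text) for each labeled review in a split.

    `root` is the dataset directory and `split` is "train" or "test".
    """
    for label in ("pos", "neg"):
        for path in sorted(Path(root, split, label).glob("*.txt")):
            review_id, rating = parse_review_filename(path.name)
            # Assumption: the review files decode as UTF-8.
            yield review_id, rating, label, path.read_text(encoding="utf-8")
```

With this sketch, parse_review_filename("200_8.txt") returns (200, 8),
the id and star rating of the example file mentioned above.
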
We also include the IMDb URLs for each review in a separate
[urls_[pos, neg, unsup].txt] file. A review with unique id 200 will
have its URL on line 200 of this file. Due to the ever-changing IMDb, we
are unable to link directly to the review, but only to the movie's
review page.

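A URL lookup by review id might look like the Python sketch below. The
function name and the first_line parameter are hypothetical. This
document does not say whether lines are counted from 0 or 1, so the
counting base is left adjustable and should be checked against a review
whose movie is known.

```python
def url_for_review(urls_path: str, review_id: int, first_line: int = 1) -> str:
    """Return the URL stored on line `review_id` of a urls_*.txt file.

    `first_line` is the number given to the first line of the file. The
    default of 1 is an assumption; pass 0 if ids turn out to be offsets.
    """
    with open(urls_path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=first_line):
            if line_number == review_id:
                return line.strip()
    raise KeyError(f"no line {review_id} in {urls_path}")
```
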
In addition to the review text files, we include already-tokenized bag
of words (BoW) features that were used in our experiments. These
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ASCII sparse-vector format for labeled
data. The feature indices in these files start from 0, and the text
token corresponding to a feature index is found in [imdb.vocab]. So a
line with 0:7 in a .feat file means the first word in [imdb.vocab]
(the) appears 7 times in that review.

LIBSVM page for details on .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

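To make the .feat format concrete, here is a minimal Python sketch that
parses one line and maps feature indices back to tokens. The function
names are hypothetical. It assumes [imdb.vocab] holds one token per line
in index order, and it keeps the leading label field as a string because
this document does not say what that field encodes.

```python
from typing import Dict, List, Tuple


def parse_feat_line(line: str) -> Tuple[str, Dict[int, int]]:
    """Parse a LIBSVM line of the form '<label> <index>:<count> ...'."""
    label, *pairs = line.split()
    counts: Dict[int, int] = {}
    for pair in pairs:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    return label, counts


def load_vocab(path: str) -> List[str]:
    """Read the vocabulary. Assumption: one token per line, index 0 first."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def token_counts(counts: Dict[int, int], vocab: List[str]) -> Dict[str, int]:
    """Replace 0-based feature indices with their vocabulary tokens."""
    return {vocab[index]: count for index, count in counts.items()}
```

Given a vocabulary whose first entry is "the", a parsed line containing
0:7 maps to {"the": 7}, which matches the example in the text above.
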
We also include [imdbEr.txt] which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

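One way to pair tokens with their expected ratings is sketched below in
Python. The function name is hypothetical, and the sketch assumes that
[imdbEr.txt] holds one numeric value per line, aligned line by line with
[imdb.vocab]. This document does not describe the file layout, so verify
that assumption before relying on it.

```python
from typing import Dict


def load_expected_ratings(vocab_path: str, er_path: str) -> Dict[str, float]:
    """Map each vocabulary token to its expected rating.

    Assumption: line i of `er_path` is the expected rating for the token
    on line i of `vocab_path`.
    """
    with open(vocab_path, encoding="utf-8") as vocab_file, \
            open(er_path, encoding="utf-8") as er_file:
        return {
            token.rstrip("\n"): float(value)
            for token, value in zip(vocab_file, er_file)
        }
```
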
Citing the dataset

When using this dataset please cite our ACL 2011 paper which
introduces it. This paper also contains classification results which
you may want to compare against.

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu