init repo
Questions/Q3/aclImdb/README
							| @ -0,0 +1,92 @@ | ||||
| Large Movie Review Dataset v1.0 | ||||
|  | ||||
| Overview | ||||
|  | ||||
| This dataset contains movie reviews along with their associated binary | ||||
| sentiment polarity labels. It is intended to serve as a benchmark for | ||||
| sentiment classification. This document outlines how the dataset was | ||||
| gathered, and how to use the files provided.  | ||||
|  | ||||
| Dataset  | ||||
|  | ||||
| The core dataset contains 50,000 reviews split evenly into 25k train | ||||
| and 25k test sets. The overall distribution of labels is balanced (25k | ||||
| pos and 25k neg). We also include an additional 50,000 unlabeled | ||||
| documents for unsupervised learning.  | ||||
|  | ||||
| In the entire collection, no more than 30 reviews are allowed for any | ||||
| given movie because reviews for the same movie tend to have correlated | ||||
| ratings. Further, the train and test sets contain disjoint sets of | ||||
| movies, so no significant performance gain is obtained by memorizing | ||||
| movie-unique terms and their association with observed labels. In the | ||||
| labeled train/test sets, a negative review has a score <= 4 out of 10, | ||||
| and a positive review has a score >= 7 out of 10. Thus reviews with | ||||
| more neutral ratings are not included in the train/test sets. In the | ||||
| unsupervised set, reviews of any rating are included, and there are | ||||
| equal numbers of reviews with ratings > 5 and <= 5. | ||||
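
The rating thresholds above can be written as a small helper. This is
only a sketch: the function name label_from_rating is hypothetical and
not part of the dataset, and it assumes ratings are whole stars from 1
to 10.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 star rating to the binary label used in train/test.

    Returns "neg" for scores <= 4, "pos" for scores >= 7, and None for
    the more neutral scores (5 and 6) that the labeled sets leave out.
    """
    if not 1 <= rating <= 10:
        # A rating of 0 is a placeholder, not a real score.
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return None
```

A rating of 0 raises an error on purpose, so that a placeholder rating
is never mistaken for a negative review.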
|  | ||||
| Files | ||||
|  | ||||
| There are two top-level directories [train/, test/] corresponding to | ||||
| the training and test sets. Each contains [pos/, neg/] directories for | ||||
| the reviews with binary labels positive and negative. Within these | ||||
| directories, reviews are stored in text files named following the | ||||
| convention [[id]_[rating].txt] where [id] is a unique id and [rating] is | ||||
| the star rating for that review on a 1-10 scale. For example, the file | ||||
| [test/pos/200_8.txt] is the text for a positive-labeled test set | ||||
| example with unique id 200 and star rating 8/10 from IMDb. Files in | ||||
| the [train/unsup/] directory have 0 as their rating because ratings | ||||
| are omitted for this portion of the dataset. | ||||
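
A minimal sketch of reading reviews using the naming convention above.
The names Review, parse_review_name and iter_reviews are hypothetical
helpers, not part of the dataset, and UTF-8 is an assumed encoding since
this README does not state one.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple, Tuple

# Matches names such as "200_8.txt", i.e. [id]_[rating].txt
_REVIEW_NAME = re.compile(r"^(\d+)_(\d+)\.txt$")


class Review(NamedTuple):
    review_id: int
    rating: int  # star rating on a 1-10 scale; 0 in train/unsup/
    label: str   # name of the containing directory, e.g. "pos"
    text: str


def parse_review_name(filename: str) -> Tuple[int, int]:
    """Split "[id]_[rating].txt" into (id, rating)."""
    match = _REVIEW_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected review file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def iter_reviews(split_dir: Path, label: str) -> Iterator[Review]:
    """Yield every review stored under split_dir/label/."""
    for path in sorted((split_dir / label).glob("*.txt")):
        review_id, rating = parse_review_name(path.name)
        # Assumption: files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        yield Review(review_id, rating, label, text)
```

For example, iter_reviews(Path("aclImdb/train"), "neg") would walk the
negative training reviews, assuming the archive was unpacked into a
directory named aclImdb.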
|  | ||||
| We also include the IMDb URLs for each review in a separate | ||||
| [urls_[pos, neg, unsup].txt] file. A review with unique id 200 will | ||||
| have its URL on line 200 of this file. Because IMDb is ever-changing, | ||||
| we are unable to link directly to the review, but only to the movie's | ||||
| review page. | ||||
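
A sketch of the id-to-URL lookup described above. The helper name
url_for_review is hypothetical. This README does not say whether line
numbers are counted from 0 or from 1, so the counting base is an
explicit parameter; the default of 0 is an assumption that should be
checked against the files.

```python
from typing import Sequence


def url_for_review(url_lines: Sequence[str], review_id: int,
                   first_line_number: int = 0) -> str:
    """Return the URL stored on line `review_id` of a urls_*.txt file.

    `url_lines` holds the lines of the file in order.
    `first_line_number` is the number given to the first line of the
    file (0 or 1). Assumption: defaults to 0.
    """
    index = review_id - first_line_number
    if not 0 <= index < len(url_lines):
        raise IndexError(f"no URL line for review id {review_id}")
    return url_lines[index].strip()
```

The lines can be obtained with something like
open("train/urls_pos.txt").read().splitlines().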
|  | ||||
| In addition to the review text files, we include already-tokenized bag | ||||
| of words (BoW) features that were used in our experiments. These  | ||||
| are stored in .feat files in the train/test directories. Each .feat | ||||
| file is in LIBSVM format, an ASCII sparse-vector format for labeled | ||||
| data. The feature indices in these files start from 0, and the text | ||||
| token corresponding to a feature index is found in [imdb.vocab]. So a | ||||
| line with 0:7 in a .feat file means the first word in [imdb.vocab] | ||||
| (the) appears 7 times in that review. | ||||
|  | ||||
| LIBSVM page for details on .feat file format: | ||||
| http://www.csie.ntu.edu.tw/~cjlin/libsvm/ | ||||
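
A minimal sketch of reading one line of a .feat file without any
external library. The names parse_feat_line and counts_to_tokens are
hypothetical. The sketch assumes each line has the usual LIBSVM shape
"label index:value index:value ..." with integer counts as values; the
label is kept as a raw string because this README does not say what it
encodes.

```python
from typing import Dict, Sequence, Tuple


def parse_feat_line(line: str) -> Tuple[str, Dict[int, int]]:
    """Parse one LIBSVM-style line into (label, {feature_index: count})."""
    parts = line.split()
    if not parts:
        raise ValueError("empty .feat line")
    label = parts[0]
    counts: Dict[int, int] = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        counts[int(index)] = int(value)
    return label, counts


def counts_to_tokens(counts: Dict[int, int],
                     vocab: Sequence[str]) -> Dict[str, int]:
    """Replace 0-based feature indices with tokens from imdb.vocab.

    `vocab` is the list of tokens in file order, one per line.
    """
    return {vocab[index]: count for index, count in counts.items()}
```

With the vocabulary loaded as a list, an entry 0:7 becomes a count of 7
for the first token in [imdb.vocab].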
|  | ||||
| We also include [imdbEr.txt] which contains the expected rating for | ||||
| each token in [imdb.vocab] as computed by (Potts, 2011). The expected | ||||
| rating is a good way to get a sense for the average polarity of a word | ||||
| in the dataset. | ||||
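
A sketch of pairing each token with its expected rating. The helper
name pair_expected_ratings is hypothetical. It assumes [imdbEr.txt]
holds one number per line, in the same order as the tokens in
[imdb.vocab]; this README does not spell out the file layout, so verify
that before relying on it.

```python
from typing import Dict, Iterable


def pair_expected_ratings(vocab_lines: Iterable[str],
                          rating_lines: Iterable[str]) -> Dict[str, float]:
    """Map each token to its expected rating.

    Assumption: line i of imdbEr.txt belongs to the token on line i
    of imdb.vocab.
    """
    tokens = [line.strip() for line in vocab_lines if line.strip()]
    ratings = [float(line) for line in rating_lines if line.strip()]
    if len(tokens) != len(ratings):
        raise ValueError(
            f"{len(tokens)} tokens but {len(ratings)} expected ratings")
    return dict(zip(tokens, ratings))
```

The length check guards against the two files drifting out of
alignment, which would silently attach ratings to the wrong tokens.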
|  | ||||
| Citing the dataset | ||||
|  | ||||
| When using this dataset please cite our ACL 2011 paper which | ||||
| introduces it. This paper also contains classification results which | ||||
| you may want to compare against. | ||||
|  | ||||
|  | ||||
| @InProceedings{maas-EtAl:2011:ACL-HLT2011, | ||||
|   author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher}, | ||||
|   title     = {Learning Word Vectors for Sentiment Analysis}, | ||||
|   booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, | ||||
|   month     = {June}, | ||||
|   year      = {2011}, | ||||
|   address   = {Portland, Oregon, USA}, | ||||
|   publisher = {Association for Computational Linguistics}, | ||||
|   pages     = {142--150}, | ||||
|   url       = {http://www.aclweb.org/anthology/P11-1015} | ||||
| } | ||||
|  | ||||
| References | ||||
|  | ||||
| Potts, Christopher. 2011. On the negativity of negation. In Nan Li and | ||||
| David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, | ||||
| 636-659. | ||||
|  | ||||
| Contact | ||||
|  | ||||
| For questions/comments/corrections please contact Andrew Maas | ||||
| amaas@cs.stanford.edu | ||||