init repo
92
Questions/Q3/aclImdb/README
Normal file
@@ -0,0 +1,92 @@
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any
given movie, because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is gained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included, and there are an
even number of reviews with scores > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt], where [id] is a unique id and [rating]
is the star rating for that review on a 1-10 scale. For example, the
file [test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.
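The naming convention above can be sketched in Python (a minimal illustration; `parse_review_filename` is a hypothetical helper, not part of the dataset's own tooling):

```python
import os
import re

def parse_review_filename(path):
    """Parse a review file name of the form [id]_[rating].txt into its
    unique id and star rating (rating is 0 for train/unsup/ files)."""
    match = re.fullmatch(r"(\d+)_(\d+)\.txt", os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected review file name: {path}")
    return int(match.group(1)), int(match.group(2))

# A positive-labeled test review with unique id 200 and rating 8/10:
print(parse_review_filename("test/pos/200_8.txt"))  # (200, 8)
```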
We also include the IMDb URLs for each review in separate
[urls_[pos, neg, unsup].txt] files. A review with unique id 200 will
have its URL on line 200 of this file. Due to the ever-changing IMDb,
we are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include the already-tokenized
bag of words (BoW) features that were used in our experiments. These
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data. The feature indices in these files start from 0, and the text
token corresponding to a feature index can be found in [imdb.vocab].
So a line with 0:7 in a .feat file means the first word in
[imdb.vocab] (the) appears 7 times in that review.
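Decoding one such line against the vocabulary might look like this (a hedged sketch; `decode_feat_line` and the three-word toy vocabulary are illustrative stand-ins for the real imdb.vocab, which has one word per line):

```python
def decode_feat_line(line, vocab):
    """Decode one LIBSVM-format line: "<label> <index>:<count> ...",
    where indices are 0-based positions into the vocabulary list."""
    fields = line.split()
    label = int(fields[0])
    counts = {}
    for entry in fields[1:]:
        idx, count = entry.split(":")
        counts[vocab[int(idx)]] = int(count)
    return label, counts

# Toy vocabulary standing in for imdb.vocab.
vocab = ["the", "and", "a"]
label, counts = decode_feat_line("8 0:7 2:1", vocab)
print(label, counts)  # 8 {'the': 7, 'a': 1}
```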
See the LIBSVM page for details on the .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt], which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset, please cite our ACL 2011 paper, which
introduces it. This paper also contains classification results which
you may want to compare against.

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas:
amaas@cs.stanford.edu
89527
Questions/Q3/aclImdb/imdb.vocab
Normal file
File diff suppressed because it is too large
25000
Questions/Q3/aclImdb/test/labeledBow.feat
Normal file
File diff suppressed because one or more lines are too long
25000
Questions/Q3/aclImdb/train/labeledBow.feat
Normal file
File diff suppressed because it is too large
50000
Questions/Q3/aclImdb/train/unsupBow.feat
Normal file
File diff suppressed because one or more lines are too long
BIN
Questions/Q3/aclImdb_v1.tar.gz
Normal file
Binary file not shown.
202
Questions/Q3/conda_env.yml
Normal file
@@ -0,0 +1,202 @@
name: tf
channels:
  - apple
  - conda-forge
dependencies:
  - appnope=0.1.3=pyhd8ed1ab_0
  - argon2-cffi=21.3.0=pyhd8ed1ab_0
  - argon2-cffi-bindings=21.2.0=py39hb18efdd_2
  - asttokens=2.0.5=pyhd8ed1ab_0
  - attrs=21.4.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - beautifulsoup4=4.11.1=pyha770c72_0
  - bleach=5.0.0=pyhd8ed1ab_0
  - brotli=1.0.9=h1c322ee_7
  - brotli-bin=1.0.9=h1c322ee_7
  - c-ares=1.18.1=h3422bc3_0
  - ca-certificates=2022.5.18.1=h4653dfc_0
  - cached-property=1.5.2=hd8ed1ab_1
  - cached_property=1.5.2=pyha770c72_1
  - cffi=1.14.6=py39hda8b47f_0
  - click=8.1.3=py39h2804cbe_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - commonmark=0.9.1=py_0
  - cycler=0.11.0=pyhd8ed1ab_0
  - dataclasses=0.8=pyhc8e2a94_3
  - debugpy=1.6.0=py39h0ef5a74_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.4=pyhd8ed1ab_0
  - executing=0.8.3=pyhd8ed1ab_0
  - flit-core=3.7.1=pyhd8ed1ab_0
  - fonttools=4.33.3=py39h9eb174b_0
  - freetype=2.10.4=h17b34a0_1
  - future=0.18.2=py39h2804cbe_5
  - giflib=5.2.1=h27ca646_2
  - grpcio=1.46.3=py39h518bade_0
  - h5py=3.6.0=nompi_py39hd982b79_100
  - hdf5=1.12.1=nompi_hf9525e8_104
  - importlib-metadata=4.11.4=py39h2804cbe_0
  - importlib_resources=5.7.1=pyhd8ed1ab_1
  - ipykernel=6.13.1=py39h32adebf_0
  - ipython=8.4.0=py39h2804cbe_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.7.0=pyhd8ed1ab_0
  - jedi=0.18.1=py39h2804cbe_1
  - jinja2=3.1.2=pyhd8ed1ab_1
  - joblib=1.1.0=pyhd8ed1ab_0
  - jpeg=9e=h1c322ee_1
  - jsonschema=4.6.0=pyhd8ed1ab_0
  - jupyter=1.0.0=py39h2804cbe_7
  - jupyter_client=7.3.4=pyhd8ed1ab_0
  - jupyter_console=6.4.3=pyhd8ed1ab_0
  - jupyter_core=4.10.0=py39h2804cbe_0
  - jupyterlab_pygments=0.2.2=pyhd8ed1ab_0
  - jupyterlab_widgets=1.1.0=pyhd8ed1ab_0
  - jupyterthemes=0.20.0=py_1
  - kiwisolver=1.4.2=py39h2c803a9_1
  - krb5=1.19.3=hf9b2bbe_0
  - lcms2=2.12=had6a04f_0
  - lerc=3.0=hbdafb3b_0
  - lesscpy=0.15.0=pyhd8ed1ab_0
  - libblas=3.9.0=15_osxarm64_openblas
  - libbrotlicommon=1.0.9=h1c322ee_7
  - libbrotlidec=1.0.9=h1c322ee_7
  - libbrotlienc=1.0.9=h1c322ee_7
  - libcblas=3.9.0=15_osxarm64_openblas
  - libcurl=7.83.1=h2fcd78c_0
  - libcxx=14.0.4=h6a5c8ee_0
  - libdeflate=1.10=h3422bc3_0
  - libedit=3.1.20191231=hc8eb9b7_2
  - libev=4.33=h642e427_1
  - libffi=3.3=h9f76cd9_2
  - libgfortran=5.0.0.dev0=11_0_1_hf114ba7_23
  - libgfortran5=11.0.1.dev0=hf114ba7_23
  - liblapack=3.9.0=15_osxarm64_openblas
  - libnghttp2=1.47.0=he723fca_0
  - libopenblas=0.3.20=openmp_h2209c59_0
  - libpng=1.6.37=hf7e6567_2
  - libsodium=1.0.18=h27ca646_1
  - libssh2=1.10.0=hb80f160_2
  - libtiff=4.4.0=h2810ee2_0
  - libwebp=1.2.2=h0d20362_0
  - libwebp-base=1.2.2=h3422bc3_1
  - libxcb=1.13=h9b22ae9_1004
  - libzlib=1.2.12=h90dfc92_0
  - llvm-openmp=14.0.4=hd125106_0
  - lz4-c=1.9.3=hbdafb3b_1
  - markupsafe=2.1.1=py39hb18efdd_1
  - matplotlib-base=3.5.2=py39h4ee150f_0
  - matplotlib-inline=0.1.3=pyhd8ed1ab_0
  - mistune=0.8.4=py39h5161555_1005
  - munkres=1.1.4=pyh9f0ad1d_0
  - nbclient=0.5.13=pyhd8ed1ab_0
  - nbconvert=6.4.5=py39h2804cbe_0
  - nbformat=5.4.0=pyhd8ed1ab_0
  - ncurses=6.3=h07bb92c_1
  - nest-asyncio=1.5.5=pyhd8ed1ab_0
  - nltk=3.6.7=pyhd8ed1ab_0
  - notebook=6.4.12=pyha770c72_0
  - numpy=1.22.4=py39h7df2422_0
  - openjpeg=2.4.0=h062765e_1
  - openssl=1.1.1o=ha287fd2_0
  - packaging=21.3=pyhd8ed1ab_0
  - pandas=1.4.2=py39hd2dba81_2
  - pandocfilters=1.5.0=pyhd8ed1ab_0
  - parso=0.8.3=pyhd8ed1ab_0
  - patsy=0.5.2=pyhd8ed1ab_0
  - pexpect=4.8.0=pyh9f0ad1d_2
  - pickleshare=0.7.5=py_1003
  - pillow=9.1.1=py39h1b8be2f_1
  - pip=22.1.2=pyhd8ed1ab_0
  - ply=3.11=py_1
  - prometheus_client=0.14.1=pyhd8ed1ab_0
  - prompt-toolkit=3.0.29=pyha770c72_0
  - prompt_toolkit=3.0.29=hd8ed1ab_0
  - psutil=5.9.1=py39h9eb174b_0
  - pthread-stubs=0.4=h27ca646_1001
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.2=pyhd8ed1ab_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pygments=2.12.0=pyhd8ed1ab_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pyrsistent=0.18.1=py39hb18efdd_1
  - python=3.9.0=h4b4120c_5_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-fastjsonschema=2.15.3=pyhd8ed1ab_0
  - python_abi=3.9=2_cp39
  - pytz=2022.1=pyhd8ed1ab_0
  - pyzmq=23.1.0=py39h8faa4b9_0
  - readline=8.1.2=h46ed386_0
  - regex=2022.6.2=py39h9eb174b_0
  - rich=12.4.4=pyhd8ed1ab_0
  - scikit-learn=1.1.1=py39h255bef5_0
  - scipy=1.8.1=py39h14896cb_0
  - seaborn=0.11.2=hd8ed1ab_0
  - seaborn-base=0.11.2=pyhd8ed1ab_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=62.3.4=py39h2804cbe_0
  - soupsieve=2.3.1=pyhd8ed1ab_0
  - sqlite=3.38.5=h40dfcc0_0
  - stack_data=0.2.0=pyhd8ed1ab_0
  - statsmodels=0.13.2=py39h7b9fbcb_0
  - tensorflow-deps=2.9.0=0
  - terminado=0.15.0=py39h2804cbe_0
  - testpath=0.6.0=pyhd8ed1ab_0
  - threadpoolctl=3.1.0=pyh8a188c0_0
  - tk=8.6.12=he1e0b03_0
  - tornado=6.1=py39hb18efdd_3
  - tqdm=4.64.0=pyhd8ed1ab_0
  - traitlets=5.2.2.post1=pyhd8ed1ab_0
  - typing_extensions=4.2.0=pyha770c72_1
  - tzdata=2022a=h191b570_0
  - unicodedata2=14.0.0=py39hb18efdd_1
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.1=pyhd8ed1ab_0
  - widgetsnbextension=3.6.0=py39h2804cbe_0
  - xorg-libxau=1.0.9=h27ca646_0
  - xorg-libxdmcp=1.1.3=h27ca646_0
  - xz=5.2.5=h642e427_1
  - zeromq=4.3.4=hbdafb3b_1
  - zipp=3.8.0=pyhd8ed1ab_0
  - zlib=1.2.12=h90dfc92_0
  - zstd=1.5.2=hd705a24_1
  - pip:
    - absl-py==1.1.0
    - astunparse==1.6.3
    - cachetools==5.2.0
    - certifi==2022.5.18.1
    - charset-normalizer==2.0.12
    - flatbuffers==1.12
    - gast==0.4.0
    - google-auth==2.7.0
    - google-auth-oauthlib==0.4.6
    - google-pasta==0.2.0
    - idna==3.3
    - keras==2.9.0
    - keras-preprocessing==1.1.2
    - libclang==14.0.1
    - markdown==3.3.7
    - oauthlib==3.2.0
    - opt-einsum==3.3.0
    - protobuf==3.19.4
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - requests==2.28.0
    - requests-oauthlib==1.3.1
    - rsa==4.8
    - six==1.15.0
    - tensorboard==2.9.1
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.1
    - tensorflow-estimator==2.9.0
    - tensorflow-macos==2.9.2
    - tensorflow-metal==0.5.0
    - termcolor==1.1.0
    - urllib3==1.26.9
    - werkzeug==2.1.2
    - wrapt==1.14.1
prefix: /Users/xiao_deng/miniforge3/envs/tf
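The environment above can be recreated from this file with conda (a sketch, assuming the file is saved at Questions/Q3/conda_env.yml; the pinned osx-arm64 build strings mean it will only resolve as-is on Apple-silicon macOS):

```shell
# Recreate and activate the "tf" environment from the exported spec.
conda env create -f Questions/Q3/conda_env.yml
conda activate tf
```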
BIN
Questions/Q3/imdb_cls_model.h5
Normal file
Binary file not shown.
911
Questions/Q3/imdb_reviews_cls.ipynb
Normal file
File diff suppressed because one or more lines are too long
BIN
Questions/Q3/word_code_map.pkl
Normal file
Binary file not shown.