init repo
92
Questions/Q3/aclImdb/README
Normal file
@@ -0,0 +1,92 @@
Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any
given movie, because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is gained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included, and there are an
even number of reviews with scores > 5 and <= 5.

Files

There are two top-level directories [train/, test/] corresponding to
the training and test sets. Each contains [pos/, neg/] directories for
the reviews with binary labels positive and negative. Within these
directories, reviews are stored in text files named following the
convention [[id]_[rating].txt], where [id] is a unique id and [rating]
is the star rating for that review on a 1-10 scale. For example, the
file [test/pos/200_8.txt] is the text for a positive-labeled test set
example with unique id 200 and star rating 8/10 from IMDb. The
[train/unsup/] directory has 0 for all ratings because the ratings are
omitted for this portion of the dataset.
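The naming convention above can be sketched in Python (a minimal illustration; `parse_review_filename` is a hypothetical helper, not part of the dataset's own tooling):

```python
import os
import re

def parse_review_filename(path):
    """Parse a review file name of the form [id]_[rating].txt into its
    unique id and star rating (rating is 0 for train/unsup/ files)."""
    match = re.fullmatch(r"(\d+)_(\d+)\.txt", os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected review file name: {path}")
    return int(match.group(1)), int(match.group(2))

# A positive-labeled test review with unique id 200 and rating 8/10:
print(parse_review_filename("test/pos/200_8.txt"))  # (200, 8)
```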
We also include the IMDb URLs for each review in separate
[urls_[pos, neg, unsup].txt] files. A review with unique id 200 will
have its URL on line 200 of this file. Due to the ever-changing IMDb,
we are unable to link directly to the review, but only to the movie's
review page.

In addition to the review text files, we include the already-tokenized
bag of words (BoW) features that were used in our experiments. These
are stored in .feat files in the train/test directories. Each .feat
file is in LIBSVM format, an ascii sparse-vector format for labeled
data. The feature indices in these files start from 0, and the text
token corresponding to a feature index can be found in [imdb.vocab].
So a line with 0:7 in a .feat file means the first word in
[imdb.vocab] (the) appears 7 times in that review.
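Decoding one such line against the vocabulary might look like this (a hedged sketch; `decode_feat_line` and the three-word toy vocabulary are illustrative stand-ins for the real imdb.vocab, which has one word per line):

```python
def decode_feat_line(line, vocab):
    """Decode one LIBSVM-format line: "<label> <index>:<count> ...",
    where indices are 0-based positions into the vocabulary list."""
    fields = line.split()
    label = int(fields[0])
    counts = {}
    for entry in fields[1:]:
        idx, count = entry.split(":")
        counts[vocab[int(idx)]] = int(count)
    return label, counts

# Toy vocabulary standing in for imdb.vocab.
vocab = ["the", "and", "a"]
label, counts = decode_feat_line("8 0:7 2:1", vocab)
print(label, counts)  # 8 {'the': 7, 'a': 1}
```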
See the LIBSVM page for details on the .feat file format:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

We also include [imdbEr.txt], which contains the expected rating for
each token in [imdb.vocab] as computed by (Potts, 2011). The expected
rating is a good way to get a sense for the average polarity of a word
in the dataset.

Citing the dataset

When using this dataset, please cite our ACL 2011 paper, which
introduces it. This paper also contains classification results which
you may want to compare against.

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas:
amaas@cs.stanford.edu
89527
Questions/Q3/aclImdb/imdb.vocab
Normal file
File diff suppressed because it is too large
25000
Questions/Q3/aclImdb/test/labeledBow.feat
Normal file
File diff suppressed because one or more lines are too long
25000
Questions/Q3/aclImdb/train/labeledBow.feat
Normal file
File diff suppressed because it is too large
50000
Questions/Q3/aclImdb/train/unsupBow.feat
Normal file
File diff suppressed because one or more lines are too long
BIN
Questions/Q3/aclImdb_v1.tar.gz
Normal file
Binary file not shown.
202
Questions/Q3/conda_env.yml
Normal file
@@ -0,0 +1,202 @@
name: tf
channels:
  - apple
  - conda-forge
dependencies:
  - appnope=0.1.3=pyhd8ed1ab_0
  - argon2-cffi=21.3.0=pyhd8ed1ab_0
  - argon2-cffi-bindings=21.2.0=py39hb18efdd_2
  - asttokens=2.0.5=pyhd8ed1ab_0
  - attrs=21.4.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - beautifulsoup4=4.11.1=pyha770c72_0
  - bleach=5.0.0=pyhd8ed1ab_0
  - brotli=1.0.9=h1c322ee_7
  - brotli-bin=1.0.9=h1c322ee_7
  - c-ares=1.18.1=h3422bc3_0
  - ca-certificates=2022.5.18.1=h4653dfc_0
  - cached-property=1.5.2=hd8ed1ab_1
  - cached_property=1.5.2=pyha770c72_1
  - cffi=1.14.6=py39hda8b47f_0
  - click=8.1.3=py39h2804cbe_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - commonmark=0.9.1=py_0
  - cycler=0.11.0=pyhd8ed1ab_0
  - dataclasses=0.8=pyhc8e2a94_3
  - debugpy=1.6.0=py39h0ef5a74_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.4=pyhd8ed1ab_0
  - executing=0.8.3=pyhd8ed1ab_0
  - flit-core=3.7.1=pyhd8ed1ab_0
  - fonttools=4.33.3=py39h9eb174b_0
  - freetype=2.10.4=h17b34a0_1
  - future=0.18.2=py39h2804cbe_5
  - giflib=5.2.1=h27ca646_2
  - grpcio=1.46.3=py39h518bade_0
  - h5py=3.6.0=nompi_py39hd982b79_100
  - hdf5=1.12.1=nompi_hf9525e8_104
  - importlib-metadata=4.11.4=py39h2804cbe_0
  - importlib_resources=5.7.1=pyhd8ed1ab_1
  - ipykernel=6.13.1=py39h32adebf_0
  - ipython=8.4.0=py39h2804cbe_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.7.0=pyhd8ed1ab_0
  - jedi=0.18.1=py39h2804cbe_1
  - jinja2=3.1.2=pyhd8ed1ab_1
  - joblib=1.1.0=pyhd8ed1ab_0
  - jpeg=9e=h1c322ee_1
  - jsonschema=4.6.0=pyhd8ed1ab_0
  - jupyter=1.0.0=py39h2804cbe_7
  - jupyter_client=7.3.4=pyhd8ed1ab_0
  - jupyter_console=6.4.3=pyhd8ed1ab_0
  - jupyter_core=4.10.0=py39h2804cbe_0
  - jupyterlab_pygments=0.2.2=pyhd8ed1ab_0
  - jupyterlab_widgets=1.1.0=pyhd8ed1ab_0
  - jupyterthemes=0.20.0=py_1
  - kiwisolver=1.4.2=py39h2c803a9_1
  - krb5=1.19.3=hf9b2bbe_0
  - lcms2=2.12=had6a04f_0
  - lerc=3.0=hbdafb3b_0
  - lesscpy=0.15.0=pyhd8ed1ab_0
  - libblas=3.9.0=15_osxarm64_openblas
  - libbrotlicommon=1.0.9=h1c322ee_7
  - libbrotlidec=1.0.9=h1c322ee_7
  - libbrotlienc=1.0.9=h1c322ee_7
  - libcblas=3.9.0=15_osxarm64_openblas
  - libcurl=7.83.1=h2fcd78c_0
  - libcxx=14.0.4=h6a5c8ee_0
  - libdeflate=1.10=h3422bc3_0
  - libedit=3.1.20191231=hc8eb9b7_2
  - libev=4.33=h642e427_1
  - libffi=3.3=h9f76cd9_2
  - libgfortran=5.0.0.dev0=11_0_1_hf114ba7_23
  - libgfortran5=11.0.1.dev0=hf114ba7_23
  - liblapack=3.9.0=15_osxarm64_openblas
  - libnghttp2=1.47.0=he723fca_0
  - libopenblas=0.3.20=openmp_h2209c59_0
  - libpng=1.6.37=hf7e6567_2
  - libsodium=1.0.18=h27ca646_1
  - libssh2=1.10.0=hb80f160_2
  - libtiff=4.4.0=h2810ee2_0
  - libwebp=1.2.2=h0d20362_0
  - libwebp-base=1.2.2=h3422bc3_1
  - libxcb=1.13=h9b22ae9_1004
  - libzlib=1.2.12=h90dfc92_0
  - llvm-openmp=14.0.4=hd125106_0
  - lz4-c=1.9.3=hbdafb3b_1
  - markupsafe=2.1.1=py39hb18efdd_1
  - matplotlib-base=3.5.2=py39h4ee150f_0
  - matplotlib-inline=0.1.3=pyhd8ed1ab_0
  - mistune=0.8.4=py39h5161555_1005
  - munkres=1.1.4=pyh9f0ad1d_0
  - nbclient=0.5.13=pyhd8ed1ab_0
  - nbconvert=6.4.5=py39h2804cbe_0
  - nbformat=5.4.0=pyhd8ed1ab_0
  - ncurses=6.3=h07bb92c_1
  - nest-asyncio=1.5.5=pyhd8ed1ab_0
  - nltk=3.6.7=pyhd8ed1ab_0
  - notebook=6.4.12=pyha770c72_0
  - numpy=1.22.4=py39h7df2422_0
  - openjpeg=2.4.0=h062765e_1
  - openssl=1.1.1o=ha287fd2_0
  - packaging=21.3=pyhd8ed1ab_0
  - pandas=1.4.2=py39hd2dba81_2
  - pandocfilters=1.5.0=pyhd8ed1ab_0
  - parso=0.8.3=pyhd8ed1ab_0
  - patsy=0.5.2=pyhd8ed1ab_0
  - pexpect=4.8.0=pyh9f0ad1d_2
  - pickleshare=0.7.5=py_1003
  - pillow=9.1.1=py39h1b8be2f_1
  - pip=22.1.2=pyhd8ed1ab_0
  - ply=3.11=py_1
  - prometheus_client=0.14.1=pyhd8ed1ab_0
  - prompt-toolkit=3.0.29=pyha770c72_0
  - prompt_toolkit=3.0.29=hd8ed1ab_0
  - psutil=5.9.1=py39h9eb174b_0
  - pthread-stubs=0.4=h27ca646_1001
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.2=pyhd8ed1ab_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pygments=2.12.0=pyhd8ed1ab_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pyrsistent=0.18.1=py39hb18efdd_1
  - python=3.9.0=h4b4120c_5_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-fastjsonschema=2.15.3=pyhd8ed1ab_0
  - python_abi=3.9=2_cp39
  - pytz=2022.1=pyhd8ed1ab_0
  - pyzmq=23.1.0=py39h8faa4b9_0
  - readline=8.1.2=h46ed386_0
  - regex=2022.6.2=py39h9eb174b_0
  - rich=12.4.4=pyhd8ed1ab_0
  - scikit-learn=1.1.1=py39h255bef5_0
  - scipy=1.8.1=py39h14896cb_0
  - seaborn=0.11.2=hd8ed1ab_0
  - seaborn-base=0.11.2=pyhd8ed1ab_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=62.3.4=py39h2804cbe_0
  - soupsieve=2.3.1=pyhd8ed1ab_0
  - sqlite=3.38.5=h40dfcc0_0
  - stack_data=0.2.0=pyhd8ed1ab_0
  - statsmodels=0.13.2=py39h7b9fbcb_0
  - tensorflow-deps=2.9.0=0
  - terminado=0.15.0=py39h2804cbe_0
  - testpath=0.6.0=pyhd8ed1ab_0
  - threadpoolctl=3.1.0=pyh8a188c0_0
  - tk=8.6.12=he1e0b03_0
  - tornado=6.1=py39hb18efdd_3
  - tqdm=4.64.0=pyhd8ed1ab_0
  - traitlets=5.2.2.post1=pyhd8ed1ab_0
  - typing_extensions=4.2.0=pyha770c72_1
  - tzdata=2022a=h191b570_0
  - unicodedata2=14.0.0=py39hb18efdd_1
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.1=pyhd8ed1ab_0
  - widgetsnbextension=3.6.0=py39h2804cbe_0
  - xorg-libxau=1.0.9=h27ca646_0
  - xorg-libxdmcp=1.1.3=h27ca646_0
  - xz=5.2.5=h642e427_1
  - zeromq=4.3.4=hbdafb3b_1
  - zipp=3.8.0=pyhd8ed1ab_0
  - zlib=1.2.12=h90dfc92_0
  - zstd=1.5.2=hd705a24_1
  - pip:
    - absl-py==1.1.0
    - astunparse==1.6.3
    - cachetools==5.2.0
    - certifi==2022.5.18.1
    - charset-normalizer==2.0.12
    - flatbuffers==1.12
    - gast==0.4.0
    - google-auth==2.7.0
    - google-auth-oauthlib==0.4.6
    - google-pasta==0.2.0
    - idna==3.3
    - keras==2.9.0
    - keras-preprocessing==1.1.2
    - libclang==14.0.1
    - markdown==3.3.7
    - oauthlib==3.2.0
    - opt-einsum==3.3.0
    - protobuf==3.19.4
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - requests==2.28.0
    - requests-oauthlib==1.3.1
    - rsa==4.8
    - six==1.15.0
    - tensorboard==2.9.1
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.1
    - tensorflow-estimator==2.9.0
    - tensorflow-macos==2.9.2
    - tensorflow-metal==0.5.0
    - termcolor==1.1.0
    - urllib3==1.26.9
    - werkzeug==2.1.2
    - wrapt==1.14.1
prefix: /Users/xiao_deng/miniforge3/envs/tf
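The environment above can be recreated from this file with conda (a sketch, assuming the file is saved at Questions/Q3/conda_env.yml; the pinned osx-arm64 build strings mean it will only resolve as-is on Apple-silicon macOS):

```shell
# Recreate and activate the "tf" environment from the exported spec.
conda env create -f Questions/Q3/conda_env.yml
conda activate tf
```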
BIN
Questions/Q3/imdb_cls_model.h5
Normal file
Binary file not shown.
911
Questions/Q3/imdb_reviews_cls.ipynb
Normal file
File diff suppressed because one or more lines are too long
BIN
Questions/Q3/word_code_map.pkl
Normal file
Binary file not shown.