SELF-STUDY LIBRARY FOR DEEP LEARNING NATURAL LANGUAGE PROCESSING (NLP)
General NLP tools/libraries
- HuggingFace Transformers – State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0: https://github.com/huggingface/transformers
- Stanza – A Python NLP Package for Many Human Languages: https://stanfordnlp.github.io/stanza
- NLTK – Natural Language Toolkit: https://www.nltk.org/
- spaCy – Industrial-Strength Natural Language Processing: https://spacy.io
- Flair NLP: https://github.com/flairNLP/flair
- HuggingFace datasets and evaluation metrics: https://github.com/huggingface/datasets
- AdapterHub Repo for pre-trained adapter modules: https://adapterhub.ml/
- Facebook platform for training and evaluating dialogue models: https://parl.ai/
- Facebook AI Seq2Seq library: https://github.com/pytorch/fairseq
- OpenNMT for Machine Translation: https://opennmt.net/
- NVIDIA conversational AI library based on PyTorch Lightning: https://github.com/NVIDIA/NeMo
- Stanford NLP group: https://nlp.stanford.edu/software/
- Rasa: Open source conversational AI: https://rasa.com/
- DeepPavlov: Open source conversational AI Framework: https://deeppavlov.ai/
- fastText – library for efficient text classification and representation learning: https://fasttext.cc
- NVIDIA’s Megatron for training large-scale language models: https://github.com/NVIDIA/Megatron-LM
- Xatkit – The easiest way to build powerful bots and chatbots: https://github.com/xatkit-bot-platform/xatkit
- Gensim – Library for topic modeling, document indexing and similarity retrieval with large corpora: https://github.com/RaRe-Technologies/gensim
- CoreNLP: a library that aims to make applying linguistic analysis tools to a piece of text easy and efficient: https://github.com/stanfordnlp/CoreNLP
- OpenNLP: A powerful tool with a lot of features and ready for production workloads if you’re using Java: https://opennlp.apache.org/
- TextBrewer – A PyTorch-based toolkit for NLP containing different distillation methods: https://github.com/airaria/TextBrewer
General NLP datasets/benchmarks
- Stanford Question Answering Dataset (SQuAD): https://rajpurkar.github.io/SQuAD-explorer/
- General Language Understanding Evaluation (GLUE) benchmark: https://gluebenchmark.com/
- Machine Translation (WMT): http://www.statmt.org/wmt20/
- NLP Progress: http://nlpprogress.com/
- Quora Question Pairs: https://www.kaggle.com/c/quora-question-pairs
- [Dataset] The Multi-Genre NLI Corpus (MultiNLI): https://cims.nyu.edu/~sbowman/multinli/ (A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference)
- [Dataset] Textual Entailment Resource Pool (RTE): https://aclweb.org/aclwiki/Textual_Entailment_Resource_Pool
- [Dataset] The WikiText Long Term Dependency Language Modeling Dataset: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
- [Dataset] Open WebText – an open-source effort to reproduce OpenAI’s WebText dataset: https://skylion007.github.io/OpenWebTextCorpus/
- SemEval: https://semeval.github.io/
- Metatext curated NLP datasets
- Dataset list – Natural Language Processing: https://metatext.io
- [Dataset] Asian Language Treebank (ALT) project (13 parallel corpora. Treebank of some languages): https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
- [Website] The Big Bad NLP Database: https://datasets.quantumstat.com/
- [Website] Word Sense Disambiguation Dataset: http://danlou.github.io/uwa/
- [Website] NLP Datasets by niderhoff [GitHub, 4400 stars]: https://github.com/niderhoff/nlp-datasets
- [Website] 25 Best Parallel Text Datasets for Machine Translation Training: https://lionbridge.ai/datasets/25-best-parallel-text-datasets-for-machine-translation-training/
- [Website] 20 Best German Language Datasets for Machine Learning: https://lionbridge.ai/datasets/20-best-german-language-datasets-for-machine-learning/
- XTREME – a multi-task benchmark evaluating cross-lingual generalization of multilingual representations: https://github.com/google-research/xtreme
General NLP learning resources
- Stanford Natural Language Processing with Deep Learning (Free): http://web.stanford.edu/class/cs224n/
- Online version-XCS224N (Fee): Natural Language Processing with Deep Learning | Stanford Online
- Deeplearning.ai Natural Language Processing Specialization (Fee/financial aid available): https://www.deeplearning.ai/natural-language-processing-specialization/
- CS124 – From Languages to Information (Free): From Languages to Information – YouTube
- BERT tutorials by Chris McCormick (Fee):
- The BERT Collection:
- Neural Network for NLP – Graham Neubig CMU: http://phontron.com/class/nn4nlp2020/schedule.html
- Multilingual NLP – Graham Neubig CMU: http://demo.clab.cs.cmu.edu/11737fa20/
- [Book] Neural Network Methods for Natural Language Processing (Y. Goldberg): https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037
- [Book] Cross-Lingual Word Embeddings (Sebastian Ruder et al.): https://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=1419
- [Book] Speech and Language Processing (3rd ed. draft): https://web.stanford.edu/~jurafsky/slp3/
- [Book] Speech and Language Processing, 2nd Edition: https://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210
- [Book] Foundations of Statistical Natural Language Processing: https://nlp.stanford.edu/fsnlp/
- [Book] Natural Language Processing (draft 2018): http://cseweb.ucsd.edu/~nnakashole/teaching/eisenstein-nov18.pdf
- [Book] Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze): https://nlp.stanford.edu/IR-book/information-retrieval-book.html
- [Blog] 100 days of NLP: https://github.com/graviraja/100-Days-of-NLP/tree/master/architectures
- [Website] The Natural Language Processing Dictionary: http://www.cse.unsw.edu.au/~billw/nlpdict.html
- [Blog] Lil’Log: https://lilianweng.github.io/lil-log/
- [Blog] Sebastian Ruder: https://ruder.io/
- [Blog] Jay Alammar: https://jalammar.github.io
- [Blog] Chris McCormick: https://Mccormickml.com
- [Blog] Google AI blog: Google AI Blog
- [Blog] Facebook AI blog: https://research.fb.com/blog/
- [Blog] Salesforce Research blog: https://www.salesforce.com/research/
- [Blog] Jlibovicky’s blog (about MT): https://jlibovicky.github.io
- [Book] Real-world Natural Language Processing: https://drive.google.com/file/d/1nkCXR_AtOMV7NSmV3azj2petTYmWTVeT/view?usp=sharing
- [Blog] FloydHub Blog: https://blog.floydhub.com/
- [Blog] Marek Rei Blog – Thoughts on ML and NLP: http://www.marekrei.com/blog/
- [Website] paperswithcode.com: https://paperswithcode.com/area/natural-language-processing
- [Website] Trending on EMNLP-2020: https://emnlp-2020.herokuapp.com/
- [Album] NLP-Papers: https://github.com/gyunggyung/NLP-Papers
- [Album] 100 Must-Read NLP Papers: https://github.com/mhagiwara/100-nlp-papers
- [Album] The Best NLP Papers From ICLR 2020:
- [Slide] From perceptrons to word embeddings: https://drive.google.com/file/d/111AAxCQsr8uVsInyrEPxIbSkOBBipkGX/view?usp=sharing
- [Slide] Intel Natural Language Processing Course: https://software.intel.com/content/www/us/en/develop/training/course-natural-language-processing.html
- [Book] Natural Language Processing with Python: https://www.nltk.org/book/
- [Book] Practical Natural Language Processing: http://www.practicalnlp.ai/
- [Course] Extension to NLP course at Yandex School of Data Analysis: https://lena-voita.github.io/nlp_course.html
- [Album] A collection of 400+ Survey Papers on NLP and ML: https://github.com/NiuTrans/ABigSurvey
- [Website] The Super Duper NLP Repo (Demos in Colab): https://notebooks.quantumstat.com/
- [Website] The Model Forge (Models with respective source URLs): https://models.quantumstat.com/
- [Book-Draft] Embeddings in Natural Language Processing: http://josecamachocollados.com/book_embNLP_draft.pdf
- Top conferences such as: EMNLP, ACL, NAACL, …
Vietnamese NLP tools / libraries
- VnCoreNLP
- vncorenlp/VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018)
- PhoBERT: https://github.com/VinAIResearch/PhoBERT
- vELECTRA: https://github.com/fpt-corp/vELECTRA
- Underthesea: https://github.com/undertheseanlp/underthesea
- Pyvi: https://github.com/trungtv/pyvi
- Coccoc tokenizer: https://github.com/coccoc/coccoc-tokenizer
- NK-VECTOR (NLP with JS): https://github.com/trinhdoduyhungss/nk-vector
- VNTK (NLP with JS): https://github.com/vunb/vntk
- VietChunker (chunking): https://vlsp.hpda.vn/demo/?page=resources
- ETNLP (extract, evaluate, visualize multiple embeddings): https://github.com/vietnlp/etnlp
- PhoNLP – a BERT-based multitask learning toolkit for Vietnamese POS tagging, NER and dependency parsing: https://github.com/VinAIResearch/PhoNLP
Vietnamese NLP datasets / benchmarks
- NLP Progress Vietnamese: https://github.com/undertheseanlp/NLP-Vietnamese-progress
- VLSP: https://vlsp.org.vn/
- Vietnamese Text2SQL: https://github.com/VinAIResearch/ViText2SQL
- NLP@UIT Research Group: https://sites.google.com/uit.edu.vn/uit-nlp/datasets-projects
- [Dataset] Vietnamese German Dataset: https://www.kaggle.com/flightstar/vietnamese-german-dataset
Vietnamese NLP learning resources
- 100 exercises of NLP: https://github.com/minhpqn/nlp_100_drill_exercises_ver_2020
- Top 100 NLP Questions: https://drive.google.com/file/d/1L_9FKt10dWnzTnM0DJdQrU3Esf1f5_c5/view?bclid=IwAR1ixGmWLu7Yw6vd75rCLOSNFDrpeZIdrlKJxlPzUOIl2rzkLje_wckzOAE
NLP Research Groups
- [USA] University of Washington, Homepage: https://www.cs.washington.edu/research/nlp
- uwnlp: https://github.com/uwnlp
- [Germany] Technical University of Darmstadt: https://www.informatik.tu-darmstadt.de/ukp/ukp_home/index.en.jsp
- UKPLab: https://github.com/UKPLab
- [UK] University of Edinburgh: https://www.ed.ac.uk/
- EdinburghNLP: https://github.com/EdinburghNLP
- [USA] Harvard University
- Homepage: Harvard NLP
- HNLP (github.com)
- [Singapore] Nanyang Technological University, Homepage: https://ntunlpsg.github.io/
(Source: VietAI)