Data Science with Deep Learning & NLP Advanced Techniques Part-1

Devendra Parihar
3 min read · Feb 1, 2023

This is a collection of the best Kaggle notebooks (kernels), posts, and other resources covering advanced Data Science techniques (including NLP) with Deep Learning (DL), including notebooks and discussion posts from Prize Competition winners.

Sources:

— Notebooks (kernels) and posts of the Prize Competition Winners

— Notebooks (kernels) of Kaggle Grandmasters, Masters, or Experts

— Detailed tutorials of the leading Python libraries

etc.

1. Prize Competition Winners: notebooks (kernels) and posts with Magic

1) Image recognition

a) Cassava Leaf Disease Classification

1st Place Solution — Gold Medal, 1st place (Private LB) out of 3900 teams

The authors tried a variety of architectures (e.g., all EfficientNet variants, ResNet, ResNeXt, Xception, ViT, DeiT, Inception, and MobileNet) combined with different pre-trained weights (trained, e.g., on ImageNet, NoisyStudent, PlantVillage, iNaturalist…), some of which were available on TensorFlow Hub.

Their final submission first averaged the predicted class probabilities of ViT and ResNeXt. In a second stage, this averaged probability vector was then merged with the predicted probabilities of EfficientNetB4 and CropNet by simply summing the values.
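The two-stage blend described above can be sketched with plain NumPy. The probability values below are made-up placeholders, not the winners' actual outputs:

```python
import numpy as np

# Hypothetical per-class probability vectors from each model (5 cassava classes).
p_vit     = np.array([0.10, 0.05, 0.70, 0.10, 0.05])
p_resnext = np.array([0.20, 0.05, 0.60, 0.10, 0.05])
p_effb4   = np.array([0.15, 0.10, 0.55, 0.15, 0.05])
p_cropnet = np.array([0.10, 0.10, 0.65, 0.10, 0.05])

# Stage 1: average the ViT and ResNeXt probabilities.
p_stage1 = (p_vit + p_resnext) / 2

# Stage 2: sum the stage-1 vector with the EfficientNetB4 and CropNet
# probabilities; the argmax is unaffected by skipping re-normalization.
p_final = p_stage1 + p_effb4 + p_cropnet
predicted_class = int(np.argmax(p_final))
```

Summing unnormalized probabilities works here because only the relative ranking of classes matters for the final prediction.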

Discussion Link

2nd Place Solution — Gold Medal, 2nd place (Private LB) out of 3900 teams

TensorFlow Hub, Keras, MobileNetV3, TPU, ~30 epochs.

All explanations are in his post: Post Link

3rd Place Solution — Gold Medal, 3rd place (Private LB) out of 3900 teams

  • Ensemble of three ViT models
  • Weighted Averaging
  • 5-fold StratifiedKFold
  • Augmentation
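A minimal sketch of the weighted-averaging step over the three ViT models' predictions. The weights and probabilities below are hypothetical; in practice the out-of-fold predictions would come from the 5-fold StratifiedKFold split:

```python
import numpy as np

# Hypothetical class probabilities from three ViT models (4 samples, 5 classes).
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5), size=4) for _ in range(3)]

# Weighted averaging: per-model weights (e.g. derived from CV scores) sum to 1,
# so each ensembled row remains a valid probability distribution.
weights = np.array([0.5, 0.3, 0.2])
ensemble = sum(w * p for w, p in zip(weights, probs))

preds = ensemble.argmax(axis=1)
```

Because the weights sum to 1 and each model's row sums to 1, the ensembled rows also sum to 1 without an explicit normalization step.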

All explanations are in his post: Post Link

Silver Medal, 28th place (Private LB), 28th place (Public LB) out of 3900 teams

  • Preprocessing, label smoothing,
  • Weighted ensemble of original image inference and augmented one; a weighted average of no TTA and TTA (Test Time Augmentation)
  • Base Model: EfficientNet B4 with Noisy Student, SE-ResNeXt50 (32x4d), Vision Transformer (base patch16)
  • Many tips for handling noisy data, and more.
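Two of the techniques above, label smoothing and the weighted no-TTA/TTA blend, can be sketched as follows. The `eps` and weight values are illustrative assumptions, not the author's settings:

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    # Label smoothing: move eps of the probability mass to a uniform
    # distribution, softening one-hot targets to reduce overconfidence
    # on noisy labels.
    n_classes = onehot.shape[-1]
    return onehot * (1.0 - eps) + eps / n_classes

def blend_tta(p_plain, p_tta, w_plain=0.6):
    # Weighted average of predictions without and with
    # test-time augmentation (TTA).
    return w_plain * p_plain + (1.0 - w_plain) * p_tta

target = smooth_labels(np.eye(5)[2])                  # soft target for class 2
blended = blend_tta(np.full(5, 0.2), np.full(5, 0.2))  # dummy predictions
```

With `eps=0.1` and 5 classes, the true class receives 0.92 and every other class 0.02, so the target still sums to 1.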

Notebook Link

See the explanation in the author’s GitHub: https://github.com/IMOKURI/Cassava-Leaf-Disease-Classification

b) RANZCR CLiP — Catheter and Line Position Challenge

1st Place Solution Kernels (small ver.) — Gold Medal, 1st place (Private LB) out of 1547 teams

The author's full solution used 4 training stages, which is complex, but the minimal pipeline has only 2 stages.

He published 3 notebooks to demonstrate how the minimal pipeline works.

Stage1: Segmentation (https://www.kaggle.com/haqishen/ranzcr-1st-place-soluiton-seg-model-small-ver)

Stage2: Classification (https://www.kaggle.com/haqishen/ranzcr-1st-place-soluiton-cls-model-small-ver)

Inference (https://www.kaggle.com/haqishen/ranzcr-1st-place-soluiton-inference-small-ver)
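As a rough illustration of a two-stage segmentation-then-classification pipeline (a generic sketch under assumed interfaces, not the author's exact architecture; see the linked notebooks for that), the stage-1 mask can be passed to the classifier as an extra input channel:

```python
import numpy as np

def stage1_segment(image):
    # Placeholder for the stage-1 segmentation model: returns a binary
    # mask (e.g. highlighting catheters/lines) with shape (H, W, 1).
    return (image.mean(axis=-1, keepdims=True) > 0.5).astype(np.float32)

def stage2_classifier_input(image, mask):
    # Stack the predicted mask onto the RGB image along the channel
    # axis, giving the stage-2 classifier a 4-channel input.
    return np.concatenate([image, mask], axis=-1)

image = np.random.default_rng(0).random((512, 512, 3)).astype(np.float32)
mask = stage1_segment(image)
x = stage2_classifier_input(image, mask)
```

The key idea is simply that stage 2 consumes stage 1's output alongside the raw image, so the classifier can attend to the regions the segmentation model found relevant.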

Dual-Head Model with 4-stage Training, 2nd Place Solution Gold Medal

Thanks to @steamedsheep, @nvnnghia, @cdeotte, @underwearfitting

Dual-Head Model with 4-stage Training: Post Link

The final model used ResNet200d, EfficientNet-B5, and EfficientNet-B7 as backbones; the U-Net decoder part was reduced in order to train with reasonable VRAM usage. See the post for more detail.

2) Natural Language Processing (NLP)

a) Jigsaw Unintended Bias in Toxicity Classification

Wombat Inference Kernel — 4th place (Private LB) out of 3165 teams

LSTM, BERT, GPT-2, and CNN models, merged into a blend of 23 models and solutions.

Notebook Link

Jigsaw_predict — 8th place (Private LB) out of 3165 teams

PyTorch, Multi-Sample Dropout, and a blend of 4 models and solutions:

  • Bert Small V2 29bin 300seq NAUX,
  • Bert Large V2 99bin 250seq,
  • XLNet 9bin 220seq,
  • GPT2 V2 29bin 350seq NAUX
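Multi-Sample Dropout applies several independent dropout masks to the same feature vector, runs each through the classifier head, and averages the outputs (or losses), which tends to stabilize and speed up training. A minimal NumPy sketch, with illustrative shapes and dropout rate (the solution itself uses PyTorch):

```python
import numpy as np

rng = np.random.default_rng(42)

def multi_sample_dropout(features, weight, n_samples=4, p_drop=0.5):
    # Apply n_samples independent dropout masks to the same features,
    # pass each masked copy through a shared linear head, and average
    # the resulting logits.
    outputs = []
    for _ in range(n_samples):
        mask = rng.random(features.shape) >= p_drop
        dropped = features * mask / (1.0 - p_drop)  # inverted-dropout scaling
        outputs.append(dropped @ weight)            # shared classifier head
    return np.mean(outputs, axis=0)

features = rng.standard_normal(16)      # pooled encoder features
weight = rng.standard_normal((16, 4))   # head weights (4 output bins)
logits = multi_sample_dropout(features, weight)
```

Because the head weights are shared across samples, the extra dropout passes add almost no parameters or memory, only a few cheap forward passes through the final layer.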

Notebook Link
