Once text cleaning is done, the next important step is to convert the text data to numbers, because machine learning models work only with numbers. There are several ways to convert text to vectors: Count Vectorizer, TF-IDF Vectorizer, and word embeddings. Here we will discuss Count Vectorizer and TF-IDF and how they are calculated:

COUNT VECTORIZER (sometimes loosely called ONE HOT ENCODING): This is the basic technique for representing text data numerically. The unique words (called features) are extracted from the text, and then the frequency of each word in the text is calculated. The features extracted from the…
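To make the counting step concrete, here is a minimal sketch in plain Python. It assumes the text is already cleaned and that splitting on whitespace is good enough for tokenizing; the function name count_vectorize and the two sample sentences are made up for illustration and are not from any library.

```python
from collections import Counter


def count_vectorize(docs):
    """Return (vocabulary, count_matrix) for a list of already-cleaned strings.

    Hypothetical helper for illustration: tokenizes on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # The features are the unique words across all documents, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per feature; a missing word counts as 0.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["the cat sat", "the cat saw the dog"]  # made-up sample texts
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one text and each column is one feature, so the 2 in the second row says that "the" appears twice in the second sentence.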

Using the nltk library.

NLTK is a string-processing library: it takes strings as input and returns either a string or lists of strings. It provides a lot of algorithms, which makes it especially useful for learning, since one can compare the outputs of different variants. There are other libraries as well, such as spaCy, CoreNLP, PyNLPl, and Polyglot. NLTK and spaCy are the most widely used. spaCy works well with large volumes of text and for advanced NLP.

To get an understanding of the basic text cleaning processes I’m using the nltk library which, I think, is good for…

Other than 3V’s!

As you are probably aware, Big Data is mainly defined by the 3V’s, i.e., volume, variety, and velocity.
VOLUME: The amount of data is huge.
VARIETY: The data comes in multiple forms.
VELOCITY: A huge amount of streaming data is continuously analyzed in near real time.
But there is something more that differentiates big data from small data. Let’s have a look at these differences:

GOAL: Small data helps to accomplish a single task by analyzing the data, whereas in Big Data the goal evolves and can be redirected in unexpected ways. …


What’s your first reaction when you are given a problem? You go through the problem statement and try to respond with a quick solution. It’s good to be quick and to put forward your views. But sometimes a quick recommendation fails to deliver good results, or even creates a new and bigger problem. So where do these failures and new problems come from? They arise because you start solving the problem without stopping to think first. …

You have a data set and you’re ready to start model training and predictions, but wait! Is this data ready to train the algorithm? You probably know the answer: a big NO. So what does it take to make our data ready for model building? Here are the steps to follow, which are often referred to as data preprocessing techniques:

  1. Importing the libraries
  2. Reading the data set
  3. Checking and handling missing values
  4. Encoding techniques
  5. Feature Scaling
  6. Splitting the data set

Now let’s go through all of these one by one:
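As a rough preview of how the six steps fit together, here is a small sketch that uses only the Python standard library. In practice these steps are usually done with libraries such as pandas and scikit-learn; the toy rows, the column meanings (age, city, purchased), the mean imputation, the min-max scaling, and the 25% test split below are all assumptions chosen for illustration.

```python
# 1. Importing the libraries (standard library only in this sketch).
import random
from statistics import mean

# 2. Reading the data set: a hypothetical inline table stands in for a file.
#    Columns are (age, city, purchased); None marks a missing value.
rows = [
    (25.0, "Pune", 0),
    (None, "Delhi", 1),
    (35.0, "Pune", 1),
    (45.0, "Delhi", 0),
]

# 3. Checking and handling missing values: fill a missing age with the mean age.
known_ages = [age for age, _, _ in rows if age is not None]
mean_age = mean(known_ages)
ages = [mean_age if age is None else age for age, _, _ in rows]

# 4. Encoding: one-hot encode the categorical city column.
cities = sorted({city for _, city, _ in rows})
encoded = [[1 if city == c else 0 for c in cities] for _, city, _ in rows]

# 5. Feature scaling: min-max scale age into the range [0, 1].
lo, hi = min(ages), max(ages)
scaled = [(age - lo) / (hi - lo) for age in ages]

# Assemble the feature matrix X and the label vector y.
X = [[s] + e for s, e in zip(scaled, encoded)]
y = [label for _, _, label in rows]

# 6. Splitting the data set: shuffle with a fixed seed, hold out 25% for testing.
indices = list(range(len(X)))
random.Random(0).shuffle(indices)
n_test = max(1, int(0.25 * len(X)))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

print(len(X_train), len(X_test))
```

Scaling is computed on the whole toy table here only to keep the sketch short; on real data it is safer to fit the scaling on the training rows and then apply it to the test rows.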


Yash Joshi

A young and dynamic learner with the focus to gain knowledge in the data-driven world.
