Data Science in a nutshell

Jimmy Pang
9 min read · Mar 21, 2019
“Data Science”, along with the other buzz words

It has been widely discussed in the industry: the term “Data Science” itself is poorly defined. Loads and loads of people want to break into the field, desperately trying to figure out what it means, and get overwhelmed by buzzwords like Deep Learning, Machine Learning, Algorithms, Big Data, etc.

After all the pitches convincing investors to pour more and more money into tech companies that use “data science”, maybe now is a good time for some “disenchantment”.

Buzz Words in Data Science

To have a more solid discussion, let’s list out some of the most popular buzzwords in the data science field:
1, “Data Science” (OF COURSE)
2, “Big Data”
3, “Analytics”
4, “Business Intelligence”
5, “Machine Learning”
6, “Deep Learning”
7, “Reinforcement Learning”

The list goes on and basically never ends. So let’s just focus on these words first.

“Data Science”

It is a set of disciplines and practices that handle data in a scientific manner. The foundations of data science are mostly Mathematics and Statistics, and sometimes Software Engineering plays a role as well.

Usually, it includes, but is not limited to, these steps:
1, Data acquisition
2, Data cleaning & wrangling
3, Data Engineering
4, Utilization of data

1, Data acquisition

To play with data, you need to have data first.

The data could come as .CSV files or Excel files (.XLSX), it could be available through API calls, or someone has already done the dirty work and it all sits in a DWH (Data Warehouse) or a Data Lake.

When making API calls, the application would usually return the desired data in the typical JSON format. Plenty of object-oriented languages come with packages & libraries to parse the JSON into a usable format. (Image from W3Schools)
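To make this concrete, here is a minimal Python sketch of an API call returning JSON, using the requests library. The endpoint URL and response shape are made up for illustration:

```python
# A minimal sketch of fetching data via an API and parsing the JSON body.
# The endpoint and the "users" response shape are hypothetical.
import requests

response = requests.get("https://api.example.com/v1/users")  # hypothetical endpoint
response.raise_for_status()            # fail loudly on HTTP errors

data = response.json()                 # parse JSON into Python dicts/lists
for user in data["users"]:             # assumed response shape: {"users": [...]}
    print(user["id"], user["name"])
```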

In addition, web scraping is one way to go as well. Web scraping is useful when you need to acquire data for which CSV exports and API calls are not viable (especially for business competitors’ data). My best friend Anastasia Reusova wrote an article on this topic. Feel free to check it out if you are interested in the details.
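For a taste of what scraping looks like, here is a minimal sketch with requests and BeautifulSoup. The URL and the CSS class are hypothetical, and a real scraper should be tailored to the target page (and respect its robots.txt):

```python
# A minimal web scraping sketch: download a page and pull out elements.
# The URL and the ".price" CSS selector are made up for illustration.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text  # hypothetical page
soup = BeautifulSoup(html, "html.parser")

# Extract the text of every element matching the assumed CSS class
prices = [tag.get_text(strip=True) for tag in soup.select(".price")]
print(prices)
```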

2, Data cleaning & wrangling

In the actual working environment, data is rarely ready to be used right away.

Data could be in a nested structure for storage-saving purposes (e.g. BigQuery and NoSQL), or it could be normalized (like in general relational databases). Such data needs to be flattened before usage.
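As an illustration, here is a minimal sketch of flattening nested records into a table with pandas. The record structure is made up:

```python
# Flatten nested records into flat columns with pandas.json_normalize.
# The "user"/"orders" structure here is invented for illustration.
import pandas as pd

records = [
    {"user": {"id": 1, "name": "Alice"}, "orders": 3},
    {"user": {"id": 2, "name": "Bob"}, "orders": 5},
]

# Nested "user" fields become flat columns: user.id, user.name
df = pd.json_normalize(records)
print(df)
```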

Or the application generating the data went haywire and polluted it recently. That would require sanity checks and data cleaning as well.

This is usually the most time-consuming and painful part for Data Scientists, Analysts and Data Engineers. Roughly 60–80% of a data person’s time goes to this part, 10% goes to meetings & presentations, and only the rest is left for the sexiest part: playing with algorithms.

3, Data Engineering

Simply put, the point of Data Engineering is building a stable and reliable architecture for data storage and consumption. It is an extremely vital part, as all kinds of data science tricks depend on it.

If there is no good data to use, NONE of the data science gimmicks could work.

Well-engineered data should be clean, fast and cheap (in terms of computational resources, hence money as well) for robust consumption right away.

4, Utilization of data

This is the sexy part that people can’t stop talking about.

From traditional consumption like building applications, doing ad hoc analysis and dashboarding, to recent fancy things such as Machine Learning, this is the step people care about the most.

There will be more articles covering this part, stay tuned for more ;)

“Big Data”

There are various definitions online and on Wikipedia. Yet due to the rapidly changing nature of the tech industry itself, it is much safer to simply define it by its size & structure.

Traditional data is mostly tabular data. The most layman-friendly example is a worksheet in Excel:

A table in a Worksheet in Excel

As for “Big Data”, there are other formats of data as well: images, videos, voice recordings, etc. Good examples would be Google Photos for image data, Netflix for videos, and voice recordings (songs) on Spotify.

And of course, “Big Data” would be BIG. Big enough that a regular laptop cannot process it. Hence, solutions like Google BigQuery, Hadoop and MapReduce were introduced.
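To give a feel for the MapReduce idea without a cluster, here is a toy word-count sketch in plain Python. Real systems like Hadoop distribute the map and reduce steps across many machines, but the logic is the same:

```python
# Toy MapReduce-style word count: map each document to (word, 1) pairs,
# then reduce the partial counts into totals. Purely illustrative.
from collections import Counter

documents = ["big data is big", "data science needs data"]

# Map step: each document independently emits (word, 1) pairs
mapped = [[(word, 1) for word in doc.split()] for doc in documents]

# Reduce step: merge the partial counts from every document
counts = Counter()
for pairs in mapped:
    for word, n in pairs:
        counts[word] += n

print(counts)  # Counter({'data': 3, 'big': 2, 'is': 1, ...})
```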

“Analytics” & “Business Intelligence”

By nature, “Analytics” and “Business Intelligence” are very alike when it comes to practices in a business context.

Let’s look at the definition on Wikipedia. Here is the definition of “Analytics”:

Analytics is the discovery, interpretation, and communication of meaningful patterns in data; and the process of applying those patterns towards effective decision making

And here is the definition of “Business Intelligence”:

Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current and predictive views of business operations.

Couldn’t tell the difference? EXACTLY, because they are not so different in practical terms.

The very essence of “Analytics”/“Business Intelligence” is dividing the available data in a fair manner, comparing the segments, interrogating the data with questions and interpreting the results.

One of the popular tools for data interrogation is SQL (Structured Query Language). SQL comes in a lot of dialects, including the trendy PostgreSQL, MySQL, MSSQL (which I personally really hate :/), T-SQL and so on. (NoSQL, despite the name, is a different family of databases rather than a SQL dialect.)

On the other hand, Python is a nice choice as well. The choice between SQL and Python varies a lot, and it highly depends on what the desired outcome is. In general, SQL is used for dashboard building due to its fast and simple nature. Python allows users to be more flexible: they could extend the usage from analysis to data cleaning and pushing the data to a data storage, for instance.
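As a small illustration, here is the same “interrogation” expressed both ways: the SQL version as a comment and the pandas equivalent in Python. The table and column names are made up:

```python
# The same group-by question in SQL (comment) and in pandas.
# "orders", "country" and "revenue" are invented for illustration.
import pandas as pd

orders = pd.DataFrame({
    "country": ["DE", "DE", "FR"],
    "revenue": [100, 250, 80],
})

# SQL version (exact syntax varies by dialect):
#   SELECT country, SUM(revenue) AS total_revenue
#   FROM orders
#   GROUP BY country;

# pandas equivalent:
totals = orders.groupby("country", as_index=False)["revenue"].sum()
print(totals)
```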

“Machine Learning”

Machine Learning is like a human learning new knowledge: humans pick up new knowledge through experience, machines pick it up from data.

Machine Learning aims at building algorithms, and these algorithms can be divided into 2 broad categories: Supervised Learning and Unsupervised Learning.

An iconic example of a Supervised Learning algorithm would be Linear Regression, and Logistic Regression is widely used as well. But most people are usually introduced to Decision Trees first, as they are intuitive for a human being to understand.

Typical Linear Regression: using X to predict Y. The classic formula is Y = aX + b (where a is a calculated weight for X, and b is a constant)
Logistic Regression returns a number between 0 and 1, which is actually a probability. Typical use case: banks predicting how likely a customer is to repay a loan
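Here is a minimal scikit-learn sketch of both regressions on made-up data, just to show the shape of the workflow:

```python
# Minimal sketch: fit Linear and Logistic Regression on invented data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4]])          # a single feature

# Linear Regression: learn Y = aX + b on a continuous target
lin = LinearRegression().fit(X, np.array([2.1, 3.9, 6.2, 8.1]))
print(lin.coef_, lin.intercept_)             # the learned a and b

# Logistic Regression: output a probability for a binary label
# (e.g. 1 = loan repaid, 0 = default)
log = LogisticRegression().fit(X, np.array([0, 0, 1, 1]))
print(log.predict_proba([[2.5]]))            # [P(class 0), P(class 1)]
```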

Unsupervised Learning is not as widely discussed as Supervised Learning, but that doesn’t mean it is rare. K-means clustering and dimensionality reduction are some good examples of Unsupervised Learning.

K-means clustering: the goal here is to find similarities between data points and to group them into clusters
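A minimal clustering sketch with scikit-learn, on a handful of made-up 2D points. Note that no labels are handed to the algorithm:

```python
# Minimal k-means sketch: group unlabeled points into 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]])  # invented data

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which cluster each point joined, e.g. [0 0 1 1]
print(kmeans.cluster_centers_)  # the center of each discovered group
```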

The key difference between Supervised Learning and Unsupervised Learning is whether labeled data is involved. In Supervised Learning, labeled data is used to train the algorithm and validate the results; in Unsupervised Learning, the data is not labeled and is fed to the algorithm directly.

There is also a thing known as Semi-Supervised Learning. Basically, it uses partly labeled and partly unlabeled data to make predictions.

“Deep Learning”

Deep Learning has a lot of connections with Machine Learning, but they are not exactly the same thing in practice.

Machine Learning focuses on tabular data, while Deep Learning handles untraditional data such as images, text and voice recordings. Deep Learning algorithms power computer vision, speech recognition, NLP (Natural Language Processing) and so on. The famous language learning app Duolingo, for instance, uses a lot of Deep Learning to “teach languages better”.

NLP and Deep Learning models
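For a taste of what Deep Learning code looks like, here is a minimal Keras sketch of a tiny neural network for MNIST-style 28x28 grayscale images. It is a toy architecture for illustration, not a production model:

```python
# A tiny neural network for classifying 28x28 grayscale digit images.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # image -> flat vector
    tf.keras.layers.Dense(128, activation="relu"),    # one hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one score per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training would look like this (MNIST ships with Keras):
# (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
# model.fit(x_train / 255.0, y_train, epochs=3)
```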

“Reinforcement Learning”

This is a relatively new area in the Machine Learning field. Reinforcement Learning focuses on maximizing rewards by taking action(s) in a given environment.

Personally, I am not very familiar with Reinforcement Learning yet, so I will skip the detailed explanation for now.

Resources that people in the data science field must know

Here is the list:

1, Jupyter Notebook
2, Kaggle
3, Github
4, StackOverflow
5, Data Studio/Tableau
6, Quora
7, Medium
8, Google/DuckDuckGo

Jupyter Notebook/JupyterLab

Logo of Project Jupyter

Jupyter Notebook is a very famous and popular tool for doing all kinds of things to data in an interactive environment (IPython). It supports JUlia, PYThon and R (hence the name).

To use Jupyter Notebook/JupyterLab, it is suggested to install Anaconda first. Anaconda is an environment management tool that manages the packages and environments your code is executed in.

JupyterLab was released about two years ago and offers strictly more functionality than Jupyter Notebook. It appears Project Jupyter intends to replace the Notebook with JupyterLab in the long run.

Kaggle

Kaggle is known as the dominating platform for data science competitions, public datasets, and kernel sharing. It offers learning resources as well. From my point of view, Kaggle is the fastest way to pick up Python and data science skills (not exactly very solid though, LOL).

Kaggle is also my favorite place to go for data science-related topics. So if one wants to stay tuned in the data science field, they should definitely spend some time on Kaggle.

Github

Github Logo

GitHub is a hosting platform for Git, the version control tool broadly used in software engineering. It is used in data warehousing and data science as well.

It is pretty important to get yourself familiar with GitHub. Data Engineers and Software Engineers use GitHub a lot to share code and demonstrate their knowledge and skills to the industry. GitHub is also considered a showcase and a portfolio for potential employers; it is probably one of the most important tools for tech people.

StackOverflow

A great place for all kinds of technical questions. Whenever a question comes to mind, just throw the keywords there; probably someone has already asked it before.

Questions appropriate for StackOverflow:
1, “How to properly index a table in Redshift?”
2, “What is the method to convert a string to upper case in Python?”
3, “How to integrate a GIS package into a PostgreSQL database?”

Data Studio/Tableau

They are both Business Intelligence solutions: users can build dashboards with Data Studio and Tableau. Data Studio is totally free to use and Tableau offers Tableau Public, so they are good products for those who wish to learn dashboard building and for NGO use cases.

Looker and MicroStrategy are in the market as well. They are emerging and taking market share from Tableau step by step.

Quora

Quora is a question & answer platform as well. Unlike StackOverflow, Quora is more for questions about principles and directions.

Questions suitable for Quora:
1, “What are some example algorithms of Supervised Learning?”
2, “Should I get a Master’s degree to get a job in data science?”
3, “Is a Mathematics background or a Computer Science background better for becoming a Data Scientist?”

Medium

This is exactly the platform you are on now :)

There are plenty of good authors writing quality content on Medium, data science topics included. You could follow authors like Andrew Ng and Susan Li to keep yourself updated on the industry.

Google/DuckDuckGo

No kidding, search engines are always powerful whenever you have a question in mind. The trick is to know the exact keywords for your problem.

One more thing

Lastly, the most important thing: always practice. Only practice (especially the kind that comes with painful and traumatizing struggles!) will bring you the most memorable learning experiences.

Happy adventure in Data Science!

About the Author

Jimmy Pang is a Business Intelligence Analyst specialized in data visualization, currently pursuing the journey of Data Science.

