Data Science Interview Questions 2023
- What is NumPy?
NumPy is a Python library for fast numerical computing. It provides an efficient n-dimensional array type (ndarray) along with vectorized mathematical functions that operate on whole arrays at once. NumPy arrays can be used as an alternative to lists in many numerical situations.
- What is the advantage of NumPy arrays over lists?
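NumPy arrays generally use less memory and are faster than lists for numerical work, because they store elements of a single data type in one contiguous block and run their operations in compiled code.
They support multi-dimensional data directly through a shape, whereas a list can only represent more than one dimension by nesting lists inside lists.
Both lists and arrays can be sliced with the standard Python slicing syntax, but arrays can also be reshaped, indexed along several axes at once, and used in element-wise arithmetic. With lists, those operations require explicit loops or list comprehensions.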
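As a small illustration of these points, the sketch below compares a list with an array built from the same values. The variable names are made up for the example, and it assumes NumPy is installed.

```python
import numpy as np

values = list(range(6))          # plain Python list: [0, 1, 2, 3, 4, 5]
arr = np.array(values)           # the same values as a NumPy array

# Element-wise arithmetic: a list needs a comprehension, an array does not.
doubled_list = [v * 2 for v in values]
doubled_arr = arr * 2

# Reshape the flat array into 2 rows x 3 columns, then slice along both axes.
grid = arr.reshape(2, 3)
second_column = grid[:, 1]       # every row, column at position 1

print(doubled_arr)
print(grid.shape)
print(second_column)
```

The same column selection on a nested list would need something like `[row[1] for row in nested]`.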
- Differentiate between univariate, bivariate, and multivariate analysis.
Univariate analysis examines one variable at a time. The variable can be categorical or numerical, and the goal is to describe its distribution (for example, its mean, median, spread, or category counts).
Bivariate analysis examines two variables together to study the relationship between them (e.g., the correlation between two numerical variables, or the difference in the mean response between treatment groups).
Multivariate analysis examines three or more variables at the same time. Its primary purpose is to understand how several variables jointly relate to one another or to an outcome, which a one- or two-variable view cannot capture.
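The three levels of analysis can be sketched with pandas on a tiny made-up dataset. The column names and numbers below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data: study hours, hours of sleep, and exam score.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "sleep": [8, 7, 7, 6, 5],
    "score": [52, 58, 65, 71, 80],
})

# Univariate: summarize one variable on its own.
mean_score = df["score"].mean()

# Bivariate: measure the relationship between two variables.
hours_score_corr = df["hours"].corr(df["score"])

# Multivariate: look at all pairwise correlations among three variables.
corr_matrix = df.corr()

print(mean_score)
print(round(hours_score_corr, 3))
print(corr_matrix.round(2))
```

A correlation matrix is only one simple multivariate view. Techniques such as multiple regression or PCA go further.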
- What is the difference between the use of iloc and loc?
The difference between iloc and loc is that loc selects rows and columns by label, while iloc selects them by integer position. With loc, df.loc[2] returns the row whose index label is 2, and a slice such as df.loc[1:3] includes both endpoints. With iloc, df.iloc[2] returns the third row regardless of its label, positions start at 0, and a slice such as df.iloc[1:3] excludes the end position, as in normal Python slicing.
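A minimal sketch of label-based versus position-based selection, using a made-up DataFrame whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=[10, 20, 30],              # labels, not positions
)

by_label = df.loc[20, "name"]        # row labeled 20, column labeled "name"
by_position = df.iloc[0, 0]          # first row, first column

label_slice = df.loc[10:20]          # label slice: both endpoints included
position_slice = df.iloc[0:1]        # position slice: end position excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note that `df.loc[0]` would raise a `KeyError` here, because no row carries the label 0.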
- What is the difference between the Pandas series and Pandas Dataframe?
A Pandas DataFrame is a two-dimensional, labeled data structure for tabular data held in memory. It has rows and columns, each column can have a different data type, and the Pandas library provides a high-level interface to manipulate and analyze it.
df = pd.DataFrame()
A Pandas Series, on the other hand, is a one-dimensional labeled array that can hold values of any data type (integers, floats, strings, Python objects, and so on). It has N values and a matching index of N labels, but no columns. In other words, it is like a list with an index attached, and each column of a DataFrame is a Series.
s = pd.Series()
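To make the relationship concrete, here is a short sketch with invented data showing that a Series is one-dimensional, a DataFrame is two-dimensional, and a single DataFrame column is itself a Series:

```python
import pandas as pd

# A Series: one column of values plus an index.
s = pd.Series([3.5, 4.0, 2.5], index=["a", "b", "c"], name="rating")

# A DataFrame: several columns sharing the same index.
df = pd.DataFrame(
    {"rating": [3.5, 4.0, 2.5], "votes": [10, 25, 7]},
    index=["a", "b", "c"],
)

column = df["rating"]    # selecting one column gives back a Series

print(s.ndim, df.ndim)
print(type(column).__name__)
print(df.shape)
```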
- What are the ACID properties in SQL?
Atomicity: A transaction is a set of operations that is treated as a single unit: either all of them take effect or none of them do. If any step fails, the whole transaction is rolled back and leaves no partial changes behind.
Consistency: A transaction must take the database from one valid state to another, so that all defined rules (constraints, cascades, triggers) still hold after it commits.
Isolation: Transactions that run concurrently must not interfere with each other. Each one behaves as if it were running alone, and its intermediate changes are not visible to other transactions until it commits.
Durability: Once a transaction has been committed, its changes are permanent and survive subsequent failures such as a crash or power loss.
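Atomicity in particular is easy to demonstrate. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table and the transfer helper are hypothetical names chosen for the example, and other database engines expose transactions through different APIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# Alice only has 100, so the debit violates the CHECK constraint.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Although the credit to bob ran before the failing debit, the rollback undoes it, so neither balance changes.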
- Difference between DDL and DML
DDL stands for Data Definition Language, while DML stands for Data Manipulation Language. The main difference between the two is that DDL defines and changes the structure of the database (tables, columns, indexes), whereas DML works with the data stored in that structure by adding, changing, or removing rows.
DDL: CREATE, ALTER, DROP
DML: INSERT, UPDATE, DELETE
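The statements listed above can be tried out with Python's built-in sqlite3 module. This is only a sketch on a made-up users table, and the exact DDL syntax varies somewhat between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and change the structure.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# DML: add, change, and remove rows inside that structure.
conn.execute("INSERT INTO users (name, email) VALUES ('Ann', 'ann@example.com')")
conn.execute("INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')")
conn.execute("UPDATE users SET name = 'Robert' WHERE name = 'Bob'")
conn.execute("DELETE FROM users WHERE name = 'Ann'")
conn.commit()

rows = conn.execute("SELECT id, name, email FROM users").fetchall()
print(rows)

# DDL again: remove the table itself, not just its rows.
conn.execute("DROP TABLE users")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
```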
- What are Constraints?
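SQL constraints are rules that limit what data can go into a table, ensuring the accuracy and reliability of the data in the table. Common examples are NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK, and DEFAULT. Constraints can be either column-level or table-level. A column-level constraint is declared as part of one column's definition and applies to that column, while a table-level constraint is declared separately and can involve several columns.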
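A brief sketch of both kinds of constraint, again using sqlite3 with an invented enrollments table: NOT NULL and CHECK are written at column level, while the UNIQUE constraint spans two columns and so is written at table level. The try_insert helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,                        -- column-level
        course     TEXT    NOT NULL,                        -- column-level
        grade      INTEGER CHECK (grade BETWEEN 0 AND 100), -- column-level
        UNIQUE (student_id, course)                         -- table-level
    )
""")
conn.execute("INSERT INTO enrollments VALUES (1, 'stats', 88)")

def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO enrollments VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert((1, "stats", 90)),    # duplicate (student_id, course)
    try_insert((2, "stats", 150)),   # grade outside 0..100
    try_insert((2, None, 70)),       # course is NULL
    try_insert((2, "stats", 95)),    # satisfies every constraint
]
print(results)
```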
- Difference between Join and Union
A join combines columns from two or more tables into each result row by matching rows on a related column, typically with a JOIN ... ON clause. Depending on the join type (INNER, LEFT, RIGHT, FULL), unmatched rows are either dropped or kept with NULLs in the missing columns.
A union combines the rows of two or more SELECT statements into a single result set by stacking them on top of each other. The queries must return the same number of columns with compatible types. UNION removes duplicate rows, while UNION ALL keeps them.
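The contrast can be seen side by side in a small sqlite3 sketch. The customers, orders, and suppliers tables are invented for the example. The join widens each row with columns from a second table, while the union stacks rows from two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    CREATE TABLE suppliers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 'book'), (2, 1, 'pen');
    INSERT INTO suppliers VALUES (1, 'Bob'), (2, 'Cy');
""")

# JOIN: match rows across tables and combine their columns.
joined = conn.execute("""
    SELECT c.name, o.item
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# UNION: stack the rows of two queries and drop duplicates ('Bob' appears once).
unioned = conn.execute("""
    SELECT name FROM customers
    UNION
    SELECT name FROM suppliers
    ORDER BY name
""").fetchall()

print(joined)
print(unioned)
```

Bob has no orders, so the inner join leaves him out. A LEFT JOIN would keep him with a NULL item.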
- What are Nested Triggers?
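Nested triggers are a feature of SQL Server in which a trigger performs an action that fires another trigger, which may in turn fire another, and so on. SQL Server allows triggers to be nested up to 32 levels deep, and the behavior can be turned on or off with the nested triggers server configuration option. This can be useful when a change to one table should cascade into audit or summary tables that have triggers of their own.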
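The idea of one trigger firing another can be sketched without a SQL Server instance. The example below uses SQLite through Python's sqlite3 module as a stand-in, so the trigger syntax is SQLite's rather than T-SQL, and all table and trigger names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE order_log (order_id INTEGER, amount INTEGER);
    CREATE TABLE log_counter (entries INTEGER);
    INSERT INTO log_counter VALUES (0);

    -- First trigger: every new order writes a row to the log table.
    CREATE TRIGGER log_order AFTER INSERT ON orders
    BEGIN
        INSERT INTO order_log VALUES (NEW.id, NEW.amount);
    END;

    -- Second trigger: fired by the insert that the first trigger performs.
    CREATE TRIGGER count_log AFTER INSERT ON order_log
    BEGIN
        UPDATE log_counter SET entries = entries + 1;
    END;

    INSERT INTO orders (amount) VALUES (20);
    INSERT INTO orders (amount) VALUES (35);
""")

log_rows = conn.execute(
    "SELECT order_id, amount FROM order_log ORDER BY order_id"
).fetchall()
entries = conn.execute("SELECT entries FROM log_counter").fetchone()[0]
print(log_rows, entries)
```

The application only inserts into orders. The counter is updated two levels down, by a trigger that was fired by another trigger.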
- What is a Confusion Matrix?
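A confusion matrix is a table that summarizes the prediction results of a classification model. For a problem with n classes it is an n*n matrix in which one axis represents the actual classes and the other the predicted classes, so each cell counts how often a given actual class was predicted as a given class. For binary classification, the four cells are the true positives, false positives, true negatives, and false negatives, from which metrics such as accuracy, precision, and recall are computed.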
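A minimal pure-Python sketch of building a confusion matrix and deriving metrics from it. The confusion_matrix helper and the label lists are written for this example, and the rows-as-actual, columns-as-predicted layout is one common convention, not the only one.

```python
def confusion_matrix(actual, predicted, labels):
    """Return an n x n list of lists: rows are actual classes, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical binary labels: 1 = positive, 0 = negative.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted, labels=[0, 1])
(tn, fp), (fn, tp) = matrix       # unpack the four cells of the 2 x 2 matrix

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(matrix)
print(accuracy, precision, recall)
```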
- What is the difference between long-format data and wide-format data?
In wide format, each subject or entity occupies a single row, and each variable or repeated measurement gets its own column (for example, one column per year). In long format, each row holds a single observation, so the same subject appears in several rows, with one column naming the variable and another holding its value.
Wide format is usually easier to read and is common in spreadsheets and reports, while long format is usually preferred for grouping, plotting, and statistical modeling, because new measurements are added as rows without changing the columns.
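Converting between the two layouts is a common task. The sketch below uses pandas `melt` and `pivot` on a made-up table of student scores.

```python
import pandas as pd

# Wide format: one row per student, one column per subject.
wide = pd.DataFrame({
    "student": ["Ann", "Bob"],
    "math": [90, 72],
    "physics": [85, 78],
})

# Wide -> long: one row per (student, subject) observation.
long = wide.melt(id_vars="student", var_name="subject", value_name="score")

# Long -> wide: spread the subjects back out into columns.
back_to_wide = long.pivot(index="student", columns="subject", values="score")

print(long)
print(back_to_wide)
```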
- Why is Python used for Data Cleaning in DS?
Python is used for Data Cleaning in data science because its syntax is readable and its libraries cover most of the essential cleaning and transformation operations, such as handling missing values, removing duplicates, and converting data types.
Python has excellent support through the Pandas and NumPy libraries, which provide data structures and mathematical and statistical routines for data manipulation and analysis. The extensive list of other libraries available for Python also helps to achieve quick results when needed.
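As a rough sketch of what typical cleaning steps look like with pandas, the example below drops rows with a missing name, removes duplicates, trims whitespace, converts a text column to integers, and fills a missing value. The data and the "unknown" placeholder are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "name": [" Ann", "Bob ", "Bob ", None],
    "age": ["31", "25", "25", "40"],
    "city": [None, "Paris", "Paris", "Rome"],
})

clean = (
    raw.dropna(subset=["name"])          # drop rows with no name
       .drop_duplicates()                # drop exact duplicate rows
       .assign(
           name=lambda d: d["name"].str.strip(),       # trim whitespace
           age=lambda d: d["age"].astype(int),         # text -> integer
           city=lambda d: d["city"].fillna("unknown"), # fill missing city
       )
       .reset_index(drop=True)
)

print(clean)
```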
- What is a normal distribution?
The normal distribution is a continuous probability distribution that can be used to model various random variables. It is one of the most commonly used probability distributions in statistics, economics, and finance, in part because the central limit theorem makes sums and averages of many independent variables approximately normal.
The normal distribution is also called the Gaussian distribution; the two names refer to the same distribution. It is fully described by its mean and its variance, is symmetric about the mean, and forms a bell curve when plotted, with a total area under the curve equal to 1. About 68% of values fall within one standard deviation of the mean and about 95% within two.
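These properties can be checked numerically with `NormalDist` from Python's standard library. The sketch uses the standard normal (mean 0, standard deviation 1). The variable names are illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

# Height of the bell curve at its peak (the mean).
peak_density = standard.pdf(0.0)

# Probability mass within one and two standard deviations of the mean.
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
within_two_sd = standard.cdf(2.0) - standard.cdf(-2.0)

print(round(peak_density, 4))
print(round(within_one_sd, 4))
print(round(within_two_sd, 4))
```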
- What is logistic regression?
Logistic regression is a statistical technique for modeling the probability of a binary outcome from one or more input variables. For example, it can predict the probability of an event, such as whether a customer will buy your product.
The main idea behind logistic regression is to pass a weighted sum of the inputs through the logistic (sigmoid) function, which maps any real number to a value between 0 and 1. The weights are fitted so that the predicted probabilities match the observed outcomes as closely as possible, usually by maximum likelihood, and a threshold such as 0.5 turns a probability into a class label.
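To show the mechanics, here is a bare-bones sketch of one-variable logistic regression fitted by plain gradient descent in pure Python. The data, learning rate, and epoch count are arbitrary choices for illustration. In practice a library such as scikit-learn or statsmodels would normally be used instead.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and intercept b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied and whether the student passed.
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0,   0,   0,   1,   0,   1,   1,   1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)     # predicted pass probability after 1 hour
p_high = sigmoid(w * 3.5 + b)    # predicted pass probability after 3.5 hours

print(w > 0, p_low < 0.5 < p_high)
```

The positive weight means that more study hours raise the predicted probability of passing.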