Common Pandas Interview Questions for Data Scientists
Pandas is one of the most essential libraries for data manipulation in Python, especially for data scientists. If you're preparing for an interview, having a solid understanding of Pandas will help you stand out. Below, we cover some of the most frequently asked Pandas interview questions along with detailed answers and code examples.
1. What is the difference between a Pandas Series and a DataFrame?
A Series is a one-dimensional labeled array that can hold any data type, while a DataFrame is a two-dimensional table with labeled rows and columns.
Example:
import pandas as pd
# Creating a Series
data = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(data)
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)
Output:
a    10
b    20
c    30
d    40
dtype: int64
   A  B
0  1  4
1  2  5
2  3  6
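The two structures are closely related: each column of a DataFrame is itself a Series. A minimal sketch of that relationship, reusing the same small DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Selecting one column with single brackets returns a Series
col = df['A']
print(type(col).__name__)   # Series

# Selecting with a list of column names returns a DataFrame
sub = df[['A']]
print(type(sub).__name__)   # DataFrame

# The Series shares the DataFrame's row index
print(col.index.equals(df.index))   # True
```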
2. How do you select rows in Pandas using .loc and .iloc?
loc[] is used for label-based indexing, while iloc[] is used for position-based indexing.
Example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
# Selecting using loc (label-based)
print(df.loc['x'])
# Selecting using iloc (position-based)
print(df.iloc[0])
Output:
A    1
B    4
Name: x, dtype: int64
A    1
B    4
Name: x, dtype: int64
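A common follow-up question is how slicing differs between the two. The sketch below, using the same DataFrame, illustrates that loc slices include the end label while iloc slices exclude the end position. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Label slices with loc include the end label
by_label = df.loc['x':'y']        # rows x and y

# Position slices with iloc exclude the end position
by_position = df.iloc[0:1]        # row x only

# Both accept a row selector and a column selector
cell_by_label = df.loc['y', 'B']  # 5
cell_by_position = df.iloc[1, 1]  # 5

print(by_label)
print(by_position)
print(cell_by_label, cell_by_position)
```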
3. How do you check for missing values in a DataFrame?
Pandas provides isna() and its alias isnull() to check for missing values.
Example:
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.isna()) # Check for missing values
print(df.isna().sum()) # Count missing values per column
Output:
       A      B
0  False  False
1   True  False
2  False   True
A    1
B    1
dtype: int64
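Interviewers often ask for variations on this check. The following sketch, built on the same DataFrame, shows a few that may come up: confirming the two methods agree, testing whether anything is missing at all, and finding incomplete rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# isnull() is an alias of isna(), so the two masks are identical
same_mask = df.isna().equals(df.isnull())

# True if any value anywhere in the DataFrame is missing
has_missing = df.isna().any().any()

# Rows that contain at least one missing value
incomplete_rows = df[df.isna().any(axis=1)]

# Number of non-missing values per column
present_counts = df.notna().sum()

print(same_mask, has_missing)
print(incomplete_rows)
print(present_counts)
```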
4. How do you replace missing values in a DataFrame?
Use .fillna() to replace NaN values with a specific value, such as the mean of a column.
Example:
# Assign back; inplace=True on a selected column may not update df in recent pandas
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
Output:
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  NaN
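Filling with a single value is only one option. As a rough sketch of some alternatives, the example below fills each column with its own value, forward fills, and drops incomplete rows. The fill value of 0 for column B is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# A dict maps each column to its own fill value
filled = df.fillna({'A': df['A'].mean(), 'B': 0})

# Forward fill carries the last observed value downward
forward_filled = df.ffill()

# dropna() removes rows with any missing value instead of filling them
dropped = df.dropna()

print(filled)
print(forward_filled)
print(dropped)
```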
5. How do you sort a DataFrame by multiple columns?
Use .sort_values() with a list of column names and a matching list of sort orders.
Example:
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22], 'Score': [85, 90, 78]})
# Sorting by Age (ascending) and Score (descending)
sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_df)
Output:
      Name  Age  Score
2  Charlie   22     78
0    Alice   25     85
1      Bob   30     90
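In the data above every age is distinct, so the second sort key never comes into play. The sketch below uses hypothetical data in which two people share an age, which makes the effect of the secondary column visible.

```python
import pandas as pd

# Hypothetical data: Alice and Bob share the same age
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 25, 22],
    'Score': [85, 90, 78],
})

# Age ascending first; within equal ages, Score descending
sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_df)

# ignore_index=True renumbers the rows 0..n-1 after sorting
renumbered = df.sort_values(
    by=['Age', 'Score'], ascending=[True, False], ignore_index=True
)
print(renumbered)
```

Here Bob is listed before Alice because both are 25 and Bob has the higher score.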
6. How do you perform groupby operations in Pandas?
Use .groupby() to group data and apply aggregation functions.
Example:
df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT'], 'Salary': [50000, 70000, 45000, 80000]})
# Grouping by department and calculating the mean salary
print(df.groupby('Department')['Salary'].mean())
Output:
Department
HR    47500.0
IT    75000.0
Name: Salary, dtype: float64
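A likely follow-up is how to compute several aggregates at once, or how to attach a group statistic back to each row. A minimal sketch on the same data follows; the output column names (mean_salary, max_salary, headcount, DeptMean) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT'],
    'Salary': [50000, 70000, 45000, 80000],
})

# Named aggregation: each keyword becomes an output column
summary = df.groupby('Department').agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'count'),
)
print(summary)

# transform() returns a result aligned to the original rows
df['DeptMean'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
```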
7. What is the difference between .apply(), .map(), and .applymap()?
.apply() works on rows/columns of a DataFrame, or element-wise on a Series.
.map() works element-wise on a Series.
.applymap() works element-wise on a DataFrame; since pandas 2.1 it is deprecated in favor of DataFrame.map().
Example:
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1) # Apply function to each element
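To see the three side by side, here is a small sketch on a hypothetical two-column DataFrame. Because the element-wise DataFrame method was renamed in pandas 2.1, the code picks whichever name the installed version provides; that version check is an assumption about your environment, not something a specific pandas release requires you to write.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# DataFrame.apply passes whole columns (axis=0) or whole rows (axis=1)
col_range = df.apply(lambda col: col.max() - col.min())          # one value per column
row_totals = df.apply(lambda row: row['A'] + row['B'], axis=1)   # one value per row

# Series.map transforms one element at a time; it also accepts a dict
labels = df['A'].map({1: 'low', 2: 'mid', 3: 'high'})

# Element-wise over the whole DataFrame: DataFrame.map in pandas >= 2.1,
# DataFrame.applymap in older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared = elementwise(lambda x: x ** 2)

print(col_range)
print(row_totals)
print(labels)
print(squared)
```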
8. How do you count unique values in a column?
Use .nunique() or .value_counts().
Example:
print(df['Department'].nunique()) # Count unique values
print(df['Department'].value_counts()) # Count occurrences of each unique value
Output:
2
Department
HR    2
IT    2
Name: count, dtype: int64
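Missing values change how these methods behave, which is a frequent source of confusion. The sketch below uses a hypothetical column with one missing entry to show the difference, along with value_counts() returning proportions.

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', None]})

# unique() returns the distinct values themselves, including missing ones
distinct = df['Department'].unique()

# nunique() skips missing values unless dropna=False
n_without_nan = df['Department'].nunique()            # 2
n_with_nan = df['Department'].nunique(dropna=False)   # 3

# normalize=True returns proportions instead of counts
shares = df['Department'].value_counts(normalize=True)

print(distinct)
print(n_without_nan, n_with_nan)
print(shares)
```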
9. How do you merge two DataFrames?
Use .merge() to combine on key columns or .join() to combine on the index.
Example:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Score': [85, 90, 78]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
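When every ID appears in both DataFrames, the join type makes no difference. The sketch below changes the data so the keys only partly overlap (a hypothetical variation) and compares inner, left, and outer merges, plus the index-based join().

```python
import pandas as pd

# Hypothetical data: ID 3 has no score and ID 4 has no name
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 78]})

# Default how='inner' keeps only IDs present in both: 1 and 2
inner = pd.merge(df1, df2, on='ID')

# how='left' keeps every row of df1; Charlie's Score becomes NaN
left = pd.merge(df1, df2, on='ID', how='left')

# how='outer' keeps all IDs; indicator=True adds a _merge column showing the source
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

# join() aligns on the index, so set ID as the index first (left join by default)
joined = df1.set_index('ID').join(df2.set_index('ID'))

print(inner)
print(left)
print(outer)
print(joined)
```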
10. How do you convert categorical variables into numeric format?
Use pd.get_dummies() for one-hot encoding, or .astype('category') with .cat.codes for integer codes.
Example:
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
dummies = pd.get_dummies(df, columns=['Gender'])
print(dummies)
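Converting to the category dtype does not by itself produce numbers; the integer codes come from the .cat accessor. A minimal sketch of that route on the same data, where the GenderCode column name is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})

# Convert to the category dtype; categories are sorted: ['Female', 'Male']
df['Gender'] = df['Gender'].astype('category')

# cat.codes gives each value's integer position among the categories
df['GenderCode'] = df['Gender'].cat.codes
print(df)

# drop_first=True drops the first dummy column, which the others already imply
dummies = pd.get_dummies(df[['Gender']], drop_first=True)
print(dummies)
```

Integer codes imply an ordering, so one-hot encoding is usually the safer choice for unordered categories fed to a model.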
11. How do you optimize memory usage in Pandas?
Convert columns to smaller data types, such as float32 instead of float64.
Example:
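df = pd.DataFrame({'Salary': [50000.0, 70000.0, 45000.0, 80000.0]})
df['Salary'] = df['Salary'].astype('float32')  # 4 bytes per value instead of 8
df.info()  # info() prints its report directly and returns None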
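To show that a conversion actually helps, it is worth measuring memory before and after. The sketch below does this on hypothetical data with a repeated string column and small integers; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: a repeated string column and small integers
df = pd.DataFrame({
    'Department': ['HR', 'IT'] * 5000,
    'Age': [25, 40] * 5000,
})
before = df.memory_usage(deep=True).sum()

# With the category dtype, each distinct string is stored only once
df['Department'] = df['Department'].astype('category')

# downcast picks the smallest integer type that holds the values
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')

after = df.memory_usage(deep=True).sum()
print(before, after)
print(df.dtypes)
```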
12. How do you work with large datasets in Pandas?
Use the chunksize parameter when reading files to process the data in pieces, or use a library such as Dask.
Example:
chunks = pd.read_csv('large_data.csv', chunksize=10000)  # iterator of DataFrames
for part in chunks:
    print(part.head())
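In practice each chunk usually contributes to a running result rather than being printed. The sketch below computes a mean this way; a small in-memory CSV with a made-up value column stands in for a large file so the example runs on its own.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a large file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 101)))

total = 0
rows = 0
# Each iteration yields a DataFrame of at most 25 rows
for part in pd.read_csv(csv_data, chunksize=25):
    total += part['value'].sum()
    rows += len(part)

# Mean computed without holding all rows in memory at once
print(total / rows)
```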
Conclusion
Mastering Pandas is crucial for data science interviews. Understanding indexing, grouping, merging, handling missing values, and optimizing performance will help you excel in your role. Keep practicing on real-world datasets and exploring more advanced topics like time-series analysis and performance optimization.
For more Pandas tutorials, check out the official Pandas documentation.