Common Pandas Interview Questions for Data Scientists

Pandas is one of the most essential libraries for data manipulation in Python, especially for data scientists. If you're preparing for an interview, having a solid understanding of Pandas will help you stand out. Below, we cover some of the most frequently asked Pandas interview questions along with detailed answers and code examples.

1. What is the difference between a Pandas Series and a DataFrame?

A Series is a one-dimensional labeled array that can hold any data type, while a DataFrame is a two-dimensional table with labeled rows and columns.

Example:

import pandas as pd

# Creating a Series
data = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(data)

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df)

Output:

a    10
b    20
c    30
d    40
dtype: int64
   A  B
0  1  4
1  2  5
2  3  6
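
The two types are closely related: each column of a DataFrame is itself a Series. The short sketch below illustrates this with the same DataFrame; the variable names col and sub are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

col = df['A']    # single brackets: one column comes back as a Series
sub = df[['A']]  # double brackets: a one-column DataFrame

print(type(col).__name__)    # Series
print(type(sub).__name__)    # DataFrame
print(col.shape, sub.shape)  # (3,) (3, 1)
```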

2. How do you select rows in Pandas using .loc and .iloc?

loc[] is used for label-based indexing, while iloc[] is used for position-based indexing.

Example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Selecting using loc (label-based)
print(df.loc['x'])

# Selecting using iloc (position-based)
print(df.iloc[0])

Output:

A 1 
B 4 
Name: x, dtype: int64 
A 1 
B 4 
Name: x, dtype: int64
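
A common follow-up question is how slicing differs between the two. As a minimal sketch on the same DataFrame: label slices with loc include both endpoints, while position slices with iloc exclude the stop position, as in ordinary Python slicing.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# Label slice: both 'x' and 'y' are included
by_label = df.loc['x':'y']

# Position slice: position 1 is excluded, so only the first row is returned
by_position = df.iloc[0:1]

# Row and column can be selected together
single_value = df.loc['y', 'B']

print(len(by_label))     # 2
print(len(by_position))  # 1
print(single_value)      # 5
```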

3. How do you check for missing values in a DataFrame?

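Pandas provides isna() and its alias isnull() to check for missing values. Both return a Boolean mask with the same shape as the input, where True marks a missing entry (NaN, None, or NaT).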

Example:

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})
print(df.isna())  # Check for missing values
print(df.isna().sum())  # Count missing values per column

Output:

       A      B
0  False  False
1   True  False
2  False   True
A    1
B    1
dtype: int64
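
Interviewers often ask for the rows that contain a missing value rather than the mask itself. One possible approach, sketched on the same data, combines isna() with any(axis=1); the names rows_with_gaps and total_missing are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# Keep only the rows where at least one column is missing
rows_with_gaps = df[df.isna().any(axis=1)]

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isna().sum().sum())

print(list(rows_with_gaps.index))  # [1, 2]
print(total_missing)               # 2
```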

4. How do you replace missing values in a DataFrame?

Use .fillna() to replace NaN values with a specific value, such as the mean of a column.

Example:

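# Assign the result back; fillna(inplace=True) on a selected column is chained assignment and is deprecated
df['A'] = df['A'].fillna(df['A'].mean())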
print(df)

Output:

     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  NaN
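
Note that column B still contains a NaN in the output above. As a rough sketch of two alternatives, fillna() also accepts a dictionary of per-column fill values, and dropna() removes incomplete rows instead of filling them. The fill value 0 for column B is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# Fill each column with its own value: the mean for A, 0 for B
filled = df.fillna({'A': df['A'].mean(), 'B': 0})

# Alternatively, drop every row that has at least one missing value
dropped = df.dropna()

print(filled['A'].tolist())  # [1.0, 2.0, 3.0]
print(filled['B'].tolist())  # [4.0, 5.0, 0.0]
print(len(dropped))          # 1
```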

5. How do you sort a DataFrame by multiple columns?

Use .sort_values() with the column names and sorting order.

Example:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22], 'Score': [85, 90, 78]})

# Sorting by Age (ascending) and Score (descending)
sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_df)

Output:

      Name  Age  Score
2  Charlie   22     78
0    Alice   25     85
1      Bob   30     90
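
Because every Age in the example above is distinct, the second sort key never comes into play. The sketch below uses made-up data with tied ages (the extra name Dana is hypothetical) to show Score breaking ties in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 25, 30],
    'Score': [85, 90, 92, 70],
})

# Age ascending; within the same Age, higher Score first
sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])

print(sorted_df['Name'].tolist())  # ['Charlie', 'Alice', 'Bob', 'Dana']
```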

6. How do you perform groupby operations in Pandas?

Use .groupby() to group data and apply aggregation functions.

Example:

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT'], 'Salary': [50000, 70000, 45000, 80000]})

# Grouping by department and calculating the mean salary
print(df.groupby('Department')['Salary'].mean())

Output:

Department 
HR 47500.0 
IT 75000.0 
Name: Salary, dtype: float64
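
A likely follow-up is computing several statistics per group at once. As a small sketch on the same data, .agg() accepts a list of aggregation names and returns one column per statistic.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT'],
    'Salary': [50000, 70000, 45000, 80000],
})

# One row per department, one column per aggregation
summary = df.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])
print(summary)

print(summary.loc['HR', 'mean'])  # 47500.0
print(summary.loc['IT', 'max'])   # 80000
```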

7. What is the difference between .apply(), .map(), and .applymap()?

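.apply() applies a function along an axis of a DataFrame, that is, to each column or each row. On a Series, it calls the function on each element.

.map() works element-wise on a Series and also accepts a dictionary or another Series as a lookup.

.applymap() works element-wise on a DataFrame. It is deprecated since pandas 2.1 in favor of DataFrame.map().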

Example:

# Series.apply calls the function on each element; the vectorized df['Salary'] * 1.1 gives the same result faster
df['Salary'] = df['Salary'].apply(lambda x: x * 1.1)
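
To make the contrast concrete, here is a minimal sketch that uses all three on one small DataFrame. The lookup dictionary and variable names are illustrative, and the hasattr check is only there so the snippet also runs on pandas versions before 2.1, where DataFrame.map() does not exist.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# DataFrame.apply: the function receives a whole column (default) or a whole row (axis=1)
col_totals = df.apply(lambda col: col.sum())
row_totals = df.apply(lambda row: row.sum(), axis=1)

# Series.map: each element is looked up in the dictionary
labels = df['A'].map({1: 'low', 2: 'mid', 3: 'high'})

# Element-wise over the whole DataFrame: DataFrame.map on pandas >= 2.1, applymap before that
elementwise = df.map if hasattr(df, 'map') else df.applymap
doubled = elementwise(lambda x: x * 2)

print(col_totals.tolist())    # [6, 15]
print(row_totals.tolist())    # [5, 7, 9]
print(labels.tolist())        # ['low', 'mid', 'high']
print(doubled['B'].tolist())  # [8, 10, 12]
```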

8. How do you count unique values in a column?

Use .nunique() to count the number of distinct values and .value_counts() to count how often each value occurs.

Example:

print(df['Department'].nunique())  # Count unique values
print(df['Department'].value_counts())  # Count occurrences of each unique value

Output:

2 
Department 
HR 2 
IT 2 
Name: count, dtype: int64
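
Both methods ignore missing values by default, which is worth mentioning in an interview. The sketch below, using a made-up Series with one missing entry, shows the dropna and normalize options.

```python
import pandas as pd

s = pd.Series(['HR', 'IT', 'HR', None])

distinct = s.nunique()                       # missing values are not counted
distinct_with_na = s.nunique(dropna=False)   # count the missing value as its own category
shares = s.value_counts(normalize=True)      # relative frequencies among non-missing values

print(distinct)          # 2
print(distinct_with_na)  # 3
print(shares)
```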

9. How do you merge two DataFrames?

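Use pd.merge() (or the DataFrame.merge() method) to combine DataFrames on one or more key columns, or .join() to combine them on their index. merge() performs an inner join by default; the how parameter selects left, right, or outer joins.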

Example:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Score': [85, 90, 78]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
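
When the keys do not match exactly, the join type matters. The following sketch uses slightly different, made-up IDs in the second DataFrame to compare inner, left, and outer merges, and shows an index-based .join() for comparison.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 78, 88]})

# Inner join (default): only IDs present in both frames (2 and 3)
inner = pd.merge(df1, df2, on='ID')

# Left join: every row of df1; Alice has no match, so her Score is NaN
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: all IDs from both frames; indicator=True adds a _merge column naming the source
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

# join() aligns on the index and performs a left join by default
joined = df1.set_index('ID').join(df2.set_index('ID'))

print(len(inner), len(left), len(outer), len(joined))  # 2 3 4 3
```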

10. How do you convert categorical variables into numeric format?

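Use pd.get_dummies() for one-hot encoding, or .astype('category') followed by .cat.codes to get integer codes. Converting to the category dtype alone does not produce numbers.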

Example:

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
dummies = pd.get_dummies(df, columns=['Gender'])
print(dummies)
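
For the second approach, a minimal sketch looks like this. The column name Gender_code is only illustrative. Categories are ordered alphabetically by default, so Female maps to 0 and Male to 1.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})

# Convert to the category dtype, then read the integer code of each value
df['Gender'] = df['Gender'].astype('category')
df['Gender_code'] = df['Gender'].cat.codes

print(list(df['Gender'].cat.categories))  # ['Female', 'Male']
print(df['Gender_code'].tolist())         # [1, 0, 1]
```

Integer codes imply an ordering, so one-hot encoding is usually the safer choice for nominal variables in linear models.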

11. How do you optimize memory usage in Pandas?

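Convert columns to smaller data types: downcast numeric columns (for example, float64 to float32) and store low-cardinality string columns as category.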

Example:

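df = pd.DataFrame({'Salary': [50000.0, 70000.0, 45000.0, 80000.0]})
df['Salary'] = df['Salary'].astype('float32')  # float64 -> float32 halves the memory used by the column
df.info()  # info() prints its report directly and returns None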
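
To show the effect rather than assert it, the sketch below measures memory before and after conversion with memory_usage(deep=True). The data is made up, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: 10,000 rows with only two distinct department names
df = pd.DataFrame({
    'Department': ['HR', 'IT'] * 5000,
    'Salary': [50000.0, 70000.0] * 5000,
})

before = df.memory_usage(deep=True).sum()

# Repeated strings are stored once as categories plus small integer codes
df['Department'] = df['Department'].astype('category')

# downcast='float' picks the smallest float type that can hold the values (float32 at minimum)
df['Salary'] = pd.to_numeric(df['Salary'], downcast='float')

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(after < before)  # True
```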

12. How do you work with large datasets in Pandas?

Pass chunksize when reading files so the data is processed in pieces instead of loaded all at once, or use a parallel library such as Dask.

Example:

# With chunksize, read_csv returns an iterator that yields DataFrames of up to 10,000 rows each
chunks = pd.read_csv('large_data.csv', chunksize=10000)
for part in chunks:
    print(part.head())
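
In practice, each chunk is usually reduced to a small partial result that is combined at the end. The sketch below is one way to do that; it reads from an in-memory string instead of a real file so that it is self-contained, and the column names and chunk size of 2 are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk
csv_data = io.StringIO(
    "Department,Salary\n"
    "HR,50000\n"
    "IT,70000\n"
    "HR,45000\n"
    "IT,80000\n"
)

totals = None
for part in pd.read_csv(csv_data, chunksize=2):
    # Reduce each chunk to per-department sums
    partial = part.groupby('Department')['Salary'].sum()
    # Combine with the running totals; fill_value=0 handles departments missing from one side
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals['HR'])  # 95000
print(totals['IT'])  # 150000
```

Only one chunk and the running totals are held in memory at a time, which is what makes this pattern suitable for files larger than RAM.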

Conclusion

Mastering Pandas is crucial for data science interviews. Understanding indexing, grouping, merging, handling missing values, and optimizing performance will help you excel in your role. Keep practicing on real-world datasets and exploring more advanced topics like time-series analysis and performance optimization.

For more Pandas tutorials, check out the official Pandas documentation.
