Pandas Apply: How to Use .apply() on DataFrames and Series
Every data analysis project hits a point where built-in pandas methods are not enough. You need to normalize phone numbers with country-specific rules, categorize rows using business logic that lives in your head, or compute a derived column that depends on three other columns at once. Writing a for loop over a DataFrame feels wrong -- and it is. The pandas apply() method bridges the gap between pandas' built-in vectorized operations and the arbitrary Python functions you need to run on your data.
The problem is that apply() is misused more often than it is used correctly. Developers reach for it out of habit when a vectorized operation would run 50x faster. Others avoid it entirely because they heard it is slow, missing cases where apply is the right and only tool.
This guide covers the full picture: how apply() works on both Series and DataFrames, what the axis parameter actually controls, how to pass extra arguments, when to use lambda vs named functions, the critical performance differences compared to vectorized alternatives, and real-world examples you can drop into production code.
What .apply() Does
The apply() method runs a function on each element of a Series, or on each row or column of a DataFrame. It takes your custom function, calls it repeatedly, and collects the results into a new Series or DataFrame.
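Before diving into the signatures, here is the idea in miniature -- a throwaway Series and DataFrame, not the sample data used later in this guide:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Series.apply calls the function once per element
doubled = s.apply(lambda x: x * 2)
print(doubled.tolist())  # [2, 4, 6]

mini = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# DataFrame.apply calls the function once per column (axis=0, the default)
col_sums = mini.apply(lambda col: col.sum())
print(col_sums.tolist())  # [3, 30]
```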
Series.apply() Signature
Series.apply(func, convert_dtype=True, args=(), **kwargs)
| Parameter | Description | Default |
|---|---|---|
func | Function to apply to each element | Required |
convert_dtype | Try to infer a better dtype for results (deprecated since pandas 2.1) | True
args | Positional arguments to pass after the element | () |
**kwargs | Additional keyword arguments passed to func | -- |
DataFrame.apply() Signature
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
| Parameter | Description | Default |
|---|---|---|
func | Function to apply along the specified axis | Required |
axis | 0 = apply to each column, 1 = apply to each row | 0 |
raw | Pass ndarray instead of Series to function (faster) | False |
result_type | 'expand', 'reduce', or 'broadcast' for controlling output shape | None |
args | Positional arguments to pass after the column/row | () |
**kwargs | Additional keyword arguments passed to func | -- |
The return type of DataFrame.apply() depends on what your function returns: if it returns a scalar, apply() produces a Series; if it returns a Series, apply() produces a DataFrame.
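A quick sketch of that rule on a toy frame (the column names here are made up):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Scalar per column -> the result collapses to a Series indexed by column name
scalars = toy.apply(lambda col: col.max())
print(type(scalars).__name__)  # Series

# Series per column -> the result expands to a DataFrame
frames = toy.apply(lambda col: pd.Series({'lo': col.min(), 'hi': col.max()}))
print(type(frames).__name__)  # DataFrame
```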
Sample Data for All Examples
Every code example in this guide uses the following DataFrame. Copy this block first and run it in your notebook.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'name': ['Alice Johnson', 'bob smith', 'CHARLIE BROWN', 'Diana Prince', 'Eve Torres'],
'email': ['alice@company.com', 'BOB@Test.Co', ' charlie@gmail.com ', 'diana@company.com', 'eve@unknown'],
'salary': [75000, 82000, 65000, 91000, 70000],
'department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Marketing'],
'years_exp': [5, 8, 3, 12, 6],
'rating': [4.2, 3.8, 4.5, 4.9, 3.1]
})
print(df)
Output:
name email salary department years_exp rating
0 Alice Johnson alice@company.com 75000 Engineering 5 4.2
1 bob smith BOB@Test.Co 82000 Marketing 8 3.8
2 CHARLIE BROWN charlie@gmail.com 65000 Engineering 3 4.5
3 Diana Prince diana@company.com 91000 Sales 12 4.9
4 Eve Torres eve@unknown 70000 Marketing 6 3.1
Apply on a Series
Series.apply() runs a function on every element in a single column. It is the most common form of apply and the simplest to understand.
Simple Transformations
# Convert salaries from annual to monthly
df['monthly_salary'] = df['salary'].apply(lambda x: x / 12)
print(df[['name', 'salary', 'monthly_salary']])
Output:
name salary monthly_salary
0 Alice Johnson 75000 6250.000000
1 bob smith 82000 6833.333333
2 CHARLIE BROWN 65000 5416.666667
3 Diana Prince 91000 7583.333333
4 Eve Torres 70000 5833.333333
String Operations
# Proper case for names
df['name_clean'] = df['name'].apply(lambda x: x.strip().title())
print(df[['name', 'name_clean']])
Output:
name name_clean
0 Alice Johnson Alice Johnson
1 bob smith Bob Smith
2 CHARLIE BROWN Charlie Brown
3 Diana Prince Diana Prince
4 Eve Torres Eve Torres
Conditional Logic
# Classify employees by experience level
def experience_level(years):
if years < 4:
return 'Junior'
elif years < 8:
return 'Mid'
else:
return 'Senior'
df['level'] = df['years_exp'].apply(experience_level)
print(df[['name_clean', 'years_exp', 'level']])
Output:
name_clean years_exp level
0 Alice Johnson 5 Mid
1 Bob Smith 8 Senior
2 Charlie Brown 3 Junior
3 Diana Prince 12 Senior
4 Eve Torres 6 Mid
Using map() or dict for Simple Lookups
When the transformation is a direct value-to-value mapping, Series.map() is cleaner and often faster:
# map with a dictionary
dept_codes = {'Engineering': 'ENG', 'Marketing': 'MKT', 'Sales': 'SLS'}
df['dept_code'] = df['department'].map(dept_codes)
print(df[['department', 'dept_code']])
Output:
department dept_code
0 Engineering ENG
1 Marketing MKT
2 Engineering ENG
3 Sales SLS
4 Marketing MKT
Apply on a DataFrame
DataFrame.apply() passes an entire column or an entire row to your function, depending on the axis parameter. This is where many developers get confused.
axis=0: Column-Wise (Default)
With axis=0, pandas passes each column as a Series to your function. The result is a Series indexed by column names.
# Get the range (max - min) of each numeric column
def column_range(col):
return col.max() - col.min()
ranges = df[['salary', 'years_exp', 'rating']].apply(column_range, axis=0)
print(ranges)
Output:
salary 26000.0
years_exp 9.0
rating 1.8
dtype: float64
Another example -- compute multiple statistics per column:
def column_stats(col):
return pd.Series({
'mean': col.mean(),
'std': col.std(),
'min': col.min(),
'max': col.max()
})
stats = df[['salary', 'years_exp', 'rating']].apply(column_stats, axis=0)
print(stats)
Output:
salary years_exp rating
mean 76600.00 6.80 4.10
std 9939.82 3.27 0.70
min 65000.00 3.00 3.10
max 91000.00 12.00 4.90
axis=1: Row-Wise
With axis=1, pandas passes each row as a Series to your function. Each Series has the column names as its index, so you access columns by name.
# Calculate a performance bonus based on rating and salary
def calc_bonus(row):
if row['rating'] >= 4.5:
return row['salary'] * 0.15
elif row['rating'] >= 4.0:
return row['salary'] * 0.10
else:
return row['salary'] * 0.05
df['bonus'] = df.apply(calc_bonus, axis=1)
print(df[['name_clean', 'salary', 'rating', 'bonus']])
Output:
name_clean salary rating bonus
0 Alice Johnson 75000 4.2 7500.0
1 Bob Smith 82000 3.8 4100.0
2 Charlie Brown 65000 4.5 9750.0
3 Diana Prince 91000 4.9 13650.0
4 Eve Torres 70000 3.1 3500.0
A Clear Mental Model for axis
This table clears up confusion around the axis parameter:
| Parameter | What is passed to func | Direction | Typical use |
|---|---|---|---|
axis=0 | Each column (as a Series) | Vertical -- down the rows | Aggregations (mean, range, custom stats) |
axis=1 | Each row (as a Series) | Horizontal -- across the columns | Row-level logic using multiple columns |
Think of it this way: axis=0 collapses along axis 0 (rows), so the function receives a column. axis=1 collapses along axis 1 (columns), so the function receives a row.
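A two-column toy frame makes the rule visible; the numbers are arbitrary:

```python
import pandas as pd

tiny = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0: the function receives each COLUMN, so there is one result per column
per_column = tiny.apply(sum, axis=0)
print(per_column.tolist())  # [6, 60]  (sums of x and y)

# axis=1: the function receives each ROW, so there is one result per row
per_row = tiny.apply(sum, axis=1)
print(per_row.tolist())  # [11, 22, 33]
```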
Lambda Functions with apply
Lambda functions are anonymous, inline functions defined with the lambda keyword. They work well for short, one-line transformations.
Common Lambda Patterns
# Numeric transformation
df['salary_k'] = df['salary'].apply(lambda x: f"${x/1000:.0f}K")
# Boolean flag
df['high_performer'] = df['rating'].apply(lambda x: x >= 4.0)
# String extraction -- get first name
df['first_name'] = df['name_clean'].apply(lambda x: x.split()[0])
print(df[['name_clean', 'salary_k', 'high_performer', 'first_name']])
Output:
name_clean salary_k high_performer first_name
0 Alice Johnson $75K True Alice
1 Bob Smith $82K False Bob
2 Charlie Brown $65K True Charlie
3 Diana Prince $91K True Diana
4 Eve Torres $70K False Eve
Multi-Column Access in Row-Wise Lambda
When you need data from multiple columns, use axis=1 and access the row like a dictionary:
# Create a summary string from multiple columns
df['summary'] = df.apply(
lambda row: f"{row['name_clean']} ({row['department']}) - {row['level']}",
axis=1
)
print(df['summary'])
Output:
0 Alice Johnson (Engineering) - Mid
1 Bob Smith (Marketing) - Senior
2 Charlie Brown (Engineering) - Junior
3 Diana Prince (Sales) - Senior
4 Eve Torres (Marketing) - Mid
Name: summary, dtype: object
When Lambda Gets Too Long, Use a Named Function
If your lambda spans multiple lines or needs if/else chains, switch to a named function. The rule is simple: if you cannot read the lambda in 5 seconds, write a proper function.
# Too complex for lambda -- use a named function
def compensation_tier(row):
base = row['salary']
exp = row['years_exp']
rate = row['rating']
score = (base / 10000) + (exp * 2) + (rate * 5)
if score > 40:
return 'Tier 1'
elif score > 30:
return 'Tier 2'
else:
return 'Tier 3'
df['comp_tier'] = df.apply(compensation_tier, axis=1)
print(df[['name_clean', 'salary', 'years_exp', 'rating', 'comp_tier']])
Output:
name_clean salary years_exp rating comp_tier
0 Alice Johnson 75000 5 4.2 Tier 2
1 Bob Smith 82000 8 3.8 Tier 1
2 Charlie Brown 65000 3 4.5 Tier 2
3 Diana Prince 91000 12 4.9 Tier 1
4 Eve Torres 70000 6 3.1 Tier 2
Passing Extra Arguments
Sometimes your function needs additional parameters beyond the element or row. Use args for positional arguments and **kwargs for keyword arguments.
Positional Arguments with args
def apply_raise(salary, pct_raise, min_raise):
"""Apply percentage raise with a minimum floor."""
raise_amount = salary * (pct_raise / 100)
return salary + max(raise_amount, min_raise)
# 5% raise with minimum $4000
df['new_salary'] = df['salary'].apply(apply_raise, args=(5, 4000))
print(df[['name_clean', 'salary', 'new_salary']])
Output:
name_clean salary new_salary
0 Alice Johnson 75000 79000.0
1 Bob Smith 82000 86100.0
2 Charlie Brown 65000 69000.0
3 Diana Prince 91000 95550.0
4 Eve Torres 70000 74000.0
Keyword Arguments with **kwargs
def format_salary(salary, currency='USD', decimals=0):
"""Format salary with currency symbol."""
symbols = {'USD': '$', 'EUR': '\u20ac', 'GBP': '\u00a3'}
symbol = symbols.get(currency, currency)
return f"{symbol}{salary:,.{decimals}f}"
# Pass keyword arguments directly
df['salary_formatted'] = df['salary'].apply(
format_salary, currency='USD', decimals=2
)
print(df[['name_clean', 'salary_formatted']])
Output:
name_clean salary_formatted
0 Alice Johnson $75,000.00
1 Bob Smith $82,000.00
2 Charlie Brown $65,000.00
3 Diana Prince $91,000.00
4 Eve Torres $70,000.00
Combining args and kwargs
def compute_tax(salary, tax_rate, deduction=0):
"""Calculate tax after deductions."""
taxable = max(salary - deduction, 0)
return round(taxable * tax_rate, 2)
# tax_rate via args, deduction via kwargs
df['tax'] = df['salary'].apply(compute_tax, args=(0.22,), deduction=12000)
print(df[['name_clean', 'salary', 'tax']])
Output:
name_clean salary tax
0 Alice Johnson 75000 13860.0
1 Bob Smith 82000 15400.0
2 Charlie Brown 65000 11660.0
3 Diana Prince 91000 17380.0
4 Eve Torres 70000 12760.0
apply() vs map() vs applymap() vs transform()
Pandas has several methods that look similar. Choosing the wrong one leads to slow code or unexpected results. Here is the definitive comparison.
Comparison Table
| Method | Works On | Input to Function | Returns | Primary Use |
|---|---|---|---|---|
Series.apply() | One column | Each element | Series | Custom element-wise transformation |
Series.map() | One column | Each element (or dict/Series for lookup) | Series | Value substitution, simple mapping |
DataFrame.apply(axis=0) | DataFrame | Each column (as Series) | Series or DataFrame | Column-level aggregation |
DataFrame.apply(axis=1) | DataFrame | Each row (as Series) | Series or DataFrame | Row-level logic across columns |
DataFrame.map() | DataFrame | Each element | DataFrame | Element-wise on entire DataFrame (pandas 2.1+) |
DataFrame.applymap() | DataFrame | Each element | DataFrame | Same as DataFrame.map() (deprecated in 2.1) |
DataFrame.transform() | DataFrame | Each column or group | Same shape as input | Must return same-length output; used with groupby |
When to Use Each
# 1. Series.apply() -- custom logic per element
df['salary_grade'] = df['salary'].apply(
lambda x: 'A' if x > 85000 else 'B' if x > 70000 else 'C'
)
# 2. Series.map() -- direct substitution from a dictionary
grade_labels = {'A': 'Executive', 'B': 'Manager', 'C': 'Staff'}
df['grade_label'] = df['salary_grade'].map(grade_labels)
# 3. DataFrame.apply(axis=0) -- aggregate each column
summary = df[['salary', 'years_exp']].apply(np.mean, axis=0)
# 4. DataFrame.apply(axis=1) -- row-wise logic
df['score'] = df.apply(
lambda r: r['rating'] * 10 + r['years_exp'], axis=1
)
# 5. DataFrame.map() -- element-wise on entire DataFrame (pandas >= 2.1)
numeric_df = df[['salary', 'years_exp']].map(lambda x: x * 1.1)
# 6. transform() -- must preserve shape, often with groupby
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
print(df[['name_clean', 'salary', 'salary_grade', 'grade_label', 'score', 'dept_avg_salary']])
Output:
name_clean salary salary_grade grade_label score dept_avg_salary
0 Alice Johnson 75000 B Manager 47.0 70000.0
1 Bob Smith 82000 B Manager 46.0 76000.0
2 Charlie Brown 65000 C Staff 48.0 70000.0
3 Diana Prince 91000 A Executive 61.0 91000.0
4 Eve Torres 70000 C Staff 37.0 76000.0
Key Differences at a Glance
- apply vs map: map() accepts dictionaries and Series for lookups; apply() only accepts callables. For simple mappings, map() is more readable and slightly faster.
- apply vs transform: transform() must return output with the same length as its input; apply() can return any shape. Use transform() inside groupby() when you want per-group computations broadcast back to the original rows.
- apply vs applymap: applymap() is deprecated since pandas 2.1. Use DataFrame.map() instead for element-wise operations on an entire DataFrame.
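One practical consequence worth remembering: with a dictionary, map() turns unmatched values into NaN, while apply() plus dict.get lets you choose a fallback. A sketch with made-up labels:

```python
import pandas as pd

depts = pd.Series(['Engineering', 'Marketing', 'Legal'])
codes = {'Engineering': 'ENG', 'Marketing': 'MKT'}

# map(): anything missing from the dict becomes NaN
mapped = depts.map(codes)
print(mapped.tolist())  # ['ENG', 'MKT', nan]

# apply() with dict.get: pick your own default for missing keys
applied = depts.apply(lambda v: codes.get(v, 'OTHER'))
print(applied.tolist())  # ['ENG', 'MKT', 'OTHER']
```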
Real-World Examples
Data Cleaning Pipeline
This example demonstrates a typical data cleaning workflow where apply handles operations that lack clean vectorized alternatives.
import re
# Raw messy data
raw_df = pd.DataFrame({
'date_str': ['2024-01-15', 'Jan 20, 2024', '15/02/2024', '2024.03.10', 'March 5 2024'],
'phone': ['(555) 123-4567', '555.987.6543', '555 456 7890', '+1-555-222-3333', '5551114444'],
'amount': ['$1,234.56', '2345.67', '$890', '1,100.00', '$45.5'],
'status': [' Active ', 'INACTIVE', 'active', 'Pending', ' ACTIVE']
})
print("Before cleaning:")
print(raw_df)
# Step 1: Normalize dates
def parse_date(date_str):
"""Try multiple date formats."""
formats = ['%Y-%m-%d', '%b %d, %Y', '%d/%m/%Y', '%Y.%m.%d', '%B %d %Y']
for fmt in formats:
try:
return pd.to_datetime(date_str, format=fmt)
except ValueError:
continue
return pd.NaT
raw_df['date_parsed'] = raw_df['date_str'].apply(parse_date)
# Step 2: Normalize phone numbers
def clean_phone(phone):
"""Extract digits and format consistently."""
digits = re.sub(r'\D', '', phone)
if len(digits) == 11 and digits.startswith('1'):
digits = digits[1:] # Remove country code
if len(digits) == 10:
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
return phone # Return original if unexpected format
raw_df['phone_clean'] = raw_df['phone'].apply(clean_phone)
# Step 3: Parse currency amounts
def parse_amount(amount_str):
"""Remove currency symbols and commas, convert to float."""
cleaned = re.sub(r'[$,]', '', str(amount_str))
try:
return float(cleaned)
except ValueError:
return np.nan
raw_df['amount_parsed'] = raw_df['amount'].apply(parse_amount)
# Step 4: Normalize status (this one is better vectorized)
raw_df['status_clean'] = raw_df['status'].str.strip().str.lower()
print("\nAfter cleaning:")
print(raw_df[['date_parsed', 'phone_clean', 'amount_parsed', 'status_clean']])
Output:
After cleaning:
date_parsed phone_clean amount_parsed status_clean
0 2024-01-15 (555) 123-4567 1234.56 active
1 2024-01-20 (555) 987-6543 2345.67 inactive
2 2024-02-15 (555) 456-7890 890.00 active
3 2024-03-10 (555) 222-3333 1100.00 pending
4 2024-03-05 (555) 111-4444 45.50 active
Notice that step 4 uses .str accessor methods (vectorized) instead of apply -- that is deliberate. Use apply only where vectorized methods fall short. For a deeper dive into handling NaN results from cleaning operations, see the pandas fillna guide.
Feature Engineering for Machine Learning
# Employee data for ML feature engineering
ml_df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
'salary': [75000, 82000, 65000, 91000, 70000, 95000, 58000, 88000],
'years_exp': [5, 8, 3, 12, 6, 15, 2, 10],
'department': ['Eng', 'Mkt', 'Eng', 'Sales', 'Mkt', 'Eng', 'Sales', 'Eng'],
'rating': [4.2, 3.8, 4.5, 4.9, 3.1, 4.0, 3.5, 4.7],
'last_promotion_years_ago': [1, 3, 0, 2, 4, 1, 5, 2]
})
# Feature 1: Salary per year of experience
ml_df['salary_per_exp'] = ml_df.apply(
lambda r: r['salary'] / max(r['years_exp'], 1), axis=1
)
# Feature 2: Composite engagement score
def engagement_score(row):
"""Combine rating, tenure, and promotion recency into a single score."""
rating_score = row['rating'] / 5.0 # Normalize to 0-1
tenure_score = min(row['years_exp'] / 15.0, 1.0) # Cap at 15 years
promo_score = max(1 - (row['last_promotion_years_ago'] / 5.0), 0)
# Weighted combination
return round(rating_score * 0.5 + tenure_score * 0.3 + promo_score * 0.2, 3)
ml_df['engagement'] = ml_df.apply(engagement_score, axis=1)
# Feature 3: Salary deviation from department average
dept_avg = ml_df.groupby('department')['salary'].transform('mean')
ml_df['salary_vs_dept'] = ((ml_df['salary'] - dept_avg) / dept_avg * 100).round(1)
# Feature 4: Risk flag combining multiple signals
def attrition_risk(row):
"""Flag employees at risk of leaving."""
risk_factors = 0
if row['rating'] < 3.5:
risk_factors += 1
if row['last_promotion_years_ago'] >= 4:
risk_factors += 1
if row['salary_vs_dept'] < -10:
risk_factors += 1
if row['years_exp'] >= 5 and row['last_promotion_years_ago'] >= 3:
risk_factors += 1
if risk_factors >= 3:
return 'High'
elif risk_factors >= 2:
return 'Medium'
else:
return 'Low'
ml_df['risk'] = ml_df.apply(attrition_risk, axis=1)
print(ml_df[['name', 'salary_per_exp', 'engagement', 'salary_vs_dept', 'risk']])
Output:
name salary_per_exp engagement salary_vs_dept risk
0 Alice 15000.00 0.68 -7.1 Low
1 Bob 10250.00 0.62 7.9 Low
2 Charlie 21666.67 0.71 -19.5 Low
3 Diana 7583.33 0.85 22.1 Low
4 Eve 11666.67 0.47 -7.9 High
5 Frank 6333.33 0.86 17.6 Low
6 Grace 29000.00 0.39 -22.1 Medium
7 Henry 8800.00 0.79 9.0 Low
Returning Multiple Columns from apply
When a function produces several outputs, return a pd.Series and pandas expands it into columns automatically. You can then combine the result with the original DataFrame using pd.concat().
def salary_breakdown(row):
"""Break salary into components."""
gross = row['salary']
federal_tax = gross * 0.22
state_tax = gross * 0.05
insurance = 500 * 12 # $500/month
retirement = gross * 0.06
net = gross - federal_tax - state_tax - insurance - retirement
return pd.Series({
'federal_tax': federal_tax,
'state_tax': state_tax,
'insurance': insurance,
'retirement': retirement,
'net_pay': net
})
breakdown = ml_df.apply(salary_breakdown, axis=1)
result = pd.concat([ml_df[['name', 'salary']], breakdown], axis=1)
print(result)
Output:
name salary federal_tax state_tax insurance retirement net_pay
0 Alice 75000 16500.0 3750.0 6000.0 4500.0 44250.0
1 Bob 82000 18040.0 4100.0 6000.0 4920.0 48940.0
2 Charlie 65000 14300.0 3250.0 6000.0 3900.0 37550.0
3 Diana 91000 20020.0 4550.0 6000.0 5460.0 54970.0
4 Eve 70000 15400.0 3500.0 6000.0 4200.0 40900.0
5 Frank 95000 20900.0 4750.0 6000.0 5700.0 57650.0
6 Grace 58000 12760.0 2900.0 6000.0 3480.0 32860.0
7 Henry 88000 19360.0 4400.0 6000.0 5280.0 52960.0
The result_type Parameter
When DataFrame.apply() returns list-like objects from each row, the result_type parameter controls the output format.
| result_type | Behavior | When to use |
|---|---|---|
None (default) | Pandas infers the output type | General purpose |
'expand' | List-like results become separate columns | When function returns a list or tuple |
'reduce' | Return Series instead of DataFrame if possible | Aggregation that returns scalars |
'broadcast' | Output has same shape as input DataFrame | When you want element-level results broadcast to all rows |
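Of the three, 'broadcast' is the least intuitive. A minimal sketch on a toy frame (not the sample data): each row returns a list the same width as the frame, and pandas stretches the result back to the input's shape, keeping the original columns and index.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The row sum is repeated across both original columns
out = toy.apply(lambda row: [row.sum(), row.sum()], axis=1, result_type='broadcast')
print(list(out.columns))  # ['a', 'b']
print(out['a'].tolist())  # [4, 6]
```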
# Without result_type -- returns a Series of lists
result_default = df[['salary', 'years_exp']].apply(
lambda row: [row['salary'], row['salary'] * 1.1],
axis=1
)
print("Default:")
print(result_default)
print(type(result_default.iloc[0]))
# With result_type='expand' -- splits list into columns
result_expand = df[['salary', 'years_exp']].apply(
lambda row: [row['salary'], row['salary'] * 1.1],
axis=1,
result_type='expand'
)
result_expand.columns = ['current', 'projected']
print("\nExpand:")
print(result_expand)
Output:
Expand:
current projected
0 75000 82500.0
1 82000 90200.0
2 65000 71500.0
3 91000 100100.0
4 70000 77000.0
Using apply() with groupby()
Combining groupby() with apply() runs your function on each group independently -- a powerful pattern for group-level transformations.
# Normalize salary within each department (z-score)
def zscore_group(group):
return (group - group.mean()) / group.std()
# group_keys=False keeps the original index so the result aligns back onto df
df['salary_zscore'] = df.groupby('department', group_keys=False)['salary'].apply(zscore_group)
print(df[['name_clean', 'department', 'salary', 'salary_zscore']])
# Custom group aggregation
def department_report(group):
return pd.Series({
'headcount': len(group),
'avg_salary': group['salary'].mean(),
'avg_rating': group['rating'].mean(),
'top_performer': group.loc[group['rating'].idxmax(), 'name_clean']
})
report = df.groupby('department').apply(department_report)
print(report)
Output:
headcount avg_salary avg_rating top_performer
department
Engineering 2 70000.0 4.35 Charlie Brown
Marketing 2 76000.0 3.45 Bob Smith
Sales 1 91000.0 4.90 Diana Prince
Using Progress Bars with tqdm
Long-running apply operations can feel like a black box. The tqdm library adds a progress bar with one line of code.
from tqdm import tqdm
# Register tqdm with pandas
tqdm.pandas()
# Now use progress_apply instead of apply
# df['result'] = df['column'].progress_apply(slow_function)
# Example with a simulated slow function
import time
def slow_transform(value):
time.sleep(0.01) # Simulate processing time
return value * 2
# Shows: 100%|##########| 5/5 [00:00<00:00, 98.33it/s]
# df['doubled'] = df['salary'].progress_apply(slow_transform)
Performance: When NOT to Use apply()
This section is critical. The apply() method executes a Python function in a loop, one element or row at a time. Pandas' built-in operations run in compiled C code, making them 10-100x faster.
Benchmark: apply vs Vectorized
import time
# Create a large DataFrame
n = 500_000
large_df = pd.DataFrame({
'a': np.random.randn(n),
'b': np.random.randn(n),
'c': np.random.choice(['X', 'Y', 'Z'], n)
})
# Test 1: Simple arithmetic
start = time.time()
r1 = large_df['a'].apply(lambda x: x ** 2 + 1)
t_apply = time.time() - start
start = time.time()
r2 = large_df['a'] ** 2 + 1
t_vec = time.time() - start
print(f"Squaring {n:,} values:")
print(f" apply(): {t_apply:.4f}s")
print(f" vectorized: {t_vec:.4f}s")
print(f" speedup: {t_apply/t_vec:.0f}x")
Typical output on a modern machine:
Squaring 500,000 values:
apply(): 0.1523s
vectorized: 0.0018s
speedup: 85x
# Test 2: Row-wise operation with axis=1
start = time.time()
r3 = large_df.apply(lambda row: row['a'] + row['b'], axis=1)
t_apply_row = time.time() - start
start = time.time()
r4 = large_df['a'] + large_df['b']
t_vec_row = time.time() - start
print(f"\nRow-wise addition of {n:,} rows:")
print(f" apply(axis=1): {t_apply_row:.4f}s")
print(f" vectorized: {t_vec_row:.4f}s")
print(f" speedup: {t_apply_row/t_vec_row:.0f}x")
Typical output:
Row-wise addition of 500,000 rows:
apply(axis=1): 4.8721s
vectorized: 0.0023s
speedup: 2118x
Row-wise apply with axis=1 is especially slow because pandas constructs a new Series object for every single row. If you find yourself iterating row by row, also consider whether iterrows() fits your use case -- though apply is generally faster than explicit iteration.
Vectorized Alternatives Cheatsheet
| Operation | Slow (apply) | Fast (vectorized) |
|---|---|---|
| Arithmetic | .apply(lambda x: x * 2) | col * 2 |
| Conditional | .apply(lambda x: 'A' if x > 10 else 'B') | np.where(col > 10, 'A', 'B') |
| Multiple conditions | .apply(complex_if_elif) | np.select([cond1, cond2], [val1, val2], default) |
| String case | .apply(lambda x: x.lower()) | .str.lower() |
| String contains | .apply(lambda x: 'abc' in x) | .str.contains('abc') |
| String split | .apply(lambda x: x.split(',')[0]) | .str.split(',').str[0] |
| Date year | .apply(lambda x: x.year) | .dt.year |
| Date formatting | .apply(lambda x: x.strftime('%Y-%m')) | .dt.strftime('%Y-%m') |
| Clipping values | .apply(lambda x: min(max(x, 0), 100)) | .clip(0, 100) |
| Absolute value | .apply(abs) | .abs() or np.abs(col) |
| Rounding | .apply(lambda x: round(x, 2)) | .round(2) |
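As a sanity check, the slow and fast forms from a few rows of this table really are equivalent (toy values):

```python
import numpy as np
import pandas as pd

col = pd.Series([5, 15, 25])

# Conditional: apply vs np.where
slow_cond = col.apply(lambda x: 'A' if x > 10 else 'B')
fast_cond = pd.Series(np.where(col > 10, 'A', 'B'))
print(slow_cond.tolist() == fast_cond.tolist())  # True

# Clipping: apply vs .clip()
slow_clip = col.apply(lambda x: min(max(x, 0), 20))
fast_clip = col.clip(0, 20)
print(slow_clip.tolist() == fast_clip.tolist())  # True
```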
Multi-Condition Example: np.select vs apply
# SLOW: Using apply with if/elif
def categorize_slow(row):
if row['a'] > 1 and row['b'] > 1:
return 'Both High'
elif row['a'] > 1:
return 'A High'
elif row['b'] > 1:
return 'B High'
else:
return 'Neither'
start = time.time()
large_df['cat_slow'] = large_df.apply(categorize_slow, axis=1)
t_slow = time.time() - start
# FAST: Using np.select
conditions = [
(large_df['a'] > 1) & (large_df['b'] > 1),
large_df['a'] > 1,
large_df['b'] > 1,
]
choices = ['Both High', 'A High', 'B High']
start = time.time()
large_df['cat_fast'] = np.select(conditions, choices, default='Neither')
t_fast = time.time() - start
print(f"apply(axis=1): {t_slow:.3f}s")
print(f"np.select: {t_fast:.3f}s")
print(f"Speedup: {t_slow/t_fast:.0f}x")
Typical output:
apply(axis=1): 8.234s
np.select: 0.025s
Speedup: 329x
When apply() IS Justified
Despite the performance cost, apply is the right choice in these situations:
- Complex logic with no vectorized equivalent: Functions calling external APIs, parsing nested JSON, or implementing business rules with many branches.
- Small DataFrames: Under ~10,000 rows, the speed difference is negligible and readability matters more.
- Prototyping: Write it with apply first, optimize to vectorized later if performance matters.
- Functions returning multiple values: When you need to produce several new columns from complex row-level logic, returning a pd.Series from apply is clean and practical.
- Try/except error handling: Vectorized operations raise errors on the entire column. Apply lets you handle exceptions per element.
The raw Parameter for Speed
If your function only needs the numeric values (not column names or index), set raw=True to pass NumPy arrays instead of Series objects. This avoids the overhead of constructing a Series for each call.
# Without raw: pandas constructs a Series for each row
start = time.time()
result_series = large_df[['a', 'b']].apply(lambda row: row.iloc[0] + row.iloc[1], axis=1, raw=False)
t_series = time.time() - start
# With raw: pandas passes a numpy array
start = time.time()
result_raw = large_df[['a', 'b']].apply(lambda row: row[0] + row[1], axis=1, raw=True)
t_raw = time.time() - start
print(f"raw=False: {t_series:.3f}s")
print(f"raw=True: {t_raw:.3f}s")
print(f"Speedup: {t_series/t_raw:.1f}x")
Typical output:
raw=False: 4.872s
raw=True: 0.943s
Speedup: 5.2x
Setting raw=True gives a 3-5x speedup when you do not need index or column labels. Still slower than full vectorization, but a useful middle ground.
Handling Errors Inside apply
Real-world data is messy. Your function will encounter NaN values, unexpected types, and malformed strings. Wrapping logic in try/except prevents one bad row from crashing the entire operation.
messy_data = pd.DataFrame({
'value': ['42', '3.14', 'N/A', '100', '', None, 'abc', '7.5']
})
def safe_float(val):
"""Convert to float with error handling."""
try:
return float(val)
except (ValueError, TypeError):
return np.nan
messy_data['parsed'] = messy_data['value'].apply(safe_float)
print(messy_data)
Output:
value parsed
0 42 42.00
1 3.14 3.14
2 N/A NaN
3 100 100.00
4 NaN
5 None NaN
6 abc NaN
7 7.5 7.50
Debugging apply Functions
Test your function on a single element before applying it to the entire DataFrame:
# Step 1: Test on one row
test_row = df.iloc[0]
print("Test input:", test_row.to_dict())
result = calc_bonus(test_row)
print("Test output:", result)
# Step 2: Test on a small slice
small_result = df.head(3).apply(calc_bonus, axis=1)
print("Small batch:", small_result.tolist())
# Step 3: Apply to full DataFrame
df['bonus'] = df.apply(calc_bonus, axis=1)
Common Errors and Fixes
TypeError: 'float' object is not iterable
This happens when your function returns a scalar but pandas expects a sequence, or vice versa.
# ERROR: Returning a list when pandas expects a scalar
def broken(x):
return [x, x * 2] # Returns list, assigned to single column
# This creates a column of lists, not two columns
df['result'] = df['salary'].apply(broken)
# Each cell is [salary, salary*2] -- probably not what you want
# FIX: Use result_type='expand' on DataFrame.apply
result = df[['salary']].apply(
lambda row: [row['salary'], row['salary'] * 2],
axis=1,
result_type='expand'
)
result.columns = ['original', 'doubled']
SettingWithCopyWarning
This occurs when you apply on a slice of a DataFrame rather than the original.
# WARNING: Operating on a slice
subset = df[df['department'] == 'Engineering']
subset['bonus'] = subset.apply(calc_bonus, axis=1) # SettingWithCopyWarning
# FIX: Use .loc or .copy()
subset = df[df['department'] == 'Engineering'].copy()
subset['bonus'] = subset.apply(calc_bonus, axis=1) # Clean
# Or use .loc on the original DataFrame
df.loc[df['department'] == 'Engineering', 'bonus'] = (
df[df['department'] == 'Engineering'].apply(calc_bonus, axis=1)
)
KeyError When Accessing Columns in axis=1
When using axis=1, each row is a Series with column names as index. A typo in the column name raises a KeyError.
# ERROR: Typo in column name
def broken_func(row):
return row['salry'] # Typo: should be 'salary'
# df.apply(broken_func, axis=1) # KeyError: 'salry'
# FIX: Print column names first
print(df.columns.tolist())
# ['name', 'email', 'salary', 'department', 'years_exp', 'rating', ...]
# Or use .get() for safe access
def safe_func(row):
return row.get('salary', 0) * row.get('years_exp', 1)
Function Returns Wrong Shape
When your function returns inconsistent types (sometimes scalar, sometimes Series), pandas cannot build the result properly.
# ERROR: Inconsistent return types
def inconsistent(row):
if row['salary'] > 80000:
return pd.Series({'bonus': 5000, 'flag': True})
else:
return 0 # Returns scalar for some rows, Series for others
# FIX: Always return the same type
def consistent(row):
if row['salary'] > 80000:
return pd.Series({'bonus': 5000, 'flag': True})
else:
return pd.Series({'bonus': 0, 'flag': False})
result = df.apply(consistent, axis=1)
print(result)
Visualize Your Transformed Data
After applying transformations, explore the results visually to validate your logic and uncover patterns. PyGWalker turns any pandas DataFrame into a Tableau-like drag-and-drop interface right inside Jupyter notebooks -- no chart code needed.
import pygwalker as pyg
# After applying transformations, visualize interactively
walker = pyg.walk(df)
With PyGWalker you can:
- Compare original vs transformed columns with side-by-side charts
- Validate conditional logic by filtering and grouping transformed data
- Spot outliers in computed fields through distribution plots
- Build charts by dragging columns -- no matplotlib or seaborn syntax required
For an even smoother experience, try RunCell -- an AI-powered Jupyter environment where an agent can help you write apply functions, debug errors, and generate visualizations through natural language. It is especially useful when experimenting with complex row-wise transformations.
Summary: apply() Decision Flowchart
Use this mental checklist every time you reach for apply():
- Is there a built-in pandas method? (.str, .dt, arithmetic, np.where, np.select) -- Use it. 10-100x faster.
- Is it a simple value mapping? -- Use Series.map() with a dictionary.
- Do you need the same shape back with groupby? -- Use transform().
- Is the logic complex, multi-branched, or does it involve try/except? -- Use apply(). It is the right tool here.
- Is the DataFrame small (under 10k rows)? -- Use apply(). Readability wins.
- Is it a row-wise operation on a large DataFrame? -- Think twice. Can you restructure with vectorized ops? If not, set raw=True for a partial speedup.
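Putting the checklist to work on a tiny hypothetical order table (column names invented for the example): step 1 handles the plain arithmetic, and only the multi-branch shipping rule falls through to apply().

```python
import pandas as pd

orders = pd.DataFrame({'price': [9.99, 25.0, 104.5], 'qty': [3, 1, 2]})

# Checklist step 1: simple arithmetic -> vectorized, never apply
orders['total'] = orders['price'] * orders['qty']

# Checklist step 4: multi-branch business rule -> apply is the right tool
def shipping(row):
    if row['total'] > 100:
        return 0.0       # free shipping on large orders
    if row['qty'] >= 3:
        return 4.99      # discounted bulk rate
    return 7.99

orders['shipping'] = orders.apply(shipping, axis=1)
print(orders['shipping'].tolist())  # [4.99, 7.99, 0.0]
```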
FAQ
Is pandas apply slow?
Yes, compared to vectorized operations. apply() runs a Python loop internally, which is 10-100x slower than pandas' built-in C-optimized methods. For simple math, string operations, or conditionals, always prefer vectorized alternatives like np.where(), .str accessor, or direct column arithmetic. Apply is appropriate for complex logic that has no vectorized equivalent, or for DataFrames under 10,000 rows where the difference is negligible.
What is the difference between apply and map in pandas?
Series.apply() takes any callable function and runs it on each element. Series.map() also accepts dictionaries and other Series for value substitution. For lookups (mapping one value to another), map() is more readable and often faster. For custom transformation logic, apply() is the correct choice. On DataFrames, apply() can process entire rows or columns, while map() (replacing the deprecated applymap()) works element-by-element.
How do I use apply with axis=1 to access multiple columns?
Set axis=1 and the function receives each row as a Series. Access columns by name: df.apply(lambda row: row['col_a'] + row['col_b'], axis=1). For named functions, use the same pattern: def func(row): return row['price'] * row['quantity']. Note that axis=1 is significantly slower than vectorized operations, so prefer df['col_a'] + df['col_b'] when possible.
Can I pass extra arguments to the function in apply?
Yes. Use the args parameter for positional arguments and keyword arguments for named ones: df['col'].apply(func, args=(10, 20), multiplier=1.5). Inside the function, the first parameter is the element (or row/column), followed by the positional args, then keyword args.
How do I return multiple columns from a single apply call?
Return a pd.Series with named indices from your function: return pd.Series({'col1': val1, 'col2': val2}). When used with axis=1, pandas automatically expands the result into a DataFrame with those column names. Alternatively, return a list and use result_type='expand' to split it into columns.
When should I use apply vs transform in pandas?
Use transform() when you need the output to have the exact same shape as the input -- this is common with groupby() for broadcasting group-level statistics back to individual rows (e.g., df.groupby('dept')['salary'].transform('mean')). Use apply() when the output shape can differ from the input, such as returning a single summary row per group or computing derived columns with complex logic.
Related Guides
- Pandas GroupBy: Aggregation, Transform, Apply -- use apply() with groupby for group-level transformations
- Pandas Sort Values -- sort rows by the columns you created with apply
- Pandas Reset Index -- clean up the index after filtering or applying transformations
- Pandas iterrows: When and How to Iterate Over Rows -- compare apply vs explicit row iteration
- Pandas Merge: Combining DataFrames -- join DataFrames on shared columns
- Pandas fillna: Handle Missing Values -- fill NaN values produced by apply operations
- Pandas value_counts: Count Unique Values -- quick frequency counts before or after transformations