
Pandas Apply: How to Use .apply() on DataFrames and Series


Every data analysis project hits a point where built-in pandas methods are not enough. You need to normalize phone numbers with country-specific rules, categorize rows using business logic that lives in your head, or compute a derived column that depends on three other columns at once. Writing a for loop over a DataFrame feels wrong -- and it is. The pandas apply() method bridges the gap between pandas' built-in vectorized operations and the arbitrary Python functions you need to run on your data.

The problem is that apply() is misused more often than it is used correctly. Developers reach for it out of habit when a vectorized operation would run 50x faster. Others avoid it entirely because they heard it is slow, missing cases where apply is the right and only tool.

This guide covers the full picture: how apply() works on both Series and DataFrames, what the axis parameter actually controls, how to pass extra arguments, when to use lambda vs named functions, the critical performance differences compared to vectorized alternatives, and real-world examples you can drop into production code.


What .apply() Does

The apply() method runs a function on each element of a Series, or on each row or column of a DataFrame. It takes your custom function, calls it repeatedly, and collects the results into a new Series or DataFrame.
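A minimal sketch makes this concrete (a throwaway three-element Series, independent of the sample data used later in this guide):

```python
import pandas as pd

# apply() calls the function once per element and collects the results
s = pd.Series([1, 2, 3])
result = s.apply(lambda x: x * 10)
print(result.tolist())  # [10, 20, 30]
```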

Series.apply() Signature

Series.apply(func, convert_dtype=True, args=(), **kwargs)

| Parameter     | Description                                                           | Default  |
|---------------|-----------------------------------------------------------------------|----------|
| func          | Function to apply to each element                                     | Required |
| convert_dtype | Try to infer a better dtype for results (deprecated since pandas 2.1) | True     |
| args          | Positional arguments passed to func after the element                 | ()       |
| **kwargs      | Additional keyword arguments passed to func                           | --       |

DataFrame.apply() Signature

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

| Parameter   | Description                                                    | Default  |
|-------------|----------------------------------------------------------------|----------|
| func        | Function to apply along the specified axis                     | Required |
| axis        | 0 = apply to each column, 1 = apply to each row                | 0        |
| raw         | Pass an ndarray instead of a Series to func (faster)           | False    |
| result_type | 'expand', 'reduce', or 'broadcast' to control the output shape | None     |
| args        | Positional arguments passed to func after the column/row       | ()       |
| **kwargs    | Additional keyword arguments passed to func                    | --       |

The return type depends on what your function returns: if it returns a scalar for each column or row, apply() produces a Series; if it returns a Series, apply() produces a DataFrame.
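A quick sketch of both cases on a tiny two-column frame (invented for illustration, separate from the sample data below):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Scalar per row -> apply() produces a Series
sums = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(type(sums).__name__)   # Series

# Series per row -> apply() produces a DataFrame
pairs = df.apply(lambda row: pd.Series({'lo': row.min(), 'hi': row.max()}), axis=1)
print(type(pairs).__name__)  # DataFrame
```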

Sample Data for All Examples

Every code example in this guide uses the following DataFrame. Copy this block first and run it in your notebook.

import pandas as pd
import numpy as np
 
df = pd.DataFrame({
    'name': ['Alice Johnson', 'bob smith', 'CHARLIE BROWN', 'Diana Prince', 'Eve Torres'],
    'email': ['alice@company.com', 'BOB@Test.Co', ' charlie@gmail.com ', 'diana@company.com', 'eve@unknown'],
    'salary': [75000, 82000, 65000, 91000, 70000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Marketing'],
    'years_exp': [5, 8, 3, 12, 6],
    'rating': [4.2, 3.8, 4.5, 4.9, 3.1]
})
 
print(df)

Output:

             name                email  salary   department  years_exp  rating
0   Alice Johnson    alice@company.com   75000  Engineering          5     4.2
1       bob smith          BOB@Test.Co   82000    Marketing          8     3.8
2   CHARLIE BROWN   charlie@gmail.com    65000  Engineering          3     4.5
3    Diana Prince    diana@company.com   91000        Sales         12     4.9
4      Eve Torres          eve@unknown   70000    Marketing          6     3.1

Apply on a Series

Series.apply() runs a function on every element in a single column. It is the most common form of apply and the simplest to understand.

Simple Transformations

# Convert salaries from annual to monthly
df['monthly_salary'] = df['salary'].apply(lambda x: x / 12)
print(df[['name', 'salary', 'monthly_salary']])

Output:

             name  salary  monthly_salary
0   Alice Johnson   75000     6250.000000
1       bob smith   82000     6833.333333
2   CHARLIE BROWN   65000     5416.666667
3    Diana Prince   91000     7583.333333
4      Eve Torres   70000     5833.333333

String Operations

# Proper case for names
df['name_clean'] = df['name'].apply(lambda x: x.strip().title())
print(df[['name', 'name_clean']])

Output:

             name      name_clean
0   Alice Johnson   Alice Johnson
1       bob smith       Bob Smith
2   CHARLIE BROWN   Charlie Brown
3    Diana Prince    Diana Prince
4      Eve Torres      Eve Torres

Conditional Logic

# Classify employees by experience level
def experience_level(years):
    if years < 4:
        return 'Junior'
    elif years < 8:
        return 'Mid'
    else:
        return 'Senior'
 
df['level'] = df['years_exp'].apply(experience_level)
print(df[['name_clean', 'years_exp', 'level']])

Output:

      name_clean  years_exp   level
0  Alice Johnson          5     Mid
1      Bob Smith          8  Senior
2  Charlie Brown          3  Junior
3   Diana Prince         12  Senior
4     Eve Torres          6     Mid

Using map() or dict for Simple Lookups

When the transformation is a direct value-to-value mapping, Series.map() is cleaner and often faster:

# map with a dictionary
dept_codes = {'Engineering': 'ENG', 'Marketing': 'MKT', 'Sales': 'SLS'}
df['dept_code'] = df['department'].map(dept_codes)
print(df[['department', 'dept_code']])

Output:

    department dept_code
0  Engineering       ENG
1    Marketing       MKT
2  Engineering       ENG
3        Sales       SLS
4    Marketing       MKT

Apply on a DataFrame

DataFrame.apply() passes an entire column or an entire row to your function, depending on the axis parameter. This is where many developers get confused.

axis=0: Column-Wise (Default)

With axis=0, pandas passes each column as a Series to your function. The result is a Series indexed by column names.

# Get the range (max - min) of each numeric column
def column_range(col):
    return col.max() - col.min()
 
ranges = df[['salary', 'years_exp', 'rating']].apply(column_range, axis=0)
print(ranges)

Output:

salary       26000.0
years_exp        9.0
rating           1.8
dtype: float64

Another example -- compute multiple statistics per column:

def column_stats(col):
    return pd.Series({
        'mean': col.mean(),
        'std': col.std(),
        'min': col.min(),
        'max': col.max()
    })
 
stats = df[['salary', 'years_exp', 'rating']].apply(column_stats, axis=0)
print(stats)

Output:

          salary  years_exp  rating
mean   76600.00       6.80    4.10
std     9939.82       3.27    0.70
min    65000.00       3.00    3.10
max    91000.00      12.00    4.90

axis=1: Row-Wise

With axis=1, pandas passes each row as a Series to your function. Each Series has the column names as its index, so you access columns by name.

# Calculate a performance bonus based on rating and salary
def calc_bonus(row):
    if row['rating'] >= 4.5:
        return row['salary'] * 0.15
    elif row['rating'] >= 4.0:
        return row['salary'] * 0.10
    else:
        return row['salary'] * 0.05
 
df['bonus'] = df.apply(calc_bonus, axis=1)
print(df[['name_clean', 'salary', 'rating', 'bonus']])

Output:

      name_clean  salary  rating    bonus
0  Alice Johnson   75000     4.2   7500.0
1      Bob Smith   82000     3.8   4100.0
2  Charlie Brown   65000     4.5   9750.0
3   Diana Prince   91000     4.9  13650.0
4     Eve Torres   70000     3.1   3500.0

A Clear Mental Model for axis

This table clears up confusion around the axis parameter:

| Parameter | What is passed to func    | Direction                        | Typical use                              |
|-----------|---------------------------|----------------------------------|------------------------------------------|
| axis=0    | Each column (as a Series) | Vertical -- down the rows        | Aggregations (mean, range, custom stats) |
| axis=1    | Each row (as a Series)    | Horizontal -- across the columns | Row-level logic using multiple columns   |

Think of it this way: axis=0 collapses along axis 0 (rows), so the function receives a column. axis=1 collapses along axis 1 (columns), so the function receives a row.
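The mental model in one runnable sketch (a tiny two-by-two frame, invented for illustration):

```python
import pandas as pd

tiny = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# axis=0: func receives each column; result is indexed by column names
col_sums = tiny.apply(lambda col: col.sum(), axis=0)
print(col_sums.to_dict())   # {'x': 3, 'y': 30}

# axis=1: func receives each row; result is indexed by row labels
row_sums = tiny.apply(lambda row: row.sum(), axis=1)
print(row_sums.to_dict())   # {0: 11, 1: 22}
```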

Lambda Functions with apply

Lambda functions are anonymous, inline functions defined with the lambda keyword. They work well for short, one-line transformations.

Common Lambda Patterns

# Numeric transformation
df['salary_k'] = df['salary'].apply(lambda x: f"${x/1000:.0f}K")
 
# Boolean flag
df['high_performer'] = df['rating'].apply(lambda x: x >= 4.0)
 
# String extraction -- get first name
df['first_name'] = df['name_clean'].apply(lambda x: x.split()[0])
 
print(df[['name_clean', 'salary_k', 'high_performer', 'first_name']])

Output:

      name_clean salary_k  high_performer first_name
0  Alice Johnson     $75K            True      Alice
1      Bob Smith     $82K           False        Bob
2  Charlie Brown     $65K            True    Charlie
3   Diana Prince     $91K            True      Diana
4     Eve Torres     $70K           False        Eve

Multi-Column Access in Row-Wise Lambda

When you need data from multiple columns, use axis=1 and access the row like a dictionary:

# Create a summary string from multiple columns
df['summary'] = df.apply(
    lambda row: f"{row['name_clean']} ({row['department']}) - {row['level']}",
    axis=1
)
print(df['summary'])

Output:

0       Alice Johnson (Engineering) - Mid
1          Bob Smith (Marketing) - Senior
2    Charlie Brown (Engineering) - Junior
3           Diana Prince (Sales) - Senior
4            Eve Torres (Marketing) - Mid
Name: summary, dtype: object

When Lambda Gets Too Long, Use a Named Function

If your lambda spans multiple lines or needs if/else chains, switch to a named function. The rule is simple: if you cannot read the lambda in 5 seconds, write a proper function.

# Too complex for lambda -- use a named function
def compensation_tier(row):
    base = row['salary']
    exp = row['years_exp']
    rate = row['rating']
 
    score = (base / 10000) + (exp * 2) + (rate * 5)
 
    if score > 40:
        return 'Tier 1'
    elif score > 30:
        return 'Tier 2'
    else:
        return 'Tier 3'
 
df['comp_tier'] = df.apply(compensation_tier, axis=1)
print(df[['name_clean', 'salary', 'years_exp', 'rating', 'comp_tier']])

Output:

      name_clean  salary  years_exp  rating comp_tier
0  Alice Johnson   75000          5     4.2    Tier 2
1      Bob Smith   82000          8     3.8    Tier 1
2  Charlie Brown   65000          3     4.5    Tier 2
3   Diana Prince   91000         12     4.9    Tier 1
4     Eve Torres   70000          6     3.1    Tier 2

Passing Extra Arguments

Sometimes your function needs additional parameters beyond the element or row. Use args for positional arguments and **kwargs for keyword arguments.

Positional Arguments with args

def apply_raise(salary, pct_raise, min_raise):
    """Apply percentage raise with a minimum floor."""
    raise_amount = salary * (pct_raise / 100)
    return salary + max(raise_amount, min_raise)
 
# 5% raise with minimum $4000
df['new_salary'] = df['salary'].apply(apply_raise, args=(5, 4000))
print(df[['name_clean', 'salary', 'new_salary']])

Output:

      name_clean  salary  new_salary
0  Alice Johnson   75000     79000.0
1      Bob Smith   82000     86100.0
2  Charlie Brown   65000     69000.0
3   Diana Prince   91000     95550.0
4     Eve Torres   70000     74000.0

Keyword Arguments with **kwargs

def format_salary(salary, currency='USD', decimals=0):
    """Format salary with currency symbol."""
    symbols = {'USD': '$', 'EUR': '\u20ac', 'GBP': '\u00a3'}
    symbol = symbols.get(currency, currency)
    return f"{symbol}{salary:,.{decimals}f}"
 
# Pass keyword arguments directly
df['salary_formatted'] = df['salary'].apply(
    format_salary, currency='USD', decimals=2
)
print(df[['name_clean', 'salary_formatted']])

Output:

      name_clean salary_formatted
0  Alice Johnson      $75,000.00
1      Bob Smith      $82,000.00
2  Charlie Brown      $65,000.00
3   Diana Prince      $91,000.00
4     Eve Torres      $70,000.00

Combining args and kwargs

def compute_tax(salary, tax_rate, deduction=0):
    """Calculate tax after deductions."""
    taxable = max(salary - deduction, 0)
    return round(taxable * tax_rate, 2)
 
# tax_rate via args, deduction via kwargs
df['tax'] = df['salary'].apply(compute_tax, args=(0.22,), deduction=12000)
print(df[['name_clean', 'salary', 'tax']])

Output:

      name_clean  salary      tax
0  Alice Johnson   75000  13860.0
1      Bob Smith   82000  15400.0
2  Charlie Brown   65000  11660.0
3   Diana Prince   91000  17380.0
4     Eve Torres   70000  12760.0

apply() vs map() vs applymap() vs transform()

Pandas has several methods that look similar. Choosing the wrong one leads to slow code or unexpected results. Here is the definitive comparison.

Comparison Table

| Method                  | Works On   | Input to Function                        | Returns             | Primary Use                                        |
|-------------------------|------------|------------------------------------------|---------------------|----------------------------------------------------|
| Series.apply()          | One column | Each element                             | Series              | Custom element-wise transformation                 |
| Series.map()            | One column | Each element (or dict/Series for lookup) | Series              | Value substitution, simple mapping                 |
| DataFrame.apply(axis=0) | DataFrame  | Each column (as Series)                  | Series or DataFrame | Column-level aggregation                           |
| DataFrame.apply(axis=1) | DataFrame  | Each row (as Series)                     | Series or DataFrame | Row-level logic across columns                     |
| DataFrame.map()         | DataFrame  | Each element                             | DataFrame           | Element-wise on entire DataFrame (pandas 2.1+)     |
| DataFrame.applymap()    | DataFrame  | Each element                             | DataFrame           | Same as DataFrame.map() (deprecated in 2.1)        |
| DataFrame.transform()   | DataFrame  | Each column or group                     | Same shape as input | Must return same-length output; used with groupby  |

When to Use Each

# 1. Series.apply() -- custom logic per element
df['salary_grade'] = df['salary'].apply(
    lambda x: 'A' if x > 85000 else 'B' if x > 70000 else 'C'
)
 
# 2. Series.map() -- direct substitution from a dictionary
grade_labels = {'A': 'Executive', 'B': 'Manager', 'C': 'Staff'}
df['grade_label'] = df['salary_grade'].map(grade_labels)
 
# 3. DataFrame.apply(axis=0) -- aggregate each column
summary = df[['salary', 'years_exp']].apply(np.mean, axis=0)
 
# 4. DataFrame.apply(axis=1) -- row-wise logic
df['score'] = df.apply(
    lambda r: r['rating'] * 10 + r['years_exp'], axis=1
)
 
# 5. DataFrame.map() -- element-wise on entire DataFrame (pandas >= 2.1)
numeric_df = df[['salary', 'years_exp']].map(lambda x: x * 1.1)
 
# 6. transform() -- must preserve shape, often with groupby
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
 
print(df[['name_clean', 'salary', 'salary_grade', 'grade_label', 'score', 'dept_avg_salary']])

Output:

      name_clean  salary salary_grade grade_label  score  dept_avg_salary
0  Alice Johnson   75000            B     Manager   47.0          70000.0
1      Bob Smith   82000            B     Manager   46.0          76000.0
2  Charlie Brown   65000            C       Staff   48.0          70000.0
3   Diana Prince   91000            A   Executive   61.0          91000.0
4     Eve Torres   70000            C       Staff   37.0          76000.0

Key Differences at a Glance

  • apply vs map: map() accepts dictionaries and Series for lookups. apply() only accepts callables. For simple mappings, map() is more readable and slightly faster.
  • apply vs transform: transform() must return output with the same length as input. apply() can return any shape. Use transform() inside groupby() when you want per-group computations broadcast back to original rows.
  • apply vs applymap: applymap() is deprecated since pandas 2.1. Use DataFrame.map() instead for element-wise operations on an entire DataFrame.

Real-World Examples

Data Cleaning Pipeline

This example demonstrates a typical data cleaning workflow where apply handles operations that lack clean vectorized alternatives.

import re
 
# Raw messy data
raw_df = pd.DataFrame({
    'date_str': ['2024-01-15', 'Jan 20, 2024', '15/02/2024', '2024.03.10', 'March 5 2024'],
    'phone': ['(555) 123-4567', '555.987.6543', '555 456 7890', '+1-555-222-3333', '5551114444'],
    'amount': ['$1,234.56', '2345.67', '$890', '1,100.00', '$45.5'],
    'status': ['  Active ', 'INACTIVE', 'active', 'Pending', ' ACTIVE']
})
 
print("Before cleaning:")
print(raw_df)
# Step 1: Normalize dates
def parse_date(date_str):
    """Try multiple date formats."""
    formats = ['%Y-%m-%d', '%b %d, %Y', '%d/%m/%Y', '%Y.%m.%d', '%B %d %Y']
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except ValueError:
            continue
    return pd.NaT
 
raw_df['date_parsed'] = raw_df['date_str'].apply(parse_date)
 
# Step 2: Normalize phone numbers
def clean_phone(phone):
    """Extract digits and format consistently."""
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]  # Remove country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone  # Return original if unexpected format
 
raw_df['phone_clean'] = raw_df['phone'].apply(clean_phone)
 
# Step 3: Parse currency amounts
def parse_amount(amount_str):
    """Remove currency symbols and commas, convert to float."""
    cleaned = re.sub(r'[$,]', '', str(amount_str))
    try:
        return float(cleaned)
    except ValueError:
        return np.nan
 
raw_df['amount_parsed'] = raw_df['amount'].apply(parse_amount)
 
# Step 4: Normalize status (this one is better vectorized)
raw_df['status_clean'] = raw_df['status'].str.strip().str.lower()
 
print("\nAfter cleaning:")
print(raw_df[['date_parsed', 'phone_clean', 'amount_parsed', 'status_clean']])

Output:

After cleaning:
  date_parsed       phone_clean  amount_parsed status_clean
0  2024-01-15  (555) 123-4567        1234.56       active
1  2024-01-20  (555) 987-6543        2345.67     inactive
2  2024-02-15  (555) 456-7890         890.00       active
3  2024-03-10  (555) 222-3333        1100.00      pending
4  2024-03-05  (555) 111-4444          45.50       active

Notice that step 4 uses .str accessor methods (vectorized) instead of apply -- that is deliberate. Use apply only where vectorized methods fall short. For a deeper dive into handling NaN results from cleaning operations, see the pandas fillna guide.

Feature Engineering for Machine Learning

# Employee data for ML feature engineering
ml_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'salary': [75000, 82000, 65000, 91000, 70000, 95000, 58000, 88000],
    'years_exp': [5, 8, 3, 12, 6, 15, 2, 10],
    'department': ['Eng', 'Mkt', 'Eng', 'Sales', 'Mkt', 'Eng', 'Sales', 'Eng'],
    'rating': [4.2, 3.8, 4.5, 4.9, 3.1, 4.0, 3.5, 4.7],
    'last_promotion_years_ago': [1, 3, 0, 2, 4, 1, 5, 2]
})
 
# Feature 1: Salary per year of experience
ml_df['salary_per_exp'] = ml_df.apply(
    lambda r: r['salary'] / max(r['years_exp'], 1), axis=1
)
 
# Feature 2: Composite engagement score
def engagement_score(row):
    """Combine rating, tenure, and promotion recency into a single score."""
    rating_score = row['rating'] / 5.0  # Normalize to 0-1
    tenure_score = min(row['years_exp'] / 15.0, 1.0)  # Cap at 15 years
    promo_score = max(1 - (row['last_promotion_years_ago'] / 5.0), 0)
 
    # Weighted combination
    return round(rating_score * 0.5 + tenure_score * 0.3 + promo_score * 0.2, 3)
 
ml_df['engagement'] = ml_df.apply(engagement_score, axis=1)
 
# Feature 3: Salary deviation from department average
dept_avg = ml_df.groupby('department')['salary'].transform('mean')
ml_df['salary_vs_dept'] = ((ml_df['salary'] - dept_avg) / dept_avg * 100).round(1)
 
# Feature 4: Risk flag combining multiple signals
def attrition_risk(row):
    """Flag employees at risk of leaving."""
    risk_factors = 0
    if row['rating'] < 3.5:
        risk_factors += 1
    if row['last_promotion_years_ago'] >= 4:
        risk_factors += 1
    if row['salary_vs_dept'] < -10:
        risk_factors += 1
    if row['years_exp'] >= 5 and row['last_promotion_years_ago'] >= 3:
        risk_factors += 1
 
    if risk_factors >= 3:
        return 'High'
    elif risk_factors >= 2:
        return 'Medium'
    else:
        return 'Low'
 
ml_df['risk'] = ml_df.apply(attrition_risk, axis=1)
 
print(ml_df[['name', 'salary_per_exp', 'engagement', 'salary_vs_dept', 'risk']])

Output:

      name  salary_per_exp  engagement  salary_vs_dept    risk
0    Alice        15000.00        0.68            -7.1     Low
1      Bob        10250.00        0.62             7.9     Low
2  Charlie        21666.67        0.71           -19.5     Low
3    Diana         7583.33        0.85            22.1     Low
4      Eve        11666.67        0.47            -7.9    High
5    Frank         6333.33        0.86            17.6     Low
6    Grace        29000.00        0.39           -22.1  Medium
7    Henry         8800.00        0.79             9.0     Low

Returning Multiple Columns from apply

When a function produces several outputs, return a pd.Series and pandas expands it into columns automatically. You can then combine the result with the original DataFrame using pd.concat().

def salary_breakdown(row):
    """Break salary into components."""
    gross = row['salary']
    federal_tax = gross * 0.22
    state_tax = gross * 0.05
    insurance = 500 * 12  # $500/month
    retirement = gross * 0.06
    net = gross - federal_tax - state_tax - insurance - retirement
 
    return pd.Series({
        'federal_tax': federal_tax,
        'state_tax': state_tax,
        'insurance': insurance,
        'retirement': retirement,
        'net_pay': net
    })
 
breakdown = ml_df.apply(salary_breakdown, axis=1)
result = pd.concat([ml_df[['name', 'salary']], breakdown], axis=1)
print(result)

Output:

      name  salary  federal_tax  state_tax  insurance  retirement   net_pay
0    Alice   75000      16500.0     3750.0     6000.0      4500.0   44250.0
1      Bob   82000      18040.0     4100.0     6000.0      4920.0   48940.0
2  Charlie   65000      14300.0     3250.0     6000.0      3900.0   37550.0
3    Diana   91000      20020.0     4550.0     6000.0      5460.0   54970.0
4      Eve   70000      15400.0     3500.0     6000.0      4200.0   40900.0
5    Frank   95000      20900.0     4750.0     6000.0      5700.0   57650.0
6    Grace   58000      12760.0     2900.0     6000.0      3480.0   32860.0
7    Henry   88000      19360.0     4400.0     6000.0      5280.0   52960.0

The result_type Parameter

When DataFrame.apply() returns list-like objects from each row, the result_type parameter controls the output format.

| result_type    | Behavior                                           | When to use                                  |
|----------------|----------------------------------------------------|----------------------------------------------|
| None (default) | Pandas infers the output type                      | General purpose                              |
| 'expand'       | List-like results become separate columns          | When the function returns a list or tuple    |
| 'reduce'       | Return a Series instead of a DataFrame if possible | Aggregation that returns scalars             |
| 'broadcast'    | Output has the same shape as the input DataFrame   | When you want results broadcast to all rows  |

# Without result_type -- returns a Series of lists
result_default = df[['salary', 'years_exp']].apply(
    lambda row: [row['salary'], row['salary'] * 1.1],
    axis=1
)
print("Default:")
print(result_default)
print(type(result_default.iloc[0]))
# With result_type='expand' -- splits list into columns
result_expand = df[['salary', 'years_exp']].apply(
    lambda row: [row['salary'], row['salary'] * 1.1],
    axis=1,
    result_type='expand'
)
result_expand.columns = ['current', 'projected']
print("\nExpand:")
print(result_expand)

Output:

Expand:
   current  projected
0    75000    82500.0
1    82000    90200.0
2    65000    71500.0
3    91000   100100.0
4    70000    77000.0
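The 'reduce' and 'broadcast' options deserve a quick sketch too (a throwaway two-column frame; note that 'broadcast' requires the function's output to conform to the original columns):

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 'broadcast': the returned list is broadcast back into the original
# columns, so the result keeps the index and columns of d
b = d.apply(lambda row: [row['a'] * 2, row['b'] * 2], axis=1, result_type='broadcast')
print(list(b.columns))   # ['a', 'b']
print(b['a'].tolist())   # [2, 4]

# 'reduce': collapse to a Series when possible (here, a Series of lists)
r = d.apply(lambda row: [row['a'], row['b']], axis=1, result_type='reduce')
print(type(r).__name__)  # Series
```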

Using apply() with groupby()

Combining groupby() with apply() runs your function on each group independently -- a powerful pattern for group-level transformations.

# Normalize salary within each department (z-score)
def zscore_group(group):
    return (group - group.mean()) / group.std()
 
# group_keys=False keeps the original index so the result aligns back to df
# (with the default group_keys=True in pandas 2.0+, apply would prepend the
# department level to the index and the assignment would misalign)
df['salary_zscore'] = df.groupby('department', group_keys=False)['salary'].apply(zscore_group)
print(df[['name_clean', 'department', 'salary', 'salary_zscore']])
# Custom group aggregation
def department_report(group):
    return pd.Series({
        'headcount': len(group),
        'avg_salary': group['salary'].mean(),
        'avg_rating': group['rating'].mean(),
        'top_performer': group.loc[group['rating'].idxmax(), 'name_clean']
    })
 
# pandas 2.2+ warns that apply operates on the grouping columns;
# pass include_groups=False there to silence the deprecation warning
report = df.groupby('department').apply(department_report)
print(report)

Output:

             headcount  avg_salary  avg_rating  top_performer
department
Engineering          2     70000.0        4.35  Charlie Brown
Marketing            2     76000.0        3.45      Bob Smith
Sales                1     91000.0        4.90   Diana Prince

Using Progress Bars with tqdm

Long-running apply operations can feel like a black box. The tqdm library adds a progress bar with one line of code.

from tqdm import tqdm
 
# Register tqdm with pandas
tqdm.pandas()
 
# Now use progress_apply instead of apply
# df['result'] = df['column'].progress_apply(slow_function)
 
# Example with a simulated slow function
import time
 
def slow_transform(value):
    time.sleep(0.01)  # Simulate processing time
    return value * 2
 
# Shows: 100%|##########| 5/5 [00:00<00:00, 98.33it/s]
# df['doubled'] = df['salary'].progress_apply(slow_transform)

Performance: When NOT to Use apply()

This section is critical. The apply() method executes a Python function in a loop, one element or row at a time. Pandas' built-in operations run in compiled C code, making them 10-100x faster.

Benchmark: apply vs Vectorized

import time
 
# Create a large DataFrame
n = 500_000
large_df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.choice(['X', 'Y', 'Z'], n)
})
 
# Test 1: Simple arithmetic
start = time.time()
r1 = large_df['a'].apply(lambda x: x ** 2 + 1)
t_apply = time.time() - start
 
start = time.time()
r2 = large_df['a'] ** 2 + 1
t_vec = time.time() - start
 
print(f"Squaring {n:,} values:")
print(f"  apply():    {t_apply:.4f}s")
print(f"  vectorized: {t_vec:.4f}s")
print(f"  speedup:    {t_apply/t_vec:.0f}x")

Typical output on a modern machine:

Squaring 500,000 values:
  apply():    0.1523s
  vectorized: 0.0018s
  speedup:    85x

# Test 2: Row-wise operation with axis=1
start = time.time()
r3 = large_df.apply(lambda row: row['a'] + row['b'], axis=1)
t_apply_row = time.time() - start
 
start = time.time()
r4 = large_df['a'] + large_df['b']
t_vec_row = time.time() - start
 
print(f"\nRow-wise addition of {n:,} rows:")
print(f"  apply(axis=1): {t_apply_row:.4f}s")
print(f"  vectorized:    {t_vec_row:.4f}s")
print(f"  speedup:       {t_apply_row/t_vec_row:.0f}x")

Typical output:

Row-wise addition of 500,000 rows:
  apply(axis=1): 4.8721s
  vectorized:    0.0023s
  speedup:       2118x

Row-wise apply with axis=1 is especially slow because pandas constructs a new Series object for every single row. If you find yourself iterating row by row, also consider whether iterrows() fits your use case -- though apply is generally faster than explicit iteration.
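To make that comparison concrete, here is the same row-wise sum written three ways on a toy frame; all three agree, but only the last avoids per-row Python overhead:

```python
import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Explicit iteration with iterrows(): slowest, each row becomes a Series
totals = []
for _, row in d.iterrows():
    totals.append(row['a'] + row['b'])

# apply(axis=1): still one Series per row, but less Python bookkeeping
via_apply = d.apply(lambda row: row['a'] + row['b'], axis=1)

# Vectorized: one operation over whole columns in compiled code
via_vector = d['a'] + d['b']

print(totals)              # [11, 22, 33]
print(via_apply.tolist())  # [11, 22, 33]
print(via_vector.tolist()) # [11, 22, 33]
```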

Vectorized Alternatives Cheatsheet

| Operation           | Slow (apply)                             | Fast (vectorized)                                |
|---------------------|------------------------------------------|--------------------------------------------------|
| Arithmetic          | .apply(lambda x: x * 2)                  | col * 2                                          |
| Conditional         | .apply(lambda x: 'A' if x > 10 else 'B') | np.where(col > 10, 'A', 'B')                     |
| Multiple conditions | .apply(complex_if_elif)                  | np.select([cond1, cond2], [val1, val2], default) |
| String case         | .apply(lambda x: x.lower())              | .str.lower()                                     |
| String contains     | .apply(lambda x: 'abc' in x)             | .str.contains('abc')                             |
| String split        | .apply(lambda x: x.split(',')[0])        | .str.split(',').str[0]                           |
| Date year           | .apply(lambda x: x.year)                 | .dt.year                                         |
| Date formatting     | .apply(lambda x: x.strftime('%Y-%m'))    | .dt.strftime('%Y-%m')                            |
| Clipping values     | .apply(lambda x: min(max(x, 0), 100))    | .clip(0, 100)                                    |
| Absolute value      | .apply(abs)                              | .abs() or np.abs(col)                            |
| Rounding            | .apply(lambda x: round(x, 2))            | .round(2)                                        |

Multi-Condition Example: np.select vs apply

# SLOW: Using apply with if/elif
def categorize_slow(row):
    if row['a'] > 1 and row['b'] > 1:
        return 'Both High'
    elif row['a'] > 1:
        return 'A High'
    elif row['b'] > 1:
        return 'B High'
    else:
        return 'Neither'
 
start = time.time()
large_df['cat_slow'] = large_df.apply(categorize_slow, axis=1)
t_slow = time.time() - start
 
# FAST: Using np.select
conditions = [
    (large_df['a'] > 1) & (large_df['b'] > 1),
    large_df['a'] > 1,
    large_df['b'] > 1,
]
choices = ['Both High', 'A High', 'B High']
 
start = time.time()
large_df['cat_fast'] = np.select(conditions, choices, default='Neither')
t_fast = time.time() - start
 
print(f"apply(axis=1): {t_slow:.3f}s")
print(f"np.select:     {t_fast:.3f}s")
print(f"Speedup:       {t_slow/t_fast:.0f}x")

Typical output:

apply(axis=1): 8.234s
np.select:     0.025s
Speedup:       329x

When apply() IS Justified

Despite the performance cost, apply is the right choice in these situations:

  1. Complex logic with no vectorized equivalent: Functions calling external APIs, parsing nested JSON, or implementing business rules with many branches.
  2. Small DataFrames: Under ~10,000 rows, the speed difference is negligible and readability matters more.
  3. Prototyping: Write it with apply first, optimize to vectorized later if performance matters.
  4. Functions returning multiple values: When you need to produce several new columns from complex row-level logic, returning a pd.Series from apply is clean and practical.
  5. Try/except error handling: Vectorized operations raise errors on the entire column. Apply lets you handle exceptions per element.
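Points 1 and 5 often occur together. Here is a hypothetical nested-record column (the `records` data and `extract_plan` helper are invented for this sketch) where no vectorized method reaches into the structure and per-element error handling is essential:

```python
import pandas as pd

# A column of nested dicts -- no vectorized string or numeric method
# digs into this structure, so apply() is the natural fit
records = pd.Series([
    {'user': {'id': 1, 'plan': 'pro'}},
    {'user': {'id': 2}},                 # missing 'plan' key
    None,                                # missing record entirely
])

def extract_plan(rec):
    """Pull the plan out of a nested record, tolerating bad rows."""
    try:
        return rec['user'].get('plan', 'free')
    except (TypeError, KeyError):
        return 'unknown'

print(records.apply(extract_plan).tolist())  # ['pro', 'free', 'unknown']
```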

The raw Parameter for Speed

If your function only needs the numeric values (not column names or index), set raw=True to pass NumPy arrays instead of Series objects. This avoids the overhead of constructing a Series for each call.

# Without raw: pandas constructs a Series for each row, so use
# positional access via .iloc (plain row[0] on a label-indexed Series
# is deprecated in pandas 2.x)
start = time.time()
result_series = large_df[['a', 'b']].apply(lambda row: row.iloc[0] + row.iloc[1], axis=1, raw=False)
t_series = time.time() - start
 
# With raw: pandas passes a numpy array, so plain positional indexing works
start = time.time()
result_raw = large_df[['a', 'b']].apply(lambda row: row[0] + row[1], axis=1, raw=True)
t_raw = time.time() - start
 
print(f"raw=False: {t_series:.3f}s")
print(f"raw=True:  {t_raw:.3f}s")
print(f"Speedup:   {t_series/t_raw:.1f}x")

Typical output:

raw=False: 4.872s
raw=True:  0.943s
Speedup:   5.2x

Setting raw=True gives a 3-5x speedup when you do not need index or column labels. Still slower than full vectorization, but a useful middle ground.

Handling Errors Inside apply

Real-world data is messy. Your function will encounter NaN values, unexpected types, and malformed strings. Wrapping logic in try/except prevents one bad row from crashing the entire operation.

messy_data = pd.DataFrame({
    'value': ['42', '3.14', 'N/A', '100', '', None, 'abc', '7.5']
})
 
def safe_float(val):
    """Convert to float with error handling."""
    try:
        return float(val)
    except (ValueError, TypeError):
        return np.nan
 
messy_data['parsed'] = messy_data['value'].apply(safe_float)
print(messy_data)

Output:

  value  parsed
0    42   42.00
1  3.14    3.14
2   N/A     NaN
3   100  100.00
4          NaN
5  None    NaN
6   abc    NaN
7   7.5    7.50

Debugging apply Functions

Test your function on a single element before applying it to the entire DataFrame:

# Step 1: Test on one row
test_row = df.iloc[0]
print("Test input:", test_row.to_dict())
 
result = calc_bonus(test_row)
print("Test output:", result)
 
# Step 2: Test on a small slice
small_result = df.head(3).apply(calc_bonus, axis=1)
print("Small batch:", small_result.tolist())
 
# Step 3: Apply to full DataFrame
df['bonus'] = df.apply(calc_bonus, axis=1)

Common Errors and Fixes

TypeError: 'float' object is not iterable

This happens when your function returns a scalar but pandas expects a sequence, or vice versa.

# ERROR: Returning a list when pandas expects a scalar
def broken(x):
    return [x, x * 2]  # Returns list, assigned to single column
 
# This creates a column of lists, not two columns
df['result'] = df['salary'].apply(broken)
# Each cell is [salary, salary*2] -- probably not what you want
 
# FIX: Use result_type='expand' on DataFrame.apply
result = df[['salary']].apply(
    lambda row: [row['salary'], row['salary'] * 2],
    axis=1,
    result_type='expand'
)
result.columns = ['original', 'doubled']

SettingWithCopyWarning

This occurs when you assign the result of apply() to a slice of a DataFrame rather than to the original (or an explicit copy).

# WARNING: Operating on a slice
subset = df[df['department'] == 'Engineering']
subset['bonus'] = subset.apply(calc_bonus, axis=1)  # SettingWithCopyWarning
 
# FIX: Use .loc or .copy()
subset = df[df['department'] == 'Engineering'].copy()
subset['bonus'] = subset.apply(calc_bonus, axis=1)  # Clean
 
# Or use .loc on the original DataFrame
df.loc[df['department'] == 'Engineering', 'bonus'] = (
    df[df['department'] == 'Engineering'].apply(calc_bonus, axis=1)
)

KeyError When Accessing Columns in axis=1

When using axis=1, each row is a Series with column names as index. A typo in the column name raises a KeyError.

# ERROR: Typo in column name
def broken_func(row):
    return row['salry']  # Typo: should be 'salary'
 
# df.apply(broken_func, axis=1)  # KeyError: 'salry'
 
# FIX: Print column names first
print(df.columns.tolist())
# ['name', 'email', 'salary', 'department', 'years_exp', 'rating', ...]
 
# Or use .get() for safe access
def safe_func(row):
    return row.get('salary', 0) * row.get('years_exp', 1)

Function Returns Wrong Shape

When your function returns inconsistent types (sometimes scalar, sometimes Series), pandas cannot build the result properly.

# ERROR: Inconsistent return types
def inconsistent(row):
    if row['salary'] > 80000:
        return pd.Series({'bonus': 5000, 'flag': True})
    else:
        return 0  # Returns scalar for some rows, Series for others
 
# FIX: Always return the same type
def consistent(row):
    if row['salary'] > 80000:
        return pd.Series({'bonus': 5000, 'flag': True})
    else:
        return pd.Series({'bonus': 0, 'flag': False})
 
result = df.apply(consistent, axis=1)
print(result)

Visualize Your Transformed Data

After applying transformations, explore the results visually to validate your logic and uncover patterns. PyGWalker turns any pandas DataFrame into a Tableau-like drag-and-drop interface right inside Jupyter notebooks -- no chart code needed.

import pygwalker as pyg
 
# After applying transformations, visualize interactively
walker = pyg.walk(df)

With PyGWalker you can:

  • Compare original vs transformed columns with side-by-side charts
  • Validate conditional logic by filtering and grouping transformed data
  • Spot outliers in computed fields through distribution plots
  • Build charts by dragging columns -- no matplotlib or seaborn syntax required

For an even smoother experience, try RunCell -- an AI-powered Jupyter environment where an agent can help you write apply functions, debug errors, and generate visualizations through natural language. It is especially useful when experimenting with complex row-wise transformations.

Summary: apply() Decision Flowchart

Use this mental checklist every time you reach for apply():

  1. Is there a built-in pandas method? (.str, .dt, arithmetic, np.where, np.select) -- Use it. 10-100x faster.
  2. Is it a simple value mapping? -- Use Series.map() with a dictionary.
  3. Do you need the same shape back with groupby? -- Use transform().
  4. Is the logic complex, multi-branched, or involves try/except? -- Use apply(). It is the right tool here.
  5. Is the DataFrame small (under 10k rows)? -- Use apply(). Readability wins.
  6. Is it a row-wise operation on a large DataFrame? -- Think twice. Can you restructure with vectorized ops? If not, set raw=True for a partial speedup.
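To make item 1 concrete, here is a sketch (with invented salary data) of the same banding logic written both ways -- np.select runs the conditions at C speed, while the apply() version loops in Python:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [45000, 72000, 95000, 120000]})

# Vectorized: np.select checks conditions in order, first match wins
conditions = [df['salary'] >= 100000, df['salary'] >= 70000]
choices = ['senior', 'mid']
df['band'] = np.select(conditions, choices, default='junior')

# Equivalent apply() version -- same result, Python-loop speed
def band(s):
    if s >= 100000:
        return 'senior'
    elif s >= 70000:
        return 'mid'
    return 'junior'

assert (df['band'] == df['salary'].apply(band)).all()
```

Both produce identical columns; on large DataFrames the np.select version is the one you want.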

FAQ

Is pandas apply slow?

Yes, compared to vectorized operations. apply() runs a Python loop internally, which is 10-100x slower than pandas' built-in C-optimized methods. For simple math, string operations, or conditionals, always prefer vectorized alternatives like np.where(), .str accessor, or direct column arithmetic. Apply is appropriate for complex logic that has no vectorized equivalent, or for DataFrames under 10,000 rows where the difference is negligible.
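You can measure the gap yourself with a quick (unscientific) timing sketch -- exact numbers depend on your machine, but the ordering should not:

```python
import time
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(1_000_000))

t0 = time.perf_counter()
vectorized = s * 2 + 1          # C-optimized, whole-array operation
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
applied = s.apply(lambda x: x * 2 + 1)  # Python function call per element
t_apply = time.perf_counter() - t0

print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")
assert np.allclose(vectorized, applied)  # identical results, different speed
```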

What is the difference between apply and map in pandas?

Series.apply() takes any callable function and runs it on each element. Series.map() also accepts dictionaries and other Series for value substitution. For lookups (mapping one value to another), map() is more readable and often faster. For custom transformation logic, apply() is the correct choice. On DataFrames, apply() can process entire rows or columns, while map() (replacing the deprecated applymap()) works element-by-element.
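A small illustration of the split (toy data): map() shines for dictionary lookups, apply() for arbitrary callables.

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'red', 'blue'])

# map() with a dict: pure value substitution; unmatched keys become NaN
codes = colors.map({'red': 1, 'green': 2, 'blue': 3})

# apply() with a callable: arbitrary transformation logic
lengths = colors.apply(len)

print(codes.tolist())    # [1, 2, 1, 3]
print(lengths.tolist())  # [3, 5, 3, 4]
```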

How do I use apply with axis=1 to access multiple columns?

Set axis=1 and the function receives each row as a Series. Access columns by name: df.apply(lambda row: row['col_a'] + row['col_b'], axis=1). For named functions, use the same pattern: def func(row): return row['price'] * row['quantity']. Note that axis=1 is significantly slower than vectorized operations, so prefer df['col_a'] + df['col_b'] when possible.

Can I pass extra arguments to the function in apply?

Yes. Use the args parameter for positional arguments and keyword arguments for named ones: df['col'].apply(func, args=(10, 20), multiplier=1.5). Inside the function, the first parameter is the element (or row/column), followed by the positional args, then keyword args.
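For example (the function and values here are made up), the element always arrives first, then the args tuple, then keywords:

```python
import pandas as pd

s = pd.Series([100, 200, 300])

def adjust(value, offset, scale, multiplier=1.0):
    # value comes from the Series; offset/scale from args; multiplier by keyword
    return (value + offset) * scale * multiplier

result = s.apply(adjust, args=(10, 2), multiplier=1.5)
print(result.tolist())  # [330.0, 630.0, 930.0]
```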

How do I return multiple columns from a single apply call?

Return a pd.Series with named indices from your function: return pd.Series({'col1': val1, 'col2': val2}). When used with axis=1, pandas automatically expands the result into a DataFrame with those column names. Alternatively, return a list and use result_type='expand' to split it into columns.
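As a quick sketch with invented order data, returning a named Series from an axis=1 function yields one output column per index label:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0], 'quantity': [3, 5]})

def totals(row):
    subtotal = row['price'] * row['quantity']
    # The Series index labels become the new column names
    return pd.Series({'subtotal': subtotal, 'tax': subtotal * 0.1})

expanded = df.apply(totals, axis=1)
print(expanded)
```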

When should I use apply vs transform in pandas?

Use transform() when you need the output to have the exact same shape as the input -- this is common with groupby() for broadcasting group-level statistics back to individual rows (e.g., df.groupby('dept')['salary'].transform('mean')). Use apply() when the output shape can differ from the input, such as returning a single summary row per group or computing derived columns with complex logic.
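The shape difference is easiest to see side by side (toy data): transform() broadcasts back to every row, while groupby().apply() can collapse each group to a single value.

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'Sales'],
    'salary': [90000, 110000, 60000],
})

# transform: output aligns with the original rows (same length as df)
df['dept_mean'] = df.groupby('dept')['salary'].transform('mean')

# apply: output can be one value per group (here, the salary range)
summary = df.groupby('dept')['salary'].apply(lambda s: s.max() - s.min())

print(df)
print(summary)
```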

Related Guides
