Section 3: Libraries & Imports
Why reinvent the wheel when someone else already built a Ferrari? Python libraries are like having a team of expert programmers working for you 24/7. Instead of writing complex code to analyze data, you import pandas and get professional-grade tools in one line. It’s the difference between building a car from scratch and driving a Tesla: both get you there, but one takes far less time and effort.
Introduction
Much of Python’s usefulness for data science comes from its extensive library ecosystem. Libraries provide pre-written code for common tasks, saving you time and effort. In data science, libraries like pandas, numpy, and matplotlib are core tools that handle complex operations with simple commands.
Understanding Python Libraries
Libraries are collections of pre-written code that extend Python’s capabilities. They provide functions, classes, and tools for specific tasks.
Built-in vs External Libraries
# Built-in libraries (come with Python)
import math
import random
import datetime
import json
# External libraries (need to be installed)
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
Why Libraries Matter in Data Science
- Efficiency: Pre-written, optimized code
- Reliability: Tested by thousands of developers
- Specialization: Tools designed for specific tasks
- Community: Active development and support
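To make the efficiency and reliability points concrete, here is a minimal sketch that computes a mean and a sample standard deviation twice: once by hand and once with the standard library's statistics module. The sample values are made up for illustration.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration
scores = [72, 85, 90, 66, 88]

# Hand-written version: more code, more places for a bug to hide
manual_mean = sum(scores) / len(scores)
manual_variance = sum((s - manual_mean) ** 2 for s in scores) / (len(scores) - 1)
manual_stdev = math.sqrt(manual_variance)

# Library version: one call each
library_mean = statistics.mean(scores)
library_stdev = statistics.stdev(scores)

print(f"Mean:  manual={manual_mean:.2f}, library={library_mean:.2f}")
print(f"Stdev: manual={manual_stdev:.2f}, library={library_stdev:.2f}")
```

Both versions agree, but the library calls are shorter and already handle details such as dividing by n - 1 for a sample.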
Importing Libraries
Python provides several ways to import libraries and their functions.
Basic Import Methods
# Import entire library
import math
result = math.sqrt(16) # 4.0
# Import specific functions
from math import sqrt, pi
result = sqrt(16) # 4.0
print(pi) # 3.141592653589793
# Import with alias
import math as m
result = m.sqrt(16) # 4.0
# Import all public names (not recommended: it clutters the namespace and hides where names come from)
from math import *
result = sqrt(16) # 4.0
Import Best Practices
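One reason wildcard imports are discouraged is name clashes. The sketch below imports sqrt by name from two standard library modules to show the effect in isolation; a wildcard import does the same rebinding for every public name at once, without any visible hint in the code.

```python
from math import sqrt

real_root = sqrt(16)        # math.sqrt returns a float

# A second import of the same name silently rebinds it
from cmath import sqrt

complex_root = sqrt(16)     # cmath.sqrt returns a complex number

print(type(real_root).__name__, real_root)
print(type(complex_root).__name__, complex_root)
```

Writing math.sqrt and cmath.sqrt with the module prefix avoids the ambiguity entirely.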
# Good: Import at the top of file
import os
import json
from datetime import datetime
# Good: Use the conventional community aliases
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Good: Group related imports
# Standard library imports
import os
import json
from datetime import datetime
# Third-party imports
import pandas as pd
import numpy as np
# Local imports
from my_module import my_function
Core Data Science Libraries
NumPy - Numerical Computing
NumPy provides high-performance array operations and mathematical functions.
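As a rough illustration of what "array operations" means, the sketch below computes the same line totals with a plain Python loop and with a single NumPy expression. The prices and quantities are hypothetical values used only for the comparison.

```python
import numpy as np

# Hypothetical order data, used only for illustration
prices = [19.99, 5.49, 3.50, 12.00]
quantities = [3, 10, 4, 1]

# Pure-Python version: an explicit loop over paired elements
loop_totals = [p * q for p, q in zip(prices, quantities)]

# NumPy version: one vectorized expression over whole arrays
array_totals = np.array(prices) * np.array(quantities)

print(loop_totals)
print(array_totals)
print(np.allclose(loop_totals, array_totals))
```

The results match; the NumPy form is shorter and, on large arrays, typically much faster because the loop runs in compiled code.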
import numpy as np
# Create arrays
numbers = np.array([1, 2, 3, 4, 5])
print(f"Array: {numbers}")
print(f"Type: {type(numbers)}")
# Array operations
print(f"Sum: {np.sum(numbers)}")
print(f"Mean: {np.mean(numbers)}")
print(f"Max: {np.max(numbers)}")
print(f"Min: {np.min(numbers)}")
# Mathematical operations
squared = numbers ** 2
print(f"Squared: {squared}")
# Array creation
zeros = np.zeros(5)
ones = np.ones(5)
range_array = np.arange(0, 10, 2)
print(f"Zeros: {zeros}")
print(f"Ones: {ones}")
print(f"Range: {range_array}")
# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Matrix:\n{matrix}")
print(f"Shape: {matrix.shape}")
Pandas - Data Manipulation
Pandas provides specialized tools for working with structured data.
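One operation that later examples in this section lean on is groupby, which splits rows into groups and aggregates each group. Here is a minimal sketch using the same small employee table as the DataFrame example in this subsection.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "David"],
    "Salary": [50000, 60000, 70000, 55000],
    "Department": ["IT", "HR", "IT", "Finance"],
})

# Split rows by department, then aggregate the Salary column per group
salary_by_department = staff.groupby("Department")["Salary"].agg(["sum", "mean", "count"])
print(salary_by_department)
```

Each department becomes one row in the result, with one column per aggregation.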
import pandas as pd
# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
# Basic operations
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
# Accessing data
print(f"\nFirst 2 rows:\n{df.head(2)}")
print(f"\nAges:\n{df['Age']}")
print(f"\nIT Department:\n{df[df['Department'] == 'IT']}")
# Statistical summary
print(f"\nSummary statistics:\n{df.describe()}")
Matplotlib - Data Visualization
Matplotlib creates charts and graphs for data visualization.
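The examples in this subsection call plt.show(), which needs a display. When a script runs on a server or in a scheduled job, a common alternative is to save the figure to a file instead. The sketch below assumes the non-interactive "Agg" backend and writes to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files, no window
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Saved Without a Display")

# Hypothetical output location inside a temporary directory
output_path = os.path.join(tempfile.mkdtemp(), "squares.png")
fig.savefig(output_path, dpi=100)
plt.close(fig)  # release the figure's memory

print(f"Saved plot to {output_path}")
```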
import matplotlib.pyplot as plt
import numpy as np
# Create sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Sine Wave')
plt.legend()
plt.grid(True)
plt.show()
# Bar chart example
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.figure(figsize=(8, 6))
plt.bar(categories, values, color=['red', 'green', 'blue', 'orange'])
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Sample Bar Chart')
plt.show()
Working with External Libraries
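Before installing anything, it can be useful to check what is already available. The sketch below uses only the standard library; is_installed is a hypothetical helper name, and the list of modules is just an example.

```python
from importlib import metadata
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return find_spec(module_name) is not None


for name in ["json", "pandas", "numpy", "matplotlib"]:
    if not is_installed(name):
        print(f"{name}: not installed")
        continue
    try:
        # Works for installed distributions; standard library modules have none
        print(f"{name}: version {metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: available (standard library)")
```

Anything reported as not installed is a candidate for pip install.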
Installing Libraries
# Install single library
pip install pandas
# Install multiple libraries
pip install pandas numpy matplotlib
# Install specific version
pip install pandas==1.5.0
# Install from requirements file
pip install -r requirements.txt
Requirements File
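A requirements file is simply a list of package names with version constraints. As a rough sketch of where such a list can come from, the code below reads the packages installed in the current environment and formats them as pinned name==version lines, similar in spirit to what pip freeze prints. The function name pinned_requirements is hypothetical.

```python
from importlib import metadata


def pinned_requirements() -> list[str]:
    """Return 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in pinned_requirements()[:5]:  # show only the first few
    print(line)
```

Pinned versions (==) make an environment reproducible, while minimum versions (>=) leave room for upgrades.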
Create a requirements.txt file to manage dependencies:
pandas>=1.5.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
scikit-learn>=1.0.0
Virtual Environments
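A virtual environment is an isolated Python installation with its own set of packages, so one project's libraries cannot conflict with another's. A quick way to check from inside Python whether one is active is to compare sys.prefix with sys.base_prefix; the helper name below is hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs from a venv rather than the base install."""
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtual_environment()}")
```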
# Create virtual environment
python -m venv myenv
# Activate (Windows)
myenv\Scripts\activate
# Activate (Mac/Linux)
source myenv/bin/activate
# Install packages in virtual environment
pip install pandas numpy matplotlib
# Deactivate
deactivate
Data Science Workflow with Libraries
Complete Analysis Example
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
# Create sample sales data
np.random.seed(42) # For reproducible results
dates = pd.date_range('2024-01-01', periods=100, freq='D')
sales_data = {
    'Date': dates,
    'Sales': np.random.normal(1000, 200, 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Product': np.random.choice(['A', 'B', 'C'], 100)
}
# Create DataFrame
df = pd.DataFrame(sales_data)
# Data exploration
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst 5 rows:")
print(df.head())
# Basic statistics
print(f"\nSales Statistics:")
print(df['Sales'].describe())
# Regional analysis
regional_sales = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
print(f"\nRegional Analysis:")
print(regional_sales)
# Product analysis
product_sales = df.groupby('Product')['Sales'].sum()
print(f"\nProduct Sales:")
print(product_sales)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Sales over time
axes[0, 0].plot(df['Date'], df['Sales'])
axes[0, 0].set_title('Sales Over Time')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')
# Regional sales
regional_sales['sum'].plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Total Sales by Region')
axes[0, 1].set_xlabel('Region')
axes[0, 1].set_ylabel('Total Sales')
# Product sales
product_sales.plot(kind='pie', ax=axes[1, 0])
axes[1, 0].set_title('Sales Distribution by Product')
# Sales histogram
axes[1, 1].hist(df['Sales'], bins=20, alpha=0.7)
axes[1, 1].set_title('Sales Distribution')
axes[1, 1].set_xlabel('Sales')
axes[1, 1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Advanced Library Usage
Custom Data Analysis Functions
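The analysis function in this subsection computes growth with pct_change, which expresses each value as a fractional change from the previous row. A minimal sketch with three hypothetical sales figures shows what it produces.

```python
import pandas as pd

# Hypothetical daily sales figures, chosen so the arithmetic is easy to follow
sales = pd.Series([100.0, 110.0, 99.0])

growth = sales.pct_change() * 100  # percent change relative to the previous row
print(growth)

# The first row has no predecessor, so its growth is NaN;
# mean() skips NaN values by default
print(f"Average growth: {growth.mean():.2f}%")
```

Note that a rise of 10% followed by a fall of 10% averages to roughly zero even though sales ended lower than they started, so an average of percentage changes should be read with care.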
import pandas as pd
import numpy as np
from typing import List, Dict, Any
def analyze_sales_data(df: pd.DataFrame) -> Dict[str, Any]:
    """Comprehensive sales data analysis"""
    analysis = {}
    # Basic metrics
    analysis['total_sales'] = df['Sales'].sum()
    analysis['average_sales'] = df['Sales'].mean()
    analysis['max_sales'] = df['Sales'].max()
    analysis['min_sales'] = df['Sales'].min()
    # Growth analysis
    df_sorted = df.sort_values('Date')
    df_sorted['Sales_Growth'] = df_sorted['Sales'].pct_change() * 100
    analysis['average_growth'] = df_sorted['Sales_Growth'].mean()
    # Regional performance
    regional_analysis = df.groupby('Region')['Sales'].agg([
        'sum', 'mean', 'count'
    ]).round(2)
    analysis['regional_performance'] = regional_analysis.to_dict()
    # Product performance
    product_analysis = df.groupby('Product')['Sales'].agg([
        'sum', 'mean', 'count'
    ]).round(2)
    analysis['product_performance'] = product_analysis.to_dict()
    # Time-based analysis
    df['Month'] = df['Date'].dt.month
    monthly_sales = df.groupby('Month')['Sales'].sum()
    analysis['monthly_sales'] = monthly_sales.to_dict()
    return analysis
# Use the function
analysis_results = analyze_sales_data(df)
print("Analysis Results:")
for key, value in analysis_results.items():
    print(f"{key}: {value}")
Error Handling with Libraries
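Errors can also occur before any data is read: the library itself may be missing. A common pattern is to guard the import and fall back to something that is always available. The sketch below assumes a made-up package name, fast_stats_xyz, purely to trigger the fallback; load_optional is a hypothetical helper.

```python
import importlib
import statistics


def load_optional(module_name: str):
    """Import a module by name, returning None instead of raising if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "fast_stats_xyz" is a hypothetical package name used only to trigger the fallback
fast_stats = load_optional("fast_stats_xyz")

values = [3, 5, 8]
if fast_stats is not None:
    average = fast_stats.mean(values)
else:
    average = statistics.mean(values)
    print("Optional package missing, used the standard library instead")

print(f"Average: {average}")
```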
# Reuses pd, Dict, Any and analyze_sales_data from the previous example
def safe_data_analysis(file_path: str) -> Dict[str, Any]:
    """Perform data analysis with error handling"""
    try:
        # Try to read the file
        if file_path.endswith('.csv'):
            df = pd.read_csv(file_path)
        elif file_path.endswith('.xlsx'):
            df = pd.read_excel(file_path)
        else:
            raise ValueError("Unsupported file format")
        # Validate data
        if df.empty:
            raise ValueError("File is empty")
        # Dates read from a file may arrive as text; convert them so .dt accessors work
        df['Date'] = pd.to_datetime(df['Date'])
        # Perform analysis
        analysis = analyze_sales_data(df)
        analysis['status'] = 'success'
        analysis['records_processed'] = len(df)
        return analysis
    except FileNotFoundError:
        return {'status': 'error', 'message': 'File not found'}
    except pd.errors.EmptyDataError:
        return {'status': 'error', 'message': 'File is empty'}
    except Exception as e:
        # Catch-all for anything else, such as a missing column
        return {'status': 'error', 'message': str(e)}
# Test the function
result = safe_data_analysis('sales_data.csv')
print(result)
Practice Exercise
Create a comprehensive data analysis system using multiple libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
class DataAnalysisSuite:
    """Comprehensive data analysis suite using multiple libraries"""
    def __init__(self, data_source):
        self.data_source = data_source
        self.df = None
        self.analysis_results = {}
    def load_data(self, file_path):
        """Load data from various file formats"""
        try:
            if file_path.endswith('.csv'):
                self.df = pd.read_csv(file_path)
            elif file_path.endswith('.xlsx'):
                self.df = pd.read_excel(file_path)
            else:
                raise ValueError("Unsupported file format")
            print(f"Data loaded successfully: {self.df.shape[0]} rows, {self.df.shape[1]} columns")
            return True
        except Exception as e:
            print(f"Error loading data: {e}")
            return False
    def generate_sample_data(self, n_records=1000):
        """Generate sample sales data for demonstration"""
        np.random.seed(42)
        # Generate dates
        start_date = datetime.now() - timedelta(days=n_records)
        dates = [start_date + timedelta(days=i) for i in range(n_records)]
        # Generate sample data
        data = {
            'Date': dates,
            'Sales': np.random.normal(1000, 200, n_records),
            'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
            'Product': np.random.choice(['Product A', 'Product B', 'Product C', 'Product D'], n_records),
            'Customer_ID': np.random.randint(1, 100, n_records),
            'Discount': np.random.choice([0, 0.05, 0.1, 0.15], n_records)
        }
        self.df = pd.DataFrame(data)
        print(f"Sample data generated: {self.df.shape[0]} records")
        return True
    def perform_comprehensive_analysis(self):
        """Perform comprehensive data analysis"""
        if self.df is None:
            print("No data loaded")
            return None
        analysis = {}
        # Basic statistics
        analysis['basic_stats'] = {
            'total_records': len(self.df),
            'total_sales': self.df['Sales'].sum(),
            'average_sales': self.df['Sales'].mean(),
            'median_sales': self.df['Sales'].median(),
            'std_sales': self.df['Sales'].std(),
            'min_sales': self.df['Sales'].min(),
            'max_sales': self.df['Sales'].max()
        }
        # Regional analysis
        regional_analysis = self.df.groupby('Region')['Sales'].agg([
            'sum', 'mean', 'count', 'std'
        ]).round(2)
        analysis['regional_analysis'] = regional_analysis.to_dict()
        # Product analysis
        product_analysis = self.df.groupby('Product')['Sales'].agg([
            'sum', 'mean', 'count', 'std'
        ]).round(2)
        analysis['product_analysis'] = product_analysis.to_dict()
        # Time-based analysis
        self.df['Month'] = self.df['Date'].dt.month
        self.df['Quarter'] = self.df['Date'].dt.quarter
        self.df['Year'] = self.df['Date'].dt.year
        monthly_sales = self.df.groupby('Month')['Sales'].sum()
        quarterly_sales = self.df.groupby('Quarter')['Sales'].sum()
        analysis['time_analysis'] = {
            'monthly_sales': monthly_sales.to_dict(),
            'quarterly_sales': quarterly_sales.to_dict()
        }
        # Customer analysis
        customer_analysis = self.df.groupby('Customer_ID')['Sales'].agg([
            'sum', 'count', 'mean'
        ]).round(2)
        analysis['customer_analysis'] = {
            'top_customers': customer_analysis.nlargest(10, 'sum').to_dict(),
            'total_customers': customer_analysis.shape[0]
        }
        # Discount analysis
        discount_analysis = self.df.groupby('Discount')['Sales'].agg([
            'sum', 'mean', 'count'
        ]).round(2)
        analysis['discount_analysis'] = discount_analysis.to_dict()
        self.analysis_results = analysis
        return analysis
    def create_visualizations(self):
        """Create comprehensive visualizations"""
        if self.df is None:
            print("No data loaded")
            return
        # Set up the plotting style (this style name requires matplotlib 3.6 or newer)
        plt.style.use('seaborn-v0_8')
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        fig.suptitle('Comprehensive Sales Analysis Dashboard', fontsize=16, fontweight='bold')
        # 1. Sales over time
        daily_sales = self.df.groupby('Date')['Sales'].sum()
        axes[0, 0].plot(daily_sales.index, daily_sales.values, linewidth=2)
        axes[0, 0].set_title('Sales Over Time')
        axes[0, 0].set_xlabel('Date')
        axes[0, 0].set_ylabel('Total Sales')
        axes[0, 0].tick_params(axis='x', rotation=45)
        # 2. Regional sales
        regional_sales = self.df.groupby('Region')['Sales'].sum()
        axes[0, 1].bar(regional_sales.index, regional_sales.values, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
        axes[0, 1].set_title('Sales by Region')
        axes[0, 1].set_xlabel('Region')
        axes[0, 1].set_ylabel('Total Sales')
        # 3. Product sales pie chart
        product_sales = self.df.groupby('Product')['Sales'].sum()
        axes[0, 2].pie(product_sales.values, labels=product_sales.index, autopct='%1.1f%%')
        axes[0, 2].set_title('Sales Distribution by Product')
        # 4. Sales distribution histogram
        axes[1, 0].hist(self.df['Sales'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[1, 0].set_title('Sales Distribution')
        axes[1, 0].set_xlabel('Sales Amount')
        axes[1, 0].set_ylabel('Frequency')
        # 5. Monthly sales trend
        # Derive the month here so this method also works before
        # perform_comprehensive_analysis() has added a Month column
        monthly_sales = self.df.groupby(self.df['Date'].dt.month)['Sales'].sum()
        axes[1, 1].plot(monthly_sales.index, monthly_sales.values, marker='o', linewidth=2, markersize=8)
        axes[1, 1].set_title('Monthly Sales Trend')
        axes[1, 1].set_xlabel('Month')
        axes[1, 1].set_ylabel('Total Sales')
        axes[1, 1].set_xticks(range(1, 13))
        # 6. Discount impact
        discount_impact = self.df.groupby('Discount')['Sales'].mean()
        # Use string labels so each discount level gets its own bar; numeric
        # positions 0.05 apart would overlap at the default bar width
        axes[1, 2].bar(discount_impact.index.astype(str), discount_impact.values, color='lightcoral')
        axes[1, 2].set_title('Average Sales by Discount Level')
        axes[1, 2].set_xlabel('Discount Rate')
        axes[1, 2].set_ylabel('Average Sales')
        plt.tight_layout()
        plt.show()
    def generate_report(self):
        """Generate comprehensive analysis report"""
        if not self.analysis_results:
            print("No analysis results available")
            return
        report = f"""
COMPREHENSIVE SALES ANALYSIS REPORT
{'=' * 60}
Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Data Source: {self.data_source}
BASIC STATISTICS:
- Total Records: {self.analysis_results['basic_stats']['total_records']:,}
- Total Sales: ${self.analysis_results['basic_stats']['total_sales']:,.2f}
- Average Sales: ${self.analysis_results['basic_stats']['average_sales']:,.2f}
- Median Sales: ${self.analysis_results['basic_stats']['median_sales']:,.2f}
- Standard Deviation: ${self.analysis_results['basic_stats']['std_sales']:,.2f}
- Min Sales: ${self.analysis_results['basic_stats']['min_sales']:,.2f}
- Max Sales: ${self.analysis_results['basic_stats']['max_sales']:,.2f}
TOP REGIONS BY SALES:
"""
        regional_data = self.analysis_results['regional_analysis']['sum']
        for region, sales in sorted(regional_data.items(), key=lambda x: x[1], reverse=True):
            report += f"- {region}: ${sales:,.2f}\n"
        report += "\nTOP PRODUCTS BY SALES:\n"
        product_data = self.analysis_results['product_analysis']['sum']
        for product, sales in sorted(product_data.items(), key=lambda x: x[1], reverse=True):
            report += f"- {product}: ${sales:,.2f}\n"
        report += f"\nCUSTOMER INSIGHTS:\n"
        report += f"- Total Unique Customers: {self.analysis_results['customer_analysis']['total_customers']}\n"
        report += f"- Top 3 Customers by Total Spending:\n"
        top_customers = self.analysis_results['customer_analysis']['top_customers']['sum']
        for i, (customer_id, sales) in enumerate(list(top_customers.items())[:3], 1):
            report += f" {i}. Customer {customer_id}: ${sales:,.2f}\n"
        return report
# Example usage
analyzer = DataAnalysisSuite("sample_data")
# Generate sample data
analyzer.generate_sample_data(1000)
# Perform analysis
analysis = analyzer.perform_comprehensive_analysis()
# Create visualizations
analyzer.create_visualizations()
# Generate report
report = analyzer.generate_report()
print(report)
Summary
Python libraries extend the language’s capabilities for data science. Key concepts include importing libraries, using core data science libraries like pandas and numpy, managing dependencies, and applying libraries to real-world analysis tasks. Libraries save time and provide reliable, tested tools for complex data operations.
© 2025 Prof. Tim Frenzel. All rights reserved. | Version 1.0.5