Section 3: Libraries & Imports

Why reinvent the wheel when someone else has already built the car? Python libraries are like having a team of expert programmers working for you around the clock. Instead of writing complex analysis code from scratch, you import pandas and get professional-grade tools in one line. Both approaches get you there, but one is far more efficient.

Introduction

Python’s effectiveness comes from its extensive library ecosystem. Libraries provide pre-written code for common tasks, saving you time and effort. In data science, libraries like pandas, numpy, and matplotlib are core tools that handle complex operations with simple commands.

Understanding Python Libraries

Libraries are collections of pre-written code that extend Python’s capabilities. They provide functions, classes, and tools for specific tasks.

Built-in vs External Libraries

# Built-in libraries (come with Python)
import math
import random
import datetime
import json

# External libraries (need to be installed)
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
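Because external libraries may not be installed on every machine, a common pattern is to attempt the import and fall back gracefully. A minimal sketch (the `HAS_PANDAS` flag and `backend_name` helper are illustrative names, not a standard convention):

```python
# Gracefully handle an optional external dependency: fall back to the
# standard library if pandas is not installed.
try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False

import csv  # always available: csv ships with Python

def backend_name():
    """Report which CSV-reading backend this environment provides."""
    return "pandas" if HAS_PANDAS else "csv (stdlib)"

print(backend_name())
```

This keeps your code importable everywhere while still using the faster tool when it is available.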

Why Libraries Matter in Data Science

  • Efficiency: Pre-written, optimized code
  • Reliability: Tested by thousands of developers
  • Specialization: Tools designed for specific tasks
  • Community: Active development and support
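To make the efficiency point concrete, compare a hand-written average with the standard library's `statistics` module; the library version is shorter, tested, and comes with related tools for free (a small illustrative comparison):

```python
import statistics

# Hand-rolled mean: easy to get subtly wrong (empty lists, off-by-one, etc.)
def manual_mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]
assert manual_mean(data) == statistics.mean(data) == 5

# The library also provides tested tools the manual version lacks:
print(statistics.median(data))  # 4.5
print(statistics.pstdev(data))  # 2.0 (population standard deviation)
```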

Importing Libraries

Python provides several ways to import libraries and their functions.

Basic Import Methods

# Import entire library
import math
result = math.sqrt(16)  # 4.0

# Import specific functions
from math import sqrt, pi
result = sqrt(16)  # 4.0
print(pi)  # 3.141592653589793

# Import with alias
import math as m
result = m.sqrt(16)  # 4.0

# Import all functions (not recommended)
from math import *
result = sqrt(16)  # 4.0
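A quick sketch of why star imports are discouraged: `from math import *` dumps every public name from `math` into your namespace and can silently shadow a name you defined yourself (the `pow` helper below is a hypothetical example):

```python
# Define our own helper named pow
def pow(base, exp):
    return f"my pow: {base ** exp}"

from math import *  # math.pow now silently replaces our function

result = pow(2, 3)  # calls math.pow, not our helper
print(result)       # 8.0 (a float, not our string)
```

With `import math` instead, both `pow` and `math.pow` stay distinct, which is why explicit imports are the safer default.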

Import Best Practices

# Good: Import at the top of file
import os
import json
from datetime import datetime

# Good: Use descriptive aliases
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Good: Group related imports
# Standard library imports
import os
import json
from datetime import datetime

# Third-party imports
import pandas as pd
import numpy as np

# Local imports
from my_module import my_function

Core Data Science Libraries

NumPy - Numerical Computing

NumPy provides high-performance array operations and mathematical functions.

import numpy as np

# Create arrays
numbers = np.array([1, 2, 3, 4, 5])
print(f"Array: {numbers}")
print(f"Type: {type(numbers)}")

# Array operations
print(f"Sum: {np.sum(numbers)}")
print(f"Mean: {np.mean(numbers)}")
print(f"Max: {np.max(numbers)}")
print(f"Min: {np.min(numbers)}")

# Mathematical operations
squared = numbers ** 2
print(f"Squared: {squared}")

# Array creation
zeros = np.zeros(5)
ones = np.ones(5)
range_array = np.arange(0, 10, 2)
print(f"Zeros: {zeros}")
print(f"Ones: {ones}")
print(f"Range: {range_array}")

# 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Matrix:\n{matrix}")
print(f"Shape: {matrix.shape}")

Pandas - Data Manipulation

Pandas provides specialized tools for working with structured data.

import pandas as pd

# Create DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# Basic operations
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")

# Accessing data
print(f"\nFirst 2 rows:\n{df.head(2)}")
print(f"\nAges:\n{df['Age']}")
print(f"\nIT Department:\n{df[df['Department'] == 'IT']}")

# Statistical summary
print(f"\nSummary statistics:\n{df.describe()}")

Matplotlib - Data Visualization

Matplotlib creates charts and graphs for data visualization.

import matplotlib.pyplot as plt
import numpy as np

# Create sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='sin(x)', color='blue', linewidth=2)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Sine Wave')
plt.legend()
plt.grid(True)
plt.show()

# Bar chart example
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]

plt.figure(figsize=(8, 6))
plt.bar(categories, values, color=['red', 'green', 'blue', 'orange'])
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Sample Bar Chart')
plt.show()

Working with External Libraries

Installing Libraries

# Install single library
pip install pandas

# Install multiple libraries
pip install pandas numpy matplotlib

# Install specific version
pip install pandas==1.5.0

# Install from requirements file
pip install -r requirements.txt

Requirements File

Create a requirements.txt file to manage dependencies:

pandas>=1.5.0
numpy>=1.21.0
matplotlib>=3.5.0
seaborn>=0.11.0
scikit-learn>=1.0.0
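Rather than writing requirements.txt by hand, you can snapshot the exact versions installed in your current environment with `pip freeze` (invoked here via `python -m pip`, assuming `python` is on your PATH):

```shell
# Snapshot the exact versions installed in the active environment
python -m pip freeze > requirements.txt

# Inspect the pinned versions
head requirements.txt
```

Note that `pip freeze` pins exact versions (`pandas==1.5.0`), while a hand-written file like the one above can use minimum-version constraints (`pandas>=1.5.0`).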

Virtual Environments

# Create virtual environment
python -m venv myenv

# Activate (Windows)
myenv\Scripts\activate

# Activate (Mac/Linux)
source myenv/bin/activate

# Install packages in virtual environment
pip install pandas numpy matplotlib

# Deactivate
deactivate
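You can also check which versions are installed from inside Python using the standard library's `importlib.metadata` (available since Python 3.8; `installed_version` is an illustrative helper, not part of the library):

```python
# Look up installed package versions at runtime.
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("pip"))                 # e.g. '24.0'
print(installed_version("no-such-package-xyz")) # None
```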

Data Science Workflow with Libraries

Complete Analysis Example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Create sample sales data
np.random.seed(42)  # For reproducible results
dates = pd.date_range('2024-01-01', periods=100, freq='D')
sales_data = {
    'Date': dates,
    'Sales': np.random.normal(1000, 200, 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Product': np.random.choice(['A', 'B', 'C'], 100)
}

# Create DataFrame
df = pd.DataFrame(sales_data)

# Data exploration
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nFirst 5 rows:")
print(df.head())

# Basic statistics
print(f"\nSales Statistics:")
print(df['Sales'].describe())

# Regional analysis
regional_sales = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
print(f"\nRegional Analysis:")
print(regional_sales)

# Product analysis
product_sales = df.groupby('Product')['Sales'].sum()
print(f"\nProduct Sales:")
print(product_sales)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Sales over time
axes[0, 0].plot(df['Date'], df['Sales'])
axes[0, 0].set_title('Sales Over Time')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')

# Regional sales
regional_sales['sum'].plot(kind='bar', ax=axes[0, 1])
axes[0, 1].set_title('Total Sales by Region')
axes[0, 1].set_xlabel('Region')
axes[0, 1].set_ylabel('Total Sales')

# Product sales
product_sales.plot(kind='pie', ax=axes[1, 0])
axes[1, 0].set_title('Sales Distribution by Product')

# Sales histogram
axes[1, 1].hist(df['Sales'], bins=20, alpha=0.7)
axes[1, 1].set_title('Sales Distribution')
axes[1, 1].set_xlabel('Sales')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

Advanced Library Usage

Custom Data Analysis Functions

import pandas as pd
import numpy as np
from typing import List, Dict, Any

def analyze_sales_data(df: pd.DataFrame) -> Dict[str, Any]:
    """Comprehensive sales data analysis"""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    analysis = {}
    
    # Basic metrics
    analysis['total_sales'] = df['Sales'].sum()
    analysis['average_sales'] = df['Sales'].mean()
    analysis['max_sales'] = df['Sales'].max()
    analysis['min_sales'] = df['Sales'].min()
    
    # Growth analysis
    df_sorted = df.sort_values('Date')
    df_sorted['Sales_Growth'] = df_sorted['Sales'].pct_change() * 100
    analysis['average_growth'] = df_sorted['Sales_Growth'].mean()
    
    # Regional performance
    regional_analysis = df.groupby('Region')['Sales'].agg([
        'sum', 'mean', 'count'
    ]).round(2)
    analysis['regional_performance'] = regional_analysis.to_dict()
    
    # Product performance
    product_analysis = df.groupby('Product')['Sales'].agg([
        'sum', 'mean', 'count'
    ]).round(2)
    analysis['product_performance'] = product_analysis.to_dict()
    
    # Time-based analysis
    df['Month'] = df['Date'].dt.month
    monthly_sales = df.groupby('Month')['Sales'].sum()
    analysis['monthly_sales'] = monthly_sales.to_dict()
    
    return analysis

# Use the function
analysis_results = analyze_sales_data(df)
print("Analysis Results:")
for key, value in analysis_results.items():
    print(f"{key}: {value}")

Error Handling with Libraries

def safe_data_analysis(file_path: str) -> Dict[str, Any]:
    """Perform data analysis with error handling"""
    try:
        # Try to read the file
        if file_path.endswith('.csv'):
            df = pd.read_csv(file_path)
        elif file_path.endswith('.xlsx'):
            df = pd.read_excel(file_path)
        else:
            raise ValueError("Unsupported file format")
        
        # Validate data
        if df.empty:
            raise ValueError("File is empty")
        
        # Perform analysis
        analysis = analyze_sales_data(df)
        analysis['status'] = 'success'
        analysis['records_processed'] = len(df)
        
        return analysis
        
    except FileNotFoundError:
        return {'status': 'error', 'message': 'File not found'}
    except pd.errors.EmptyDataError:
        return {'status': 'error', 'message': 'File is empty'}
    except Exception as e:
        return {'status': 'error', 'message': str(e)}

# Test the function
result = safe_data_analysis('sales_data.csv')
print(result)

Practice Exercise

Create a comprehensive data analysis system using multiple libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

class DataAnalysisSuite:
    """Comprehensive data analysis suite using multiple libraries"""
    
    def __init__(self, data_source):
        self.data_source = data_source
        self.df = None
        self.analysis_results = {}
    
    def load_data(self, file_path):
        """Load data from various file formats"""
        try:
            if file_path.endswith('.csv'):
                self.df = pd.read_csv(file_path)
            elif file_path.endswith('.xlsx'):
                self.df = pd.read_excel(file_path)
            else:
                raise ValueError("Unsupported file format")
            
            print(f"Data loaded successfully: {self.df.shape[0]} rows, {self.df.shape[1]} columns")
            return True
            
        except Exception as e:
            print(f"Error loading data: {e}")
            return False
    
    def generate_sample_data(self, n_records=1000):
        """Generate sample sales data for demonstration"""
        np.random.seed(42)
        
        # Generate dates
        start_date = datetime.now() - timedelta(days=n_records)
        dates = [start_date + timedelta(days=i) for i in range(n_records)]
        
        # Generate sample data
        data = {
            'Date': dates,
            'Sales': np.random.normal(1000, 200, n_records),
            'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
            'Product': np.random.choice(['Product A', 'Product B', 'Product C', 'Product D'], n_records),
            'Customer_ID': np.random.randint(1, 100, n_records),
            'Discount': np.random.choice([0, 0.05, 0.1, 0.15], n_records)
        }
        
        self.df = pd.DataFrame(data)
        print(f"Sample data generated: {self.df.shape[0]} records")
        return True
    
    def perform_comprehensive_analysis(self):
        """Perform comprehensive data analysis"""
        if self.df is None:
            print("No data loaded")
            return None
        
        analysis = {}
        
        # Basic statistics
        analysis['basic_stats'] = {
            'total_records': len(self.df),
            'total_sales': self.df['Sales'].sum(),
            'average_sales': self.df['Sales'].mean(),
            'median_sales': self.df['Sales'].median(),
            'std_sales': self.df['Sales'].std(),
            'min_sales': self.df['Sales'].min(),
            'max_sales': self.df['Sales'].max()
        }
        
        # Regional analysis
        regional_analysis = self.df.groupby('Region')['Sales'].agg([
            'sum', 'mean', 'count', 'std'
        ]).round(2)
        analysis['regional_analysis'] = regional_analysis.to_dict()
        
        # Product analysis
        product_analysis = self.df.groupby('Product')['Sales'].agg([
            'sum', 'mean', 'count', 'std'
        ]).round(2)
        analysis['product_analysis'] = product_analysis.to_dict()
        
        # Time-based analysis
        self.df['Month'] = self.df['Date'].dt.month
        self.df['Quarter'] = self.df['Date'].dt.quarter
        self.df['Year'] = self.df['Date'].dt.year
        
        monthly_sales = self.df.groupby('Month')['Sales'].sum()
        quarterly_sales = self.df.groupby('Quarter')['Sales'].sum()
        
        analysis['time_analysis'] = {
            'monthly_sales': monthly_sales.to_dict(),
            'quarterly_sales': quarterly_sales.to_dict()
        }
        
        # Customer analysis
        customer_analysis = self.df.groupby('Customer_ID')['Sales'].agg([
            'sum', 'count', 'mean'
        ]).round(2)
        analysis['customer_analysis'] = {
            'top_customers': customer_analysis.nlargest(10, 'sum').to_dict(),
            'total_customers': customer_analysis.shape[0]
        }
        
        # Discount analysis
        discount_analysis = self.df.groupby('Discount')['Sales'].agg([
            'sum', 'mean', 'count'
        ]).round(2)
        analysis['discount_analysis'] = discount_analysis.to_dict()
        
        self.analysis_results = analysis
        return analysis
    
    def create_visualizations(self):
        """Create comprehensive visualizations"""
        if self.df is None:
            print("No data loaded")
            return
        
        # Set up the plotting style
        plt.style.use('seaborn-v0_8')
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        fig.suptitle('Comprehensive Sales Analysis Dashboard', fontsize=16, fontweight='bold')
        
        # 1. Sales over time
        daily_sales = self.df.groupby('Date')['Sales'].sum()
        axes[0, 0].plot(daily_sales.index, daily_sales.values, linewidth=2)
        axes[0, 0].set_title('Sales Over Time')
        axes[0, 0].set_xlabel('Date')
        axes[0, 0].set_ylabel('Total Sales')
        axes[0, 0].tick_params(axis='x', rotation=45)
        
        # 2. Regional sales
        regional_sales = self.df.groupby('Region')['Sales'].sum()
        axes[0, 1].bar(regional_sales.index, regional_sales.values, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
        axes[0, 1].set_title('Sales by Region')
        axes[0, 1].set_xlabel('Region')
        axes[0, 1].set_ylabel('Total Sales')
        
        # 3. Product sales pie chart
        product_sales = self.df.groupby('Product')['Sales'].sum()
        axes[0, 2].pie(product_sales.values, labels=product_sales.index, autopct='%1.1f%%')
        axes[0, 2].set_title('Sales Distribution by Product')
        
        # 4. Sales distribution histogram
        axes[1, 0].hist(self.df['Sales'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[1, 0].set_title('Sales Distribution')
        axes[1, 0].set_xlabel('Sales Amount')
        axes[1, 0].set_ylabel('Frequency')
        
        # 5. Monthly sales trend
        monthly_sales = self.df.groupby('Month')['Sales'].sum()
        axes[1, 1].plot(monthly_sales.index, monthly_sales.values, marker='o', linewidth=2, markersize=8)
        axes[1, 1].set_title('Monthly Sales Trend')
        axes[1, 1].set_xlabel('Month')
        axes[1, 1].set_ylabel('Total Sales')
        axes[1, 1].set_xticks(range(1, 13))
        
        # 6. Discount impact
        discount_impact = self.df.groupby('Discount')['Sales'].mean()
        axes[1, 2].bar(discount_impact.index, discount_impact.values, color='lightcoral')
        axes[1, 2].set_title('Average Sales by Discount Level')
        axes[1, 2].set_xlabel('Discount Rate')
        axes[1, 2].set_ylabel('Average Sales')
        
        plt.tight_layout()
        plt.show()
    
    def generate_report(self):
        """Generate comprehensive analysis report"""
        if not self.analysis_results:
            print("No analysis results available")
            return
        
        report = f"""
COMPREHENSIVE SALES ANALYSIS REPORT
{'=' * 60}
Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Data Source: {self.data_source}

BASIC STATISTICS:
- Total Records: {self.analysis_results['basic_stats']['total_records']:,}
- Total Sales: ${self.analysis_results['basic_stats']['total_sales']:,.2f}
- Average Sales: ${self.analysis_results['basic_stats']['average_sales']:,.2f}
- Median Sales: ${self.analysis_results['basic_stats']['median_sales']:,.2f}
- Standard Deviation: ${self.analysis_results['basic_stats']['std_sales']:,.2f}
- Min Sales: ${self.analysis_results['basic_stats']['min_sales']:,.2f}
- Max Sales: ${self.analysis_results['basic_stats']['max_sales']:,.2f}

TOP REGIONS BY SALES:
"""
        
        regional_data = self.analysis_results['regional_analysis']['sum']
        for region, sales in sorted(regional_data.items(), key=lambda x: x[1], reverse=True):
            report += f"- {region}: ${sales:,.2f}\n"
        
        report += "\nTOP PRODUCTS BY SALES:\n"
        product_data = self.analysis_results['product_analysis']['sum']
        for product, sales in sorted(product_data.items(), key=lambda x: x[1], reverse=True):
            report += f"- {product}: ${sales:,.2f}\n"
        
        report += f"\nCUSTOMER INSIGHTS:\n"
        report += f"- Total Unique Customers: {self.analysis_results['customer_analysis']['total_customers']}\n"
        report += f"- Top 3 Customers by Total Spending:\n"
        
        top_customers = self.analysis_results['customer_analysis']['top_customers']['sum']
        for i, (customer_id, sales) in enumerate(list(top_customers.items())[:3], 1):
            report += f"  {i}. Customer {customer_id}: ${sales:,.2f}\n"
        
        return report

# Example usage
analyzer = DataAnalysisSuite("sample_data")

# Generate sample data
analyzer.generate_sample_data(1000)

# Perform analysis
analysis = analyzer.perform_comprehensive_analysis()

# Create visualizations
analyzer.create_visualizations()

# Generate report
report = analyzer.generate_report()
print(report)

Resources

  • Python library documentation: https://docs.python.org/3/library/
  • Pandas documentation: https://pandas.pydata.org/docs/
  • NumPy documentation: https://numpy.org/doc/
  • Matplotlib documentation: https://matplotlib.org/stable/
  • Seaborn documentation: https://seaborn.pydata.org/

Summary

Python libraries extend the language’s capabilities for data science. Key concepts include importing libraries, using core data science libraries like pandas and numpy, managing dependencies, and applying libraries to real-world analysis tasks. Libraries save time and provide reliable, tested tools for complex data operations.


© 2025 Prof. Tim Frenzel. All rights reserved. | Version 1.0.5