
Python for Data Science and Machine Learning: A Comprehensive Guide

Data Science and Machine Learning have become essential skills in modern software development. Python’s rich ecosystem of libraries and frameworks makes it the perfect language for data analysis, visualization, and building AI models.

In this comprehensive guide, we’ll explore how to use Python’s most popular data science libraries and implement common machine learning algorithms.


Key Topics

  1. Data Analysis: NumPy and Pandas
  2. Data Visualization: Matplotlib and Seaborn
  3. Machine Learning: Scikit-learn
  4. Deep Learning: TensorFlow and Keras
  5. Model Deployment: Flask and FastAPI

1. Data Analysis with NumPy and Pandas

Master the fundamental libraries for data manipulation.

NumPy Basics

# @filename: main.py
import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 3))
arr3 = np.ones((2, 4))
arr4 = np.random.rand(3, 3)

# Array operations
print(arr1 * 2)          # Element-wise multiplication
print(arr1.mean())       # Mean
print(arr1.std())        # Standard deviation
print(arr1.reshape(5,1)) # Reshape array

# Matrix operations
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

print(matrix1.dot(matrix2))  # Matrix multiplication
print(np.linalg.inv(matrix1))  # Matrix inverse
print(np.linalg.det(matrix1))  # Determinant

Pandas Data Analysis

# @filename: main.py
import pandas as pd

# Reading data
df = pd.read_csv('data.csv')

# Basic operations
print(df.head())        # First 5 rows
print(df.describe())    # Statistical summary
print(df.info())        # DataFrame info

# Data cleaning (dropping and filling are alternative strategies; pick one)
df = df.dropna()                                # Remove rows with missing values
# df = df.fillna(df.mean(numeric_only=True))    # Or: fill missing values with column means
df = pd.get_dummies(df, columns=['category'])   # One-hot encoding

# Data manipulation
# Group by and aggregate
grouped = df.groupby('category').agg({
    'price': ['mean', 'min', 'max'],
    'quantity': 'sum'
})

# Merging dataframes (assuming df1 and df2 share an 'id' column)
df_merged = pd.merge(df1, df2, on='id', how='left')

# Time series analysis
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
monthly = df.resample('ME').mean()  # 'ME' = month-end; use 'M' on pandas < 2.2

# Complex operations
def custom_function(x):
    return x.mean() if x.dtype == 'float64' else x.mode()[0]

result = df.groupby('category').agg(custom_function)

2. Data Visualization

Create insightful visualizations of your data.

Matplotlib Plotting

# @filename: main.py
import matplotlib.pyplot as plt

# Basic plotting (x and y are assumed to be numeric sequences of equal length)
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', label='Data')
plt.title('Simple Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()

# Multiple subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(x1, y1, 'r-')
ax1.set_title('Plot 1')

ax2.scatter(x2, y2)
ax2.set_title('Plot 2')

plt.tight_layout()
plt.show()

Seaborn Visualization

# @filename: main.py
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")

# Distribution plot
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='value', hue='category', multiple="stack")
plt.title('Distribution by Category')
plt.show()

# Complex visualizations
# Pair plot
sns.pairplot(df, hue='category', diag_kind='kde')
plt.show()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='category', y='value', data=df)
plt.title('Value Distribution by Category')
plt.xticks(rotation=45)
plt.show()

3. Machine Learning with Scikit-learn

Implement common machine learning algorithms.

Classification Example

# @filename: main.py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


# Prepare data
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate model
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

Regression Example

# @filename: main.py
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Prepare data
X = df[['feature1', 'feature2']]
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train model
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Make predictions
y_pred = model.predict(X_test_poly)

# Evaluate model
print("R² Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

4. Deep Learning with TensorFlow

Build and train neural networks.

Neural Network Implementation

# @filename: main.py
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# Build model
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(num_features,)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Define callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

# Train model
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping]
)

# Evaluate model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

# Plot training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

Convolutional Neural Network (CNN)

# @filename: main.py
from tensorflow.keras import layers, models

# Build CNN model
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train model
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=64,
    validation_split=0.2
)

5. Model Deployment

Deploy your models using web frameworks.
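
Both APIs below load a serialized model and scaler from disk at startup. As a minimal sketch of how those artifacts could be produced (here trained on synthetic data as a stand-in for your real training set; the `model.pkl` and `scaler.pkl` filenames are just conventions matching the APIs below):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data as a stand-in for a real training set
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Fit the scaler and the model exactly as they will be used at serving time
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_scaled, y)

# Serialize both artifacts so a web app can load them at startup
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
```

Saving the scaler alongside the model matters: the API must apply exactly the same preprocessing at inference time as was used during training.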

Flask API

# @filename: main.py
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        features = pd.DataFrame([data])

        # Preprocess
        features_scaled = scaler.transform(features)

        # Make prediction
        prediction = model.predict(features_scaled)

        return jsonify({
            'prediction': prediction.tolist(),
            'status': 'success'
        })

    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

if __name__ == '__main__':
    app.run(debug=True)

FastAPI Implementation

# @filename: main.py
import joblib
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    probability: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    try:
        features = np.array(request.features).reshape(1, -1)
        features_scaled = scaler.transform(features)

        prediction = model.predict(features_scaled)[0]
        probability = model.predict_proba(features_scaled)[0].max()

        return PredictionResponse(
            prediction=float(prediction),
            probability=float(probability)
        )

    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

Best Practices

  1. Data Preprocessing

    • Handle missing values appropriately
    • Scale features when needed
    • Split data properly
    • Validate assumptions
  2. Model Development

    • Start with simple models
    • Use cross-validation
    • Monitor for overfitting
    • Document your process
  3. Model Evaluation

    • Use appropriate metrics
    • Consider business impact
    • Validate on test set
    • Monitor performance
  4. Deployment

    • Version your models
    • Monitor in production
    • Handle errors gracefully
    • Scale appropriately
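
Several of these practices (scale features, use cross-validation, avoid leakage between training and validation data) come together in scikit-learn's Pipeline. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dataset as a stand-in for real data
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# A Pipeline refits the scaler on each training fold only,
# so no information from the validation fold leaks into preprocessing
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Scaling inside the pipeline, rather than once on the full dataset, is what makes the cross-validation scores an honest estimate of generalization.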

Conclusion

Python’s data science and machine learning ecosystem provides powerful tools for analyzing data and building AI models. By mastering these libraries and following best practices, you can:

  • Analyze complex datasets effectively
  • Build accurate predictive models
  • Deploy models to production
  • Make data-driven decisions

Remember to start with the basics and gradually move to more complex techniques. Focus on understanding your data and choosing the right tools for your specific use case.
