Transforming Corporate Reporting with Google Cloud AI

Executive Summary

In today’s data-driven financial landscape, extracting meaningful insights from unstructured corporate documents remains a significant challenge. Our AI-Powered ESG & Financial Analysis Platform addresses this by automatically processing PDF reports from major corporations, extracting critical financial and ESG (Environmental, Social, and Governance) metrics, and generating intelligent forecasts using AI technologies in matter of minutes.
What can take days to read the PDFs and analyze, now takes a few minutes to process 1000s of pages.
The idea is to unlock insights from unstructured data formats like PDFs, documents, and images that are traditionally walled off from primary analytics workflows. Our solution directly tackles this challenge by creating an end-to-end pipeline that transforms corporate PDF reports into actionable intelligence.

Problem Statement

Corporate annual reports and ESG sustainability reports contain vast amounts of critical information, but this data is typically locked away in PDF formats that are difficult to analyze systematically. Financial analysts, ESG researchers, and investment professionals spend countless hours manually extracting key metrics from these documents.
Key Pain Points:

  • Manual Data Extraction: Analysts manually read through hundreds of pages to find specific metrics
  • Inconsistent Reporting: Different companies use varying formats and terminology
  • Time-Intensive Analysis: What should take minutes often takes hours or days
  • Limited Forecasting: Static historical data without predictive insights
  • Scalability Issues: Processing multiple companies across years is resource-intensive

This solution delivers 90% reduction in analyst workload, enables real-time ESG risk monitoring across investment portfolios, and provides predictive financial forecasting. The platform transforms weeks of manual analysis into minutes of automated insights, reducing operational costs by up to 75% for financial institutions.

Solution Architecture

Our solution leverages BigQuery’s comprehensive AI capabilities through a multi-stage processing pipeline:

  • OBJECT_TABLE for storing metadata index for unstructured data objects specified in Cloud Storage bucket.
  • ML.GENERATE_TEXT with Gemini 2.5 Pro for fetching document data and metrics from all reports (saved as response_text)
  • AI.GENERATE for interactive query processing and detailed analysis from unstructured text (response_text) generated from above
  • AI.GENERATE_TABLE for structured data extraction from unstructured text (response_text) – saved as metrics_table
  • AI.FORECAST with TimesFM 2.0 for predictive revenue modeling from metrics_table
  • BIGFRAMES for analyzing the structured data and plotting graphs

Data Processing Workflow

  • External Bigquery tables to connect to Google Cloud Storage (gs://report_insights)
  • Automatic metadata extraction and file categorization
  • 52 total files processed from pharmaceutical companies: Novartis, GlaxoSmithKline, Pfizer, Amgen
  • Flexible company selection system supporting dynamic analysis

SELECT
  *
FROM
  ML.GENERATE_TEXT(
    MODEL `{QUALIFIED_MODEL_ID}`,
    TABLE `{QUALIFIED_OBJ_TABLE_ID}`,
    STRUCT(
      '''
        You are an expert ESG and Financial analyst. Use only the information provided in the document to answer. Fetch a very detailed financial sustainability details and metrics from text, tabular and image data for each PDF. 
      ''' AS prompt,
      0 AS temperature,
      8092 AS max_output_tokens   ) ));

Processing Result:

  • 52 corporate documents successfully analyzed
  • Comprehensive prompt engineering for both financial and ESG metrics
  • Context-aware processing: Financial focus for annual reports, ESG focus for sustainability reports
  • Creates all_reports_ai_text_raw table with initial AI analysis

SELECT
  *,
  REGEXP_REPLACE(
    ARRAY_TO_STRING(
      ARRAY(
        SELECT JSON_VALUE(part, '$.text')
        FROM UNNEST(JSON_QUERY_ARRAY(
          JSON_QUERY(ml_generate_text_result, '$.candidates[0].content.parts')
        )) AS part
      ),
      ''
    ),
    r'```(json)?|```',
    ''
  ) AS response_text
FROM `{QUALIFIED_RAW_TABLE_ID}`;
  • Advanced JSON response parsing from complex AI model outputs
  • REGEX-based cleaning to remove markdown artifacts and formatting
  • Text normalization ensuring data quality
  • Creates all_reports_ai_text_curated table with clean response text

For example: the question_block can be a huge list of varied questions related to Finances and/or ESG

SELECT
    response_text,
    AI.GENERATE(
    (
      CONCAT({question_block}),
    response_text),
    connection_id => '{PROJECT_LOCATION}.{CLOUD_RES_CONN}',
  endpoint => '{MODEL_ENDPOINT}'
).result
FROM `{QUALIFIED_CURATED_TABLE_ID}`
WHERE (uri like '%{selected_company}%')
AND uri like '%{selected_year}%'
AND uri like '%{selected_report_type}%';

Processing Result:

  • Interactive Intelligence: Answers 50+ predefined financial and ESG questions per document
  • Dynamic Query System: Users can input custom questions for specific analysis
  • Contextual Processing: Different question sets for Annual vs. Sustainability reports
  • Real-time Analysis: Live demonstration shows detailed GSK 2024 financial analysis

   SELECT *
    FROM AI.GENERATE_TABLE(
        MODEL `{QUALIFIED_MODEL_ID}`,
        (
            SELECT CONCAT(
                "You are an expert financial analyst and ESG/Climate analyst. Extract all financial metrics and sustainability metrics and KPIs from the provided text into structured data. ",
                "TEXT: ",
                response_text
            ) AS prompt,
            response_text as extracted_text,
            uri AS uri
        FROM `{QUALIFIED_CURATED_TABLE_ID}`
        ),
        STRUCT(
            'uris STRING, report_type STRING, company_name STRING, fiscal_year INT64, revenue_millions FLOAT64, net_income_millions FLOAT64, total_assets_millions FLOAT64, total_liabilities_millions FLOAT64, equity_millions FLOAT64, eps_basic FLOAT64,....'
            AS output_schema,
            8192 AS max_output_tokens,
            0 AS temperature
        ) 

Processing Result:

  • Processing Time: 3-4 minutes for complete dataset structuring
  • Schema: 50+ structured fields including revenue, assets, emissions, employee metrics
  • Data Normalization: Automatic currency conversion and number formatting
  • Company Detection: Intelligent company name standardization and fiscal year extraction

  • Intelligent Separation: Financial metrics active only for Annual reports, ESG metrics only for Sustainability reports
  • Data Integrity: Prevents contamination between report types using conditional CASE statements
  • Zero-Value Logic: Strategic assignment ensures clean metric categorization
  • Creates final metrics table: all_reports_metrics with properly categorized data

SELECT
  *
FROM
  AI.FORECAST(
    TABLE `{QUALIFIED_METRICS_TABLE_ID}`,
    data_col => 'revenue_millions',
    timestamp_col => 'fiscal_year_date',
    model => 'TimesFM 2.0',
    id_cols => ['company_name'],
    horizon => 5,
    confidence_level => .95
  )

Processing Result:

  • 5-year Revenue Forecasting: TimesFM 2.0 generates predictive models with 95% confidence intervals
  • Multi-company Modeling: Separate forecasts for each pharmaceutical company
  • Automated Trend Analysis: Pattern recognition across historical data

forecast_bf_query = f"""
SELECT *
FROM `{QUALIFIED_FORECAST_TABLE_ID}`
"""
forecast_df = bpd.read_gbq(forecast_bf_query)


# Group by company and timestamp, take mean (if multiple values exist per date)
company_forecasts = (
    forecast_df[["company_name", "forecast_timestamp", "forecast_value"]]
    .groupby(["company_name", "forecast_timestamp"])
    .mean()
    .reset_index()
)

Architecture Overview

Technical Implementation

Results and Impact

Business Value Proposition

Innovation Highlights

Technical Architecture Deep Dive

Future Roadmap

Conclusion