Sitemap Strategy for Content-Heavy API-First Sites

1 min read
Cover image for Untitled Paper

Most developers treat a sitemap as a checkbox. Generate the XML, submit it to Search Console, move on. For a site with 20 static pages, that's fine.

For an API-first platform where content is dynamic, user-generated, and growing that approach breaks down fast. A stale sitemap, a sitemap that includes non-indexable pages, or no sitemap at all means slower indexation, wasted crawl budget, and content that exists in your database but never appears in search results.

A good sitemap strategy is an active part of your SEO infrastructure, not a one-time setup task.


Why Sitemaps Matter More on API-First Sites

On a static site, Google can discover all your pages by following links. Your nav links to your blog, your blog lists all your posts, your posts link to each other. Crawlers follow the trail.

On an API-first platform, content discovery is more complex. User profiles, project pages, and individual papers don't necessarily appear in a navigable link trail from your homepage. A paper by a user with no external backlinks might only be discoverable if Google crawls their profile page, finds the project, expands a collection, and reaches the paper assuming all of that is server-rendered HTML, not client-side JavaScript.

A sitemap short-circuits all of this. It's a direct list of URLs you want Google to know about, no crawling required. For platforms with user-generated content, it's not optional it's the primary way Google learns what exists.


Segmented Sitemaps vs. One Big Sitemap

The naive approach is one sitemap with all your URLs. This works up to 50,000 URLs (Google's per-sitemap limit), but it creates monitoring problems even before you hit that limit.

If your one sitemap has 10,000 URLs and Search Console shows that 6,000 are indexed and 4,000 are not, you have no idea whether the non-indexed URLs are marketing pages, user profiles, or papers. You can't diagnose the problem.

Segmented sitemaps solve this. Split your sitemap by content type:

  • /sitemaps/pages.xml static marketing and product pages
  • /sitemaps/profiles.xml public user profile pages
  • /sitemaps/projects.xml public project pages
  • /sitemaps/blog.xml blog posts

Then create a sitemap index that references all of them:

<!-- /sitemap-index.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://whitepaper.so/sitemaps/pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://whitepaper.so/sitemaps/profiles.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://whitepaper.so/sitemaps/projects.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://whitepaper.so/sitemaps/blog.xml</loc>
  </sitemap>
</sitemapindex>

Submit the index URL to Search Console. Now you can monitor indexation rates per segment and diagnose problems by content type.


What to Include (and What Not to Include)

A sitemap is not a complete inventory of every URL on your site. It's a curated list of pages you want Google to index. Including the wrong pages wastes crawl budget and can confuse Google's indexation decisions.

Include:

  • All public-facing pages with real, unique content
  • User profiles where the user has at least one published piece of content
  • Project pages that are set to public
  • All published blog posts
  • Key marketing pages (features, pricing, use cases, compare)

Exclude:

  • Pages behind authentication
  • Pages with noindex meta tags
  • Pagination pages for the same content (only include page 1, optionally)
  • Draft or unpublished content
  • Admin or settings pages
  • 404 and error pages
  • Redirect target URLs that themselves redirect elsewhere

For user-generated content, apply a quality filter before including in the sitemap. A profile page with zero published content doesn't deserve to be indexed. Include profiles only once they have at least one public paper or project.


Generating Sitemaps Dynamically From Your API

On an API-first platform, sitemaps must be generated programmatically. They can't be static files because the content changes constantly.

In Astro, you create a server-rendered endpoint that fetches your content and outputs XML:

// src/pages/sitemaps/projects.xml.ts
import type { APIRoute } from 'astro';

export const GET: APIRoute = async () => {
  // Fetch all public projects from your API
  const projects = await fetchPublicProjects();

  const entries = projects
    .filter(p => p.isPublic && p.paperCount > 0)
    .map(p => `
      <url>
        <loc>https://whitepaper.so/${p.handle}/p/${p.slug}</loc>
        <lastmod>${new Date(p.updatedAt).toISOString().split('T')[0]}</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.7</priority>
      </url>
    `).join('');

  const xml = `<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      ${entries}
    </urlset>`;

  return new Response(xml, {
    headers: { 'Content-Type': 'application/xml' }
  });
};

This endpoint is called every time a bot (or you) requests the sitemap. It always reflects current content.

For large datasets, add caching. Generate the sitemap XML on a schedule (every hour, every day) and serve the cached version. Regenerating a sitemap for 50,000 URLs on every request is expensive.


The lastmod Field and Crawl Prioritization

The <lastmod> field in a sitemap entry tells Google when this URL was last updated. Google uses this to prioritize which pages to recrawl first recently updated content gets recrawled faster.

Always include lastmod, and always use real dates from your content's updatedAt timestamp. Don't fake it or hardcode today's date on all entries Google learns to ignore lastmod values it can't trust, and it can detect when all your entries have the same date.

When you update a paper or project, the updatedAt timestamp changes. Your sitemap picks this up automatically on the next generation. Google sees the changed lastmod, recrawls the page sooner, and picks up the new content.


Robots.txt Must Reference Your Sitemap

Your robots.txt file should reference your sitemap index:

User-agent: *
Allow: /

Sitemap: https://whitepaper.so/sitemap-index.xml

This tells every crawler not just Googlebot where to find your sitemap. Even if you've submitted it manually to Search Console, the robots.txt reference ensures other crawlers (Bingbot, DuckDuckBot) can find it too.

Also use robots.txt to block paths that should never be crawled authenticated dashboards, API endpoints, internal admin paths. This protects your crawl budget for pages that actually matter.


Monitoring Your Sitemap in Search Console

Submitting a sitemap is the beginning, not the end. Search Console's Index Coverage report shows you:

  • How many URLs from your sitemap are indexed
  • How many are excluded and why
  • How many have errors

Check this monthly. A sitemap with 2,000 submitted URLs and only 400 indexed means 1,600 pages Google has seen but decided not to index. That's worth investigating. Common reasons:

  • "Discovered, currently not indexed" Google knows about it but hasn't crawled it yet. Usually resolves with time for new content.
  • "Crawled, currently not indexed" Google crawled it and decided not to index it. Usually a content quality or thin content issue.
  • "Duplicate without user-selected canonical" Google found a URL variant it considers the same as another page and chose the other version. Your canonical tags or redirect rules need fixing.

Each of these has a different fix. The segmented sitemap structure makes it easy to see which content types have the most problems.


Keeping Sitemaps Current

The biggest operational failure with sitemaps is neglect. Someone sets it up once, it works, and then six months later half the URLs are broken because routes changed, content was deleted, or the API endpoint the sitemap generator calls stopped working.

Add sitemap health to your monitoring:

  • Set a scheduled check that fetches your sitemap and confirms it returns 200
  • Alert if the URL count drops significantly (could indicate a generation failure)
  • Review Search Console's submitted vs. indexed counts monthly

A sitemap that's broken or stale is often worse than no sitemap it wastes Google's crawl budget on 404 pages and signals that you don't maintain your site carefully.


Part of our series on Programmatic SEO for API-First CMS. Read the full pillar guide or learn about generating SEO metadata automatically.

Whitepapper

Whitepapper is an API first CMS engine for devs who publish content.

Antk i n