This Website is Hosted on Bluesky

Well, not this one. But this one is! How? Let’s take a closer look at Bluesky and the AT Protocol that underpins it.

Note: I communicated with the Bluesky team prior to the publishing of this post. While the functionality described is not the intended use of the application, it is known behavior and does not constitue a vulnerability disclosure process. My main motivation for reaching out to them was because I like the folks and don’t want to make their lives harder.

Being able to host a website on Bluesky really has very little to do with Bluesky itself. I happen to use Bluesky for hosting my Personal Data Server (PDS), but all of the APIs leveraged in uploading the site contents are defined at the AT Protocol level and implemented by a PDS. Bluesky offers access to my PDS via their PDS entryway, which allows for the many (have you heard that they are growing by a million users per day?) PDS instances they run to be exposed via the bsky.social domain. That being said, individual PDS instances can be accessed directly, and if you clicked the link at the top of this post to access the Bluesky hosted website, then you have already visited mine at porcini.us-east.host.bsky.network.

Most social applications, and many applications in general for that matter, broadly have two primary types of content: records and blobs. Records are the core entity types that users create. They generally have some defined structure and metadata, and they may reference other records or content. Blobs are typically larger unstructured data, such as media assets, that may be uploaded by a user, but are exposed via a record referencing them. For example, on Bluesky a user may upload an image, then create a post that references it. From an end-user perspective, these two operations appear to be one action, but they are typically decoupled at the API level.

This decoupling is described in detail in the AT Protocol blob specification.

Blob files are uploaded and distributed separately from records. Blobs are authoritatively stored by the account’s PDS instance, but views are commonly served by CDNs associated with individual applications (“AppViews”), to reduce traffic on the PDS. CDNs may serve transformed (resized, transcoded, etc) versions of the original blob.

Later on, the specification details how blob lifecylce is to be managed.

Blobs must be uploaded to the PDS before a record can be created referencing that blob. Note that the server does not know the intended Lexicon when receiving an upload, so can only apply generic blob limits and restrictions at initial upload time, and then enforce Lexicon-defined limits later when the record is created.

Reading this section is what initially got my wheels turning. While Bluesky has a limited set of media asset types that can be referenced by posts, posts are just one record type that is defined by the Bluesky lexicon (app.bsky.*). Records, on the other hand, are defined in the AT Protocol lexicon (com.atproto.*) and are designed to accommodate creating any type of record defined by any lexicon. Because different types of blobs may be relevant for other lexicons, the specification highlights that restrictions cannot be enforced at time of upload.

Instead blobs are not made available until they are referenced, at which point the validation can be performed based on the lexicon of the record type.

After a successful upload, blobs are placed in temporary storage. They are not accessible for download or distribution while in this state. Servers should “garbage collect” (delete) un-referenced temporary blobs after an appropriate time span (see implementation guidelines below). Blobs which are in temporary storage should not be included in the listBlobs output.

The upload blob can now be referenced from records by including the returned blob metadata in a record. When processing record creation, the server extracts the set of all referenced blobs, and checks that they are either already referenced, or are in temporary storage. Once the record creation succeeds, the server makes the blob publicly accessible.

However, applying validation does not mean that Bluesky’s restrictions will necessarily be applied. A record that references a blob could very well be of a type defined by a different lexicon, or, as we’ll see later on, part of a sub-schema enabled by an open union in the Bluesky lexicon. Let’s see how this works in practice.

In order to perform data creation operations against a PDS, an access token must be acquired for authentication. The com.atproto.server.createSession XRPC method can be used to exchange user credentials for a token. In the following curl command, I used danielmangum.com as $BSKY_HANDLE and my password as $BSKY_PWD.

curl -X POST 'https://bsky.social/xrpc/com.atproto.server.createSession' \
-H 'Content-Type: application/json' \
-d '{"identifier": "'"$BSKY_HANDLE"'", "password": "'"$BSKY_PWD"'"}'

The response includes an accessJWT field, which will be used as $ACCESS_JWT in subsequent operations. As described in the blob specification, a blob must be uploaded prior to it being referenced. I wanted to verify that the blob was not present in the com.atproto.sync.listBlobs output, or accessible via the com.atproto.sync.getBlob methods immediately after upload, so I checked how many blobs were currently being returned.

curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'

The decentralized identifier ($DID) used above can be obtained from the createSession output as well. It is the underlying identifier for an account. Every Bluesky handle resolves to a DID.

The com.atproto.repo.uploadBlob method is used to upload a blob to a repository. The content of the website is a simple index.html file.

<h1>This Website is Hosted on Bluesky</h1>

<p>
This website is just a blob uploaded to Bluesky via the API. Curious about how
this works? Check out the write-up on <a
href="https://danielmangum.com/posts/this-website-is-hosted-on-bluesky/">danielmangum.com</a>.
</p>

To upload it, I used the following command.

curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.uploadBlob' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' \
-H 'Content-Type: text/html' \
--data-binary '@index.html'

{
  "blob": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
    },
    "mimeType": "text/html",
    "size": 268
  }
}

The returned $link can be used as the Content Identifier (cid) when fetching the blob via the getBlob method. However, according to the specification, because this blob has to be referenced, it shouldn’t be visible. I checked to see if I could access it with the following command.

curl -L 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid='"$LINK"''

{
  "error": "InternalServerError",
  "message": "Internal Server Error"
}

Not the error I was expecting, but it looks like I indeed cannot access it. I was also able to determine that it had not beed added to the listBlobs output.

curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'

Blobs can be referenced in app.bsky.feed.post records on Bluesky by including an embedded image. However, the app.bsky.embed.image schema retricts the MIME type to those prefixed with image/*. We can see this validation in action if we try to create a post with an embedded image.

curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' \
-H 'Content-Type: application/json' \
-d '{
  "repo": "danielmangum.com",
  "collection": "app.bsky.feed.post",
  "record": {
    "$type": "app.bsky.feed.post",
    "text": "testing123",
    "createdAt": "2024-11-23T05:49:35.422015Z",
    "embed": {
      "$type": "app.bsky.embed.images",
      "images": [
        {
          "alt": "that is not an image that is a website!",
          "image": {
            "$type": "blob",
            "ref": {
              "$link": "bafkreidphtuvbzublyzacxukmmk2ikiur5ahme75fegokbhh26o4wfzvry"
            },
            "mimeType": "text/html",
            "size": 21
          }
        }
      ]
    }
  }
}'

{
  "error": "InvalidMimeType",
  "message": "Wrong type of file. It is text/html but it must match image/*."
}

For completeness, I also tried specifying the mimeType as image/jpeg and verified that the PDS also validates that the blob reference MIME type matches the blob.

{
  "error": "InvalidMimeType",
  "message": "Referenced Mimetype does not match stored blob. Expected: text/html, Got: image/jpeg"
}

However, the blob $type is part of the AT Protocol data model and not specific to Bluesky. Because Bluesky’s PDS implementation is open source, we can see exactly how a BlobRef is defined.

export class BlobRef {
  public original: JsonBlobRef

  constructor(
    public ref: CID,
    public mimeType: string,
    public size: number,
    original?: JsonBlobRef,
  ) {
    this.original = original ?? {
      $type: 'blob',
      ref,
      mimeType,
      size,
    }
  }

We can also see exactly how the PDS identifies blobs in a record.

export const findBlobRefs = (
  val: LexValue,
  path: string[] = [],
  layer = 0,
): FoundBlobRef[] => {
  if (layer > 32) {
    return []
  }
  // walk arrays
  if (Array.isArray(val)) {
    return val.flatMap((item) => findBlobRefs(item, path, layer + 1))
  }
  // objects
  if (val && typeof val === 'object') {
    // convert blobs, leaving the original encoding so that we don't change CIDs on re-encode
    if (val instanceof BlobRef) {
      return [
        {
          ref: val,
          path,
        },
      ]
    }
    // retain cids & bytes
    if (CID.asCID(val) || val instanceof Uint8Array) {
      return []
    }
    return Object.entries(val).flatMap(([key, item]) =>
      findBlobRefs(item, [...path, key], layer + 1),
    )
  }
  // pass through
  return []
}

The important thing to notice is that identifying blob references does require the presence of a lexicon schema. findBlobRefs recursively navigates a LexValue and looks for $type: blob. In order to support new lexicons over time, the PDS needs to be able to handle lexicons that it doesn’t know about. Because blobs are a fundamental component of so many applications, these new lexicons also need to be able to leverage them. To put this into action, I attempted to create a record of type com.danielmangum.hack.website, which included a reference to the uploaded HTML blob.

curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' \
-H 'Content-Type: application/json' \
-d '{
  "repo": "danielmangum.com",
  "collection": "com.danielmangum.hack.website",
  "record": {
    "$type": "com.danielmangum.hack.website",
    "website": {
      "$type": "blob",
      "ref": {
        "$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
      },
      "mimeType": "text/html",
      "size": 268
    }
  }
}'

{
  "uri": "at://did:plc:j22nebhg6aek3kt2mex5ng7e/com.danielmangum.hack.website/3lbnguuzckm2u",
  "cid": "bafyreicjcptshc7lmgb7abxlvcb5fmqqjdj6neie23szyum7rcaowmm5qm",
  "commit": {
    "cid": "bafyreid6apjjy56xoyenxmg5xv356twh22n3hayecoxlf6mflpltlzpuwu",
    "rev": "3lbnguuzmd42u"
  },
  "validationStatus": "unknown"
}

It worked! We can see in the response that the PDS was unable to validate the record (validationStatus: unknown) because it does not know about the com.danielmangum.hack.* lexicon. Nevertheless, it will agree to persist the record. The next step was to check whether the referenced blob had been persisted.

curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'

It looked like it had as the count had increased by 1. Fetching the blob directly would tell us for sure. Importantly, getBlob does not require passing the $ACCESS_JWT because unauthenticated parties need to be able to fetch blobs to process alongside records that reference them.

curl 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq'

{
  "error": "Redirecting",
  "message": "Redirecting to new blob location"
}

Adding -L to the command enables following redirects.

curl -L 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq'

<h1>This Website is Hosted on Bluesky</h1>

<p>
This website is just a blob uploaded to Bluesky via the API. Curious about how
this works? Check out the write-up on <a
href="https://danielmangum.com/posts/this-website-is-hosted-on-bluesky/">danielmangum.com</a>.
</p>

Examining the redirect response, we can see that we are being directed directly to my PDS.

< HTTP/2 302 
< date: Sun, 24 Nov 2024 13:58:21 GMT
< content-type: application/json; charset=utf-8
< content-length: 68
< location: https://porcini.us-east.host.bsky.network/xrpc/com.atproto.sync.getBlob?did=did:plc:j22nebhg6aek3kt2mex5ng7e&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq
< x-powered-by: Express
< access-control-allow-origin: *
< ratelimit-limit: 3000
< ratelimit-remaining: 2997
< ratelimit-reset: 1732456898
< ratelimit-policy: 3000;w=300
< etag: W/"44-1je7JKzDJZFd5iRtOI+IS+zlOOE"
< vary: Accept-Encoding

Opening the location URL in the browser presents the website as expected, And just like that, we have a website hosted on Bluesky! While this is not really the intended use of blobs on Bluesky specifically, it could be a legitimate use case in the future. Records that reference website content, code, or other binary artifacts are a possibility on the AT Protocol. That being said, if a service like Bluesky is running PDS instances on behalf of users, this effectively equates to free (albiet unreliable) arbiratry file hosting, which has implications beyond just racking up large storage and egress data fees. Returning back to the blobs specification, there is an additional section on Security Considerations.

Serving arbitrary user-uploaded files from a web server raises many content security issues. For example, cross-site scripting (XSS) of scripts or SVG content form the same “origin” as other web pages. It is effectively mandatory to enable a Content Security Policy (LINK: https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) for the getBlob endpoint. It is effectively not supported to dynamically serve assets directly out of blob storage (the getBlob endpoint) directly to browsers and web applications. Applications must proxy blobs, files, and assets through an independent CDN, proxy, or other web service before serving to browsers and web agents, and such services are expected to implement security precautions.

Bluesky does apply recommended CSP headers to the endpoint in the handler, which guards against some of the issues described.

      res.setHeader('x-content-type-options', 'nosniff')
      res.setHeader('content-security-policy', `default-src 'none'; sandbox`)

There is also a default size limit on blob of 5 MB.

    blobUploadLimit: env.blobUploadLimit ?? 5 * 1024 * 1024, // 5mb

Images, the most common blob type on the Bluesky application, are expectedly not served directly from PDS instances, but from the Bluesky CDN. For example, the following URL points to the feed thumbnail version of an image I recently uploaded as part of a post.

https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:j22nebhg6aek3kt2mex5ng7e/bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve@jpeg

A different URL provides the full size version.

https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:j22nebhg6aek3kt2mex5ng7e/bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve@jpeg

However, the post that references the image just includes the cid. The application itself needs to be aware of how images are served from the CDN.

{
  "$type": "app.bsky.feed.post",
  "createdAt": "2024-11-12T14:18:44.263Z",
  "embed": {
    "$type": "app.bsky.embed.images",
    "images": [
      {
        "alt": "Title image for blog post \"USB On-The-Go on the ESP32-S3\" on danielmangum.com.",
        "aspectRatio": {
          "height": 1080,
          "width": 1920
        },
        "image": {
          "$type": "blob",
          "ref": {
            "$link": "bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve"
          },
          "mimeType": "image/jpeg",
          "size": 869901
        }
      }
    ]
  },
  "facets": [
    {
      "features": [
        {
          "$type": "app.bsky.richtext.facet#link",
          "uri": "https://danielmangum.com/posts/usb-otg-esp32s3/"
        }
      ],
      "index": {
        "byteEnd": 261,
        "byteStart": 229
      }
    }
  ],
  "langs": [
    "en"
  ],
  "text": "ICYMI: This weekend I wrote about USB On-The-Go on the ESP32-S3. OTG allows devices to also act as USB hosts. I dive into how the USB PHY is configured, and demonstrate connecting two ESP32-S3's, as well as a Raspberry Pi Pico.\n\ndanielmangum.com/posts/usb-ot..."
}

The logic is present in the ImageUriBuilder, which will use a CDN if one is configured.

    const imgUriBuilder = new ImageUriBuilder(
      config.cdnUrl || `${config.publicUrl}/img`,
    )

So why does Bluesky provide direct unauthenticated access to the PDS getBlobs endpoint? Once again illustrating the beauty of open source, there is an issue describing the original motivation. In it, image labeling and user content export, as well as additional future use cases, are enumerated. There is also a mention of the possibility of users hotlinking content and Bluesky for free hosting, so these issues are clearly top-of-mind. The original implementation did not include the proper security headers, but they were subsequently added.

Traditional social platforms can place more restrictions on blobs at time of upload because there is a limited set of valid content. The extensibility of Bluesky and the AT Protocol, which is what differentiates it from traditional networks, also necessitates more complexity. However, I, and clearly the awesome folks building Bluesky, think it’s clearly worth it.

Bonus Content Link to heading

I mentioned sub-schemas and open unions earlier in this post. The app.bsky.feed.post type includes a union for valid embeds. Per the AT Protocol lexicon specification, unions are open unless explicitly marked as closed.

By default unions are “open”, meaning that future revisions of the schema could add more types to the list of refs (though can not remove types). This means that implementations should be permissive when validating, in case they do not have the most recent version of the Lexicon. The closed flag (boolean) can indicate that the set of types is fixed and can not be extended in the future.

The embed union is not marked as closed.

          "embed": {
            "type": "union",
            "refs": [
              "app.bsky.embed.images",
              "app.bsky.embed.video",
              "app.bsky.embed.external",
              "app.bsky.embed.record",
              "app.bsky.embed.recordWithMedia"
            ]
          },

Therefore, posts can be created with an embed $type that is not enumerated. For example, I could also persist the website HTML via making a post on Bluesky with a custom embed.

curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
-H 'Authorization: Bearer '"$ACCESS_JWT"'' \
-H 'Content-Type: application/json' \
-d '{
  "repo": "danielmangum.com",
  "collection": "app.bsky.feed.post",
  "record": {
    "$type": "app.bsky.feed.post",
    "text": "This post embeds a website.",
    "createdAt": "2024-11-23T05:49:35.422015Z",
    "embed": {
      "$type": "com.danielmangum.hack.sites",
      "sites": [
        {
          "site": {
            "$type": "blob",
            "ref": {
              "$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
            },
            "mimeType": "text/html",
            "size": 268
          }
        }
      ]
    }
  }
}'

{
  "uri": "at://did:plc:j22nebhg6aek3kt2mex5ng7e/app.bsky.feed.post/3lbpfxwnjoq23",
  "cid": "bafyreidnlyhcvlzl5hc3btih6ly5anjld6ss4bgocyichnm72cpnjuzsvu",
  "commit": {
    "cid": "bafyreibv77m3bdyywmotn7ncbbrqv6pv7irzmw27bzklt4tppgsoodarma",
    "rev": "3lbpfxwo57q23"
  },
  "validationStatus": "valid"
}

In the Bluesky application, the embed is silently ignored.

However, the content is persisted and the reference is included in the post record, so a different application could choose to start rendering the embed.

{
  "$type": "app.bsky.feed.post",
  "createdAt": "2024-11-23T05:49:35.422015Z",
  "embed": {
    "$type": "com.danielmangum.hack.sites",
    "sites": [
      {
        "site": {
          "$type": "blob",
          "ref": {
            "$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
          },
          "mimeType": "text/html",
          "size": 268
        }
      }
    ]
  },
  "text": "This post embeds a website."
}

In my opinion, this is one of the most interesting features of lexicons because it allows for “micro-extensions” that build on existing use cases (e.g. “microblogging”). For example, I for one would love a world in which small code snippets could be embedded in my posts and run in a WebAssembly sandbox by other users. But that’s a post for another day.