
Pitch: PetalsLightning

Computational power, or "compute", is electricity transformed by a machine to perform a calculation.

Compute is used during the training phase of models (when they "learn") and also during inference (when they make predictions). But doing either with the newest 100B+ LLMs is difficult due to memory and computational costs. As a result, these LLMs usually require multiple high-end GPUs or multi-node clusters to run, effectively gatekeeping access and preventing most people from benefiting fully from their capabilities.

Petals is an existing open-source project that lets you collaboratively run 100B+ LLMs: you load a small part of the model and team up with people serving the other parts to run inference or fine-tuning. Decentralized compute in action!
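
To make that concrete, here is roughly what using Petals looks like from the client side (a sketch adapted from the Petals README; the model name and prompt are just examples):

```python
# Sketch of Petals client-side usage (adapted from the Petals README).
# The swarm serves the transformer blocks; your machine runs only a small part.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"  # any Petals-supported model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What is decentralized compute?", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
```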

Compute is powering a new industrial revolution in intelligence, and decentralized compute is how we level the playing field.

But Petals has a problem: the incentives just aren't there for contributors to run servers, share their GPUs, and increase the capacity of the public swarm. Petals follows an approach known as volunteer computing, a type of distributed computing that started in 1996 with the Great Internet Mersenne Prime Search and persists today in projects like BOINC, allowing many independent parties to combine their computational resources and collectively perform large-scale experiments. However, relying solely on volunteers constrains its potential to scale.

The Petals community is already discussing a centralized incentive system based on imaginary points. I think we can do better: we can use Bitcoin to create the right incentives for decentralized compute and ensure widespread adoption, sustainability, and impact.

PetalsLightning is just the first step in that direction.

The Project

PetalsLightning is a chatbot interface that uses a Petals swarm (decentralized compute network) for collaborative inference of large language models (LLMs).

Demo video

https://www.youtube-nocookie.com/embed/4bNltl5KWxM

How it works 🚨

  1. Contributing servers provide a lightning address

  2. User pays per message to the main lightning address of the swarm

  3. Petals does some magic to select the best servers and compute the response

  4. The payment gets distributed to the servers involved in the inference

[Flow diagram]

Note: The lightning addresses shown for the servers are just for demo purposes, to showcase the functionality. The change needed in the main Petals project to support this is minimal: servers can already announce a "name" for the world to see, so adding another piece of information is trivial.
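
To illustrate steps 3-4, here is a rough Python sketch of how a payment could be split among the servers that took part in an inference. The real project does this in the Node.js SplitSats server; the proportional-to-blocks rule here is just one possible policy, and the names are my own stand-ins:

```python
# Illustrative sketch: split one payment across the servers that served
# model blocks for a request, proportionally to how many blocks each held.
from dataclasses import dataclass

@dataclass
class Server:
    ln_address: str       # lightning address announced by the server
    blocks_served: int    # model blocks this server held for the request

def split_payment(total_sats: int, servers: list[Server]) -> dict[str, int]:
    if not servers:
        return {}
    total_blocks = sum(s.blocks_served for s in servers)
    shares = {s.ln_address: total_sats * s.blocks_served // total_blocks
              for s in servers}
    # Integer division leaves a remainder; give it to the first server.
    shares[servers[0].ln_address] += total_sats - sum(shares.values())
    return shares

servers = [Server("alice@getalby.com", 40), Server("bob@getalby.com", 40),
           Server("carol@getalby.com", 16)]
print(split_payment(96, servers))
# {'alice@getalby.com': 40, 'bob@getalby.com': 40, 'carol@getalby.com': 16}
```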

How was it built?

My background in ML/AI was zero before this hackathon, but it was the perfect opportunity to dive into both lightning-enabled applications and AI. That led me down the distributed compute path, and after sharing my thoughts, findings, and research in the Discord, I ended up focusing on Petals. After all, open source is all about building and evolving on top of other projects, and Petals has already done the hard work of figuring out the inner algorithms and logic for decentralized machine learning.

I've forked the original chat.petals.dev project and used a lot of the provided workshop resources to integrate lightning.

The project has two parts: a Flask app with a WebSocket endpoint using a variation of the L402 protocol, and a Node.js server responsible for splitting the user's payment among the peers (servers) that worked on the user's request. Petals doesn't play well with plain HTTP, so I couldn't implement L402 as-is; I had to do a bit of hacking to apply the overall paywall idea to websockets.
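
Here is a minimal sketch of that websocket paywall idea, assuming flask-sock for the WebSocket endpoint and pymacaroons for the L402-style token; create_invoice() is a hypothetical helper wrapping a lightning backend, not part of the actual project code:

```python
# Minimal L402-over-websocket sketch (flask-sock + pymacaroons assumed).
import hashlib
import json

from flask import Flask
from flask_sock import Sock
from pymacaroons import Macaroon

app = Flask(__name__)
sock = Sock(app)

MACAROON_SECRET_KEY = "replace-me"  # same idea as the .env secret below

def create_invoice(sats: int) -> tuple[str, str]:
    """Hypothetical helper: returns (bolt11, payment_hash_hex) from an LN backend."""
    raise NotImplementedError

@sock.route("/chat")
def chat(ws):
    # 1. Challenge the client with an invoice plus a macaroon (L402-style).
    bolt11, payment_hash = create_invoice(sats=100)
    mac = Macaroon(location="petals-lightning",
                   identifier=payment_hash,
                   key=MACAROON_SECRET_KEY)
    ws.send(json.dumps({"type": "payment_required",
                        "invoice": bolt11,
                        "macaroon": mac.serialize()}))

    # 2. The client pays and replies with the preimage as proof of payment.
    msg = json.loads(ws.receive())
    preimage = bytes.fromhex(msg.get("preimage", ""))
    if hashlib.sha256(preimage).hexdigest() != payment_hash:
        ws.send(json.dumps({"type": "error", "reason": "invalid preimage"}))
        return

    # 3. Payment proven; from here the prompt would be streamed to the swarm.
    ws.send(json.dumps({"type": "ok"}))
```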

Try it out

The project only runs on my local machine for now, but I can expose it with ngrok on request. The specs needed for hosting are a bit higher than what is currently considered cost-effective (Replit definitely couldn't support it), so hosting it somewhere for days was not possible.

In order to run it on your own machine, you need to clone the project and add the following to your .env file:

  • API_KEY=<splitsats-api-key> — the same key that is used for the SPLIT_API_KEY secret in the SplitSats server

  • MACAROON_SECRET_KEY=<your-secret> — just a random secret

  • SWARM_LN_ADDRESS=<your-lightning-address> — the main swarm address the user pays to send a message. This must match the address used for the NWC_URL in the SplitSats server.
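
Put together, a complete .env might look like this (all values are placeholders):

```
API_KEY=<splitsats-api-key>
MACAROON_SECRET_KEY=<your-secret>
SWARM_LN_ADDRESS=you@getalby.com
```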

Follow the README instructions to run the client. Normally you need to request access in order to serve Llama 2, but to make this easier you can use my own access token (I will invalidate it sometime in the future). After running the pip install command, just run huggingface-cli login --token <retracted-access-token> and you are good to go.

Petals does not download the whole model (only the embeddings, layernorms and model head), but this will still download 20-30GB of data.

The next step is to run your own SplitSats server. As mentioned, the SPLIT_API_KEY secret must match the API_KEY in the Flask app, and the NWC_URL must be for the same lightning address used as SWARM_LN_ADDRESS. This matching-address requirement enables the following payment flow:
[Payment flow diagram]
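
For illustration, the hand-off from the Flask app to the SplitSats server could look something like this; the /split endpoint, port, and payload shape are my assumptions for the sketch, not the project's documented API:

```python
# Hypothetical hand-off from the Flask app to the SplitSats server.
# Endpoint URL, port, and payload shape are assumptions for illustration.
import requests

def settle_payment(api_key: str, amount_sats: int, ln_addresses: list[str]) -> None:
    resp = requests.post(
        "http://localhost:3000/split",                   # assumed SplitSats endpoint
        headers={"Authorization": f"Bearer {api_key}"},  # the shared API_KEY
        json={"amount_sats": amount_sats, "recipients": ln_addresses},
        timeout=10,
    )
    resp.raise_for_status()
```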

If you are using an Alby account, you can generate the NWC_URL at https://nwc.getalby.com

You are now good to go! Thank you for taking the time to check out my project.

Next steps

  • Move L402 logic from the Flask app to the SplitSats server and generate the initial invoice from there in order to make the flow more robust.

  • Integrate lightning payment more deeply into the Petals protocol.

  • Expand integration to include the existing fine-tuning capabilities of Petals.

  • Research how to integrate lightning incentives into Petals' "sibling" project, hivemind, which enables decentralized training of models.

The possibilities are endless. Decentralized compute can become the accelerant for a global intelligence revolution, propelling the adoption of lightning and AI not only in underserved markets but throughout the entire world.