Key Concepts
BioLM applies the principles of modern language modeling to biological sequences such as proteins and nucleic acids.
This article introduces a few key concepts that will help you understand how BioLM’s models work and what their results represent.
Sequences as Language
At the core of BioLM is the idea that biological sequences — strings of amino acids or nucleotides — can be treated like language.
Just as words in a sentence follow rules of grammar, biological sequences have patterns and dependencies that encode function and structure.
Language models learn those patterns from large biological datasets.
Language Models
A language model is a type of AI model that learns to predict the next element in a sequence.
In biology, this means predicting how likely a certain amino acid or base is to appear, given the context of the others.
Through this process, the model develops an internal understanding of biochemical relationships and structures.
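As a rough illustration of the idea (not how BioLM's neural models actually work), the toy sketch below estimates next-residue probabilities from simple counts over a handful of made-up sequences:

```python
# Toy illustration only: estimate the probability of the next amino acid from
# counts in a tiny made-up corpus. A real language model learns the same kind
# of conditional distribution, but with a neural network over far more data.
from collections import Counter, defaultdict

corpus = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQK"]  # invented sequences

# Count which residue follows each residue (a first-order approximation).
following = defaultdict(Counter)
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        following[prev][nxt] += 1

def next_residue_probs(context_residue):
    counts = following[context_residue]
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

print(next_residue_probs("K"))  # estimated distribution over what follows "K"
```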
Transformers
Most language models used today — including those in BioLM — are based on the transformer architecture.
Transformers use a mechanism called attention, which allows them to weigh the relationships between all parts of a sequence at once.
This makes them especially powerful for modeling long sequences, where distant residues or bases still influence one another.
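The sketch below shows the core attention computation in NumPy; the shapes, random input, and single-head setup are simplifying assumptions rather than the configuration of any particular BioLM model:

```python
# Minimal sketch of scaled dot-product attention, the core operation inside a
# transformer layer. Real models add learned projections and multiple heads.
import numpy as np

def attention(Q, K, V):
    """Each output row is a weighted mix of all value rows, so every position
    can draw on every other position in the sequence at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise position similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over positions
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 token positions, 8-dimensional features
out = attention(x, x, x)         # self-attention: queries, keys, values share x
print(out.shape)                 # (5, 8)
```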
Tokens
Before a model can learn from sequences, each element (such as an amino acid or nucleotide) must be converted into a token, a discrete symbol that represents it numerically.
The model’s vocabulary is made up of these tokens, similar to words in natural language models.
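A minimal sketch of the idea, assuming a plain 20-letter amino-acid vocabulary (real tokenizers also add special tokens for padding, masking, and sequence boundaries):

```python
# Toy tokenizer: map each amino acid to an integer ID from a fixed vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    return [token_to_id[aa] for aa in sequence]

print(tokenize("MKTAYIAK"))  # [10, 8, 16, 0, 19, 7, 0, 8]
```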
Embeddings
An embedding is a numerical representation of a sequence or token.
It captures the model’s internal understanding of what that sequence “means.”
Similar sequences will have similar embeddings, making them useful for clustering, visualization, and downstream analysis.
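For example, cosine similarity is a common way to compare embeddings; the vectors below are made up for illustration rather than produced by an actual model:

```python
# Sketch: comparing embeddings with cosine similarity. In practice the vectors
# would come from an embedding model, not be written by hand.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.array([0.9, 0.1, 0.4])   # embedding of sequence A (illustrative)
emb_b = np.array([0.8, 0.2, 0.5])   # embedding of a similar sequence
emb_c = np.array([-0.7, 0.9, 0.0])  # embedding of an unrelated sequence

print(cosine_similarity(emb_a, emb_b))  # close to 1.0 -> similar
print(cosine_similarity(emb_a, emb_c))  # much lower -> dissimilar
```

The same pairwise similarities can feed clustering or nearest-neighbor searches across a whole library of sequences.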
Pretraining and Fine-Tuning
Language models are first pretrained on large, unlabeled datasets to learn general biological patterns.
They can then be fine-tuned on smaller, task-specific datasets — for example, to classify antibodies, predict binding sites, or design new variants.
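One lightweight variant of this workflow keeps the pretrained model frozen and trains a small classifier on its embeddings (often called a linear probe); full fine-tuning updates the pretrained weights themselves. The sketch below uses random stand-in embeddings and labels, with scikit-learn chosen purely for brevity:

```python
# Hedged sketch: train a small task head on frozen pretrained embeddings.
# The embeddings and labels here are random placeholders for real data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 128))   # stand-in per-sequence embeddings
labels = rng.integers(0, 2, size=40)      # e.g. binder vs. non-binder

head = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(head.predict(embeddings[:5]))       # task-specific predictions
```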
Generative and Predictive Tasks
- Generative models create new sequences that resemble natural ones, often optimized for specific properties.
- Predictive models evaluate existing sequences to estimate traits such as stability, binding affinity, or structure quality. A toy sketch contrasting the two approaches follows this list.
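The sketch below contrasts the two modes with a made-up per-position probability table: generation samples new sequences from it, while prediction-style scoring evaluates an existing sequence against it. Real models learn these distributions from data rather than using a fixed table:

```python
# Toy contrast, illustrative only: a tiny 4-letter alphabet and hand-written
# per-position probabilities stand in for a learned model.
import math
import random

random.seed(0)
position_probs = [
    {"A": 0.7, "C": 0.1, "D": 0.1, "G": 0.1},
    {"A": 0.1, "C": 0.6, "D": 0.2, "G": 0.1},
    {"A": 0.2, "C": 0.2, "D": 0.5, "G": 0.1},
]

def generate():
    """Generative: sample a new sequence position by position."""
    return "".join(
        random.choices(list(p.keys()), weights=list(p.values()))[0]
        for p in position_probs
    )

def score(sequence):
    """Predictive-style scoring: log-likelihood of an existing sequence."""
    return sum(math.log(p[aa]) for p, aa in zip(position_probs, sequence))

print(generate())     # a newly sampled sequence, e.g. "ACD"
print(score("ACD"))   # higher (less negative) than an unlikely sequence
print(score("GGG"))
```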
Model Outputs
Depending on the model, results may include the following (a hypothetical example is sketched after this list):
- Probabilities – how likely a sequence or residue is to occur
- Embeddings – numeric vectors representing biological meaning
- Predictions – property scores, classifications, or other computed outcomes
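A hypothetical example of what such outputs could look like as plain Python data; the field names and values are illustrative assumptions, not BioLM's actual response schema:

```python
# Hypothetical model output (illustrative field names, not a real schema).
example_result = {
    "sequence": "MKTAYIAK",
    "probabilities": [0.92, 0.88, 0.75, 0.97, 0.81, 0.69, 0.95, 0.90],  # per residue
    "embedding": [0.12, -0.45, 0.33, 0.08],   # truncated numeric vector
    "predictions": {"stability_score": 0.78, "is_binder": True},
}

# Downstream code treats these like any other numeric data.
mean_prob = sum(example_result["probabilities"]) / len(example_result["probabilities"])
print(round(mean_prob, 3))
```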
Why It Matters
Understanding these concepts helps you interpret what BioLM’s models are doing under the hood — whether you’re generating new proteins, predicting function, or fine-tuning your own model.
You don’t need to be a machine learning expert to use BioLM effectively, but having this shared vocabulary makes it easier to connect biological intuition with model outputs.