PhyloLM: Phylogenetic Mapping of Language Models
This space is under active development: new features and improvements will be added regularly, and some things may not work properly. Feel free to open an issue if you encounter any problems.
Welcome to PhyloLM (paper - code) — a tool for comparing language models based on their behavioral similarity, inspired by methods from comparative genomics. Instead of comparing architectures or weights, we use output behavior on diagnostic prompts as a behavioral fingerprint to compute a distance metric, much as biologists compare species using genetic data. This makes it possible to draw a single map of all LLMs, whatever their architecture and whether they are gated or not. The goal is to provide a collaborative space where everyone can visualize these maps and extend them with models of their choice.
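To make the idea concrete, here is a minimal sketch of such a fingerprint-and-compare step, following the genetics analogy (prompts as "genes", sampled completions as "alleles"). The `sample_fn` callable is a hypothetical stand-in for whatever generation backend is used, and this is an illustration of the idea rather than the paper's exact implementation:

```python
import numpy as np
from collections import Counter

def fingerprint(sample_fn, genes, n_samples=32):
    # One Counter of completion frequencies ("alleles") per probe prompt ("gene").
    # sample_fn is a hypothetical callable: prompt -> one short completion string.
    return [Counter(sample_fn(g) for _ in range(n_samples)) for g in genes]

def similarity(fp_a, fp_b):
    # Cosine-style overlap of allele frequencies, averaged over genes.
    # Illustrative only; the paper's exact metric may differ in detail.
    scores = []
    for ca, cb in zip(fp_a, fp_b):
        alleles = sorted(set(ca) | set(cb))
        pa = np.array([ca[k] for k in alleles], dtype=float) / sum(ca.values())
        pb = np.array([cb[k] for k in alleles], dtype=float) / sum(cb.values())
        scores.append(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))
    return float(np.mean(scores))  # ~1 = near-identical behavior, ~0 = disjoint
```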
Explore Maps of Models
This interactive space allows users to explore model similarities through four types of visualizations:
- A similarity matrix (values range from 0 = dissimilar to 1 = highly similar).
- 2D and 3D scatter plots showing how close to or far from one another the LLMs are (embedded with UMAP).
- A tree visualizing distances between models (the path length from leaf A to leaf B approximates the behavioral distance between the two models); see the sketch below for how these views can be derived from the similarity matrix.
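For intuition, here is a minimal sketch of deriving such views from a precomputed similarity matrix, using umap-learn and SciPy hierarchical clustering; the space's actual pipeline, in particular its tree-building method, may differ:

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def views_from_similarity(sim, labels):
    # sim: (n, n) symmetric similarity matrix with values in [0, 1].
    dist = 1.0 - np.asarray(sim, dtype=float)  # turn similarity into distance
    np.fill_diagonal(dist, 0.0)

    # 2D scatter layout; use n_components=3 for the 3D view.
    xy = umap.UMAP(metric="precomputed", n_components=2).fit_transform(dist)

    # A tree from the same distances (average-linkage clustering here;
    # a different algorithm, e.g. neighbor joining, may be used in practice).
    tree = linkage(squareform(dist, checks=False), method="average")
    dendrogram(tree, labels=labels)
    plt.show()
    return xy, tree
```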
Models from the original paper are colored according to their family (e.g., LLaMA, OPT, Mistral); models added by users are colored grey for now.
Submit a Model
You may contribute new models to this collaborative space using this space's compute resources. Once processed, the model will be compared to existing ones, and its results will be added to a shared public database. Model families (e.g., LLaMA, OPT, Mistral) are extracted from Hugging Face model cards and used only for visualization (e.g., coloring plots); they play no role in the similarity computation.
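Purely for illustration, a family tag can be looked up from Hub metadata along these lines; the `FAMILIES` set below is a made-up subset, and the space's actual extraction logic may differ:

```python
from huggingface_hub import model_info

# Illustrative subset of family tags; the real list is larger.
FAMILIES = {"llama", "opt", "mistral", "qwen2", "gemma"}

def guess_family(repo_id: str) -> str:
    # Heuristic lookup used only to pick a plot color;
    # it plays no role in the similarity computation.
    tags = FAMILIES & set(model_info(repo_id).tags)
    return tags.pop() if tags else "unknown"

print(guess_family("Qwen/Qwen2.5-7B-Instruct"))  # e.g. "qwen2"
```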
To add a new model:
- Enter the name of a model hosted on Hugging Face (e.g., `Qwen/Qwen2.5-7B-Instruct`).
- Click the Run PhyloLM button.
- If the model has already been processed, you'll be notified and no new run will start.
- If it hasn't been processed, it will be downloaded and evaluated.
⚠️ Be careful when submitting large LLMs (typically >15B parameters): they may exceed the available GPU memory or the time limit, leading to failed runs.
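For reference, the evaluation step boils down to sampling completions from the submitted model. A minimal sketch with transformers, assuming the model fits on the available GPU (the prompt is a made-up stand-in for an actual 'Math' gene):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
)

# One sampled completion for one probe prompt; the space draws many per gene.
inputs = tok("12 + 35 =", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=4, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
```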
Disclaimer
This is a research prototype; it may contain bugs and has limitations. All computed data are public and hosted on GitHub. If you'd like to contribute additional models — especially gated or large models that cannot be processed via the web interface — you are welcome to submit a pull request to the repository linked above. All results are computed on the 'Math' set of genes used in the original paper.
Citation
If you find this project useful for your research, please consider citing the following paper: