Why should I not index datasets with gene symbols?

Gene symbols are widely used for readability, particularly for visualization. However, indexing datasets with gene symbols presents challenges:

  • A single gene may have multiple symbols or aliases.

  • Gene symbols change over time (e.g., BRCA2 was once FACD) without version tracking.

  • The same symbol can represent different genes across species.

  • Symbols may be misinterpreted by software (e.g., SEPT9 as “September 9” in Excel).

  • Formatting inconsistencies exist (e.g., case sensitivity, special characters).

Using unique identifiers like ENSEMBL gene IDs addresses these issues by providing:

  • A direct, stable mapping to genomic coordinates.

  • Consistency across databases.

  • Species-specific prefixes to prevent cross-species confusion.

  • Unique, permanent identifiers with standardized formatting.

Storing ENSEMBL gene IDs alongside gene symbols offers readability for visualization while maintaining robust data integrity. During curation, validating against ENSEMBL gene IDs ensures accurate mapping.

If only symbols are available for a dataset, you can map them to ENSEMBL IDs using standardize().

# !pip install 'lamindb[bionty]'
!lamin init --storage test-symbols --schema bionty
Hide code cell output
→ connected lamindb: testuser1/test-symbols
import lamindb as ln
import bionty as bt
import numpy as np
import pandas as pd
import anndata as ad

# create example AnnData object with gene symbols
rng = np.random.default_rng(42)
X = rng.integers(0, 100, size=(5, 10))
var = pd.DataFrame(index=pd.Index(['BRCA1', 'TP53', 'EGFR', 'KRAS', 'PTEN', 'MYC', 'VEGFA', 'IL6', 'TNF', 'GAPDH'], name="symbol"))
adata = ad.AnnData(X=X, var=var)
adata.var
Hide code cell output
→ connected lamindb: testuser1/test-symbols
symbol
BRCA1
TP53
EGFR
KRAS
PTEN
MYC
VEGFA
IL6
TNF
GAPDH
# map Gene symbols to ENSEMBL IDs
gene_mapper = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    return_mapper=True,
    organism="human"
)
adata.var["ensembl_id"] = adata.var.index.map(gene_mapper)
adata.var
Hide code cell output
! found 10 symbols in Bionty: ['BRCA1', 'GAPDH', 'VEGFA', 'KRAS', 'IL6', 'MYC', 'TP53', 'EGFR', 'PTEN', 'TNF']
  please add corresponding Gene records via: `.from_values(['BRCA1', 'GAPDH', 'VEGFA', 'KRAS', 'IL6', 'MYC', 'TP53', 'EGFR', 'PTEN', 'TNF'])`
ensembl_id
symbol
BRCA1 ENSG00000012048
TP53 ENSG00000141510
EGFR ENSG00000146648
KRAS ENSG00000133703
PTEN ENSG00000171862
MYC ENSG00000136997
VEGFA ENSG00000112715
IL6 ENSG00000136244
TNF ENSG00000204490
GAPDH ENSG00000111640
standardized_genes = bt.Gene.from_values(
    ['ENSG00000141510', 'ENSG00000133703', 'ENSG00000111640', 'ENSG00000171862', 'ENSG00000204490', 'ENSG00000112715', 'ENSG00000146648', 'ENSG00000136997', 'ENSG00000012048', 'ENSG00000136244'],
    field=bt.Gene.ensembl_gene_id,
    organism="human"
)
ln.save(standardized_genes)

This allows for validating the the ensembl_id against the Gene registry using the bt.Gene.ensembl_gene_id field.

bt.Gene.validate(adata.var["ensembl_id"], field=bt.Gene.ensembl_gene_id)
Hide code cell output
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

Note

Gene symbols do not map one-to-one with ENSEMBL IDs. A single gene symbol may correspond to multiple ENSEMBL IDs due to:

  1. Gene Paralogs: Similar symbols can be shared among paralogous genes within the same species, resulting in one symbol linking to multiple ENSEMBL IDs.

  2. Pseudogenes: Some symbols represent both functional genes and their non-functional pseudogenes, each with distinct ENSEMBL IDs.

  3. Transcript Variants: One symbol may map to multiple ENSEMBL transcript IDs, each representing different isoforms or splice variants.

standardize() retrieves the first match in cases of multiple hits, which is generally sufficient but not perfectly accurate.

!lamin delete --force test-symbols
Hide code cell output
• deleting instance testuser1/test-symbols