GenomeUtils.Genome

Class Descriptions

Genome

Top-level container that manages chromosomes, genes, transcripts, and exons with indexing utilities.

Gene

Locus-bound gene that owns transcripts and provides access to the genomic sequence.

Transcript

Transcript with ordered exons, canonical sequence, and helpers for coordinate conversion.

Exon

Exon segment attached to a transcript and able to derive its nucleotide sequence.

Chromosome

Lazy-loaded chromosome wrapper that exposes sequence slices via loci and tracks genes.

Locus

Immutable utility for 1-based inclusive genomic coordinates with overlap/containment helpers.

GenomeElement

Abstract base class that unifies shared behavior for loci-based genome entities.

GenomeBuilder

Fluent builder that assembles a Genome from FASTA and GTF inputs.

Classes

class GenomeUtils.Genome.Chromosome(id, seq_index, genome=None, length=None, **kwargs)[source]

Bases: GenomeElement

Represents a chromosome, with sequence data loaded from file on demand.

Parameters:
  • id (str)

  • seq_index (SeqIO.index)

  • genome (Genome)

  • length (int)

add_gene(gene)[source]
Parameters:

gene (Gene)

property genes: List['Gene']

get_subsequence_by_locus(locus)[source]

Returns a subsequence of the chromosome for a given Locus.

Parameters:

locus (Locus)

Return type:

Bio.Seq.Seq
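A minimal sketch of how on-demand loading and locus-based slicing could fit together. It is not the library's implementation: LazyChromosomeSketch, its subsequence() method, and the plain dict standing in for the SeqIO index are hypothetical, and strand handling is left out because the source does not describe it.

```python
class LazyChromosomeSketch:
    """Hypothetical stand-in for a chromosome that loads its sequence lazily."""

    def __init__(self, chrom_id, seq_index):
        self.id = chrom_id
        self._seq_index = seq_index  # any mapping from ID to sequence (assumption)
        self._sequence = None        # nothing is read until first access

    @property
    def sequence(self):
        # Load once on first access, then reuse the cached value.
        if self._sequence is None:
            self._sequence = str(self._seq_index[self.id])
        return self._sequence

    def subsequence(self, start, end):
        # 1-based inclusive [start, end] becomes the Python slice [start - 1, end).
        return self.sequence[start - 1:end]


index = {"chr1": "ACGTACGTAC"}  # toy data standing in for an indexed FASTA file
chrom = LazyChromosomeSketch("chr1", index)
print(chrom.subsequence(2, 4))
```

The only coordinate arithmetic is the start - 1 shift; the inclusive end already equals the exclusive end of the Python slice.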

property sequence: Bio.Seq.Seq

class GenomeUtils.Genome.Exon(id, chr, start, end, strand, gene=None, transcripts=None, genome=None, sequence=None, **kwargs)[source]

Bases: GenomeElement

Represents an exon.

Parameters:
  • id (str)

  • chr (str)

  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • gene (Gene)

  • transcripts (List['Transcript'])

  • genome (Genome)

  • sequence (Seq)

add_to_transcript(transcript)[source]

Add the exon to the transcript.

Parameters:

transcript (Transcript)

get_gene()[source]

Returns the Gene object that the exon belongs to.

Return type:

Gene

get_transcripts()[source]

Returns the Transcript objects that the exon belongs to.

Return type:

List['Transcript']

property sequence: Bio.Seq.Seq

class GenomeUtils.Genome.Gene(id, name, chr, start, end, strand, chromosome=None, genome=None, **kwargs)[source]

Bases: GenomeElement

Represents a gene.

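Parameters:
  • id (str)

  • name (str)

  • chr (str)

  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • chromosome (Chromosome)

  • genome (Genome)

add_transcript(transcript)[source]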

Add a transcript to the gene.

Parameters:

transcript (Transcript)

get_chromosome()[source]

Returns the Chromosome object that this gene is on.

Return type:

Chromosome

property sequence: Bio.Seq.Seq
property transcripts: List['Transcript']

class GenomeUtils.Genome.Genome(id, species, name, **kwargs)[source]

Bases: object

Represents a genome: a collection of chromosomes, genes, transcripts, and exons.

Parameters:
  • id (str)

  • species (str)

  • name (str)

add_chromosome(chromosome)[source]

Add a chromosome to the genome.

Parameters:

chromosome (Chromosome)

chromosome_by_id(chromosome_id)[source]

Get a chromosome by its ID using the index. Raises ValueError if not found.

Parameters:

chromosome_id (str)

Return type:

Chromosome

property chromosomes: List[Chromosome]

Get all chromosomes in the genome.

exon_by_id(exon_id)[source]

Get an exon by its ID using the index. Raises ValueError if not found.

Parameters:

exon_id (str)

Return type:

Exon

property exons: List[Exon]

Get all exons in the genome.

gene_by_id(gene_id)[source]

Get a gene by its ID using the index. Raises ValueError if not found.

Parameters:

gene_id (str)

Return type:

Gene

property genes: List[Gene]

Get all genes in the genome.

get_sequence_by_locus(locus)[source]

Get the sequence at a given locus.

Parameters:

locus (Locus)

Return type:

Bio.Seq.Seq

index()[source]

Creates an index of all genes, transcripts, and exons for fast lookup. This method MUST be called after all genomic features have been added.
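The requirement to call index() last can be illustrated with a small sketch. It assumes the index is a plain dictionary rebuilt from everything added so far; GenomeIndexSketch and its internals are hypothetical and only mirror the documented behavior of raising ValueError for unknown IDs.

```python
class GenomeIndexSketch:
    """Hypothetical container showing why indexing must follow all additions."""

    def __init__(self):
        self._genes = []       # (gene_id, gene) pairs in insertion order
        self._gene_index = {}  # populated only by index()

    def add_gene(self, gene_id, gene):
        self._genes.append((gene_id, gene))

    def index(self):
        # Rebuild the lookup table from every feature added so far.
        self._gene_index = dict(self._genes)

    def gene_by_id(self, gene_id):
        try:
            return self._gene_index[gene_id]
        except KeyError:
            raise ValueError(f"Gene not found: {gene_id}") from None


genome = GenomeIndexSketch()
genome.add_gene("gene1", "first gene")
genome.index()
genome.add_gene("gene2", "added too late")  # not visible until index() runs again

print(genome.gene_by_id("gene1"))
try:
    genome.gene_by_id("gene2")
except ValueError as err:
    print(err)
```

In this sketch a feature added after index() stays invisible to the lookup methods until the index is rebuilt, which is the failure mode the ordering rule guards against.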

transcript_by_id(transcript_id)[source]

Get a transcript by its ID using the index. Raises ValueError if not found.

Parameters:

transcript_id (str)

Return type:

Transcript

property transcripts: List[Transcript]

Get all transcripts in the genome.

class GenomeUtils.Genome.GenomeBuilder(id, species, name, main_chromosomes=None, separate_scaffolds=True, **kwargs)[source]

Bases: object

Constructs a Genome object from various file formats.

This builder simplifies the process of assembling a complete Genome object by handling the parsing and integration of DNA sequences, cDNA sequences, and gene annotations from standard bioinformatics files.

The correct order of operations is:

  1. with_dna_fasta()

  2. with_cdna_fasta()

  3. with_gtf_file()

  4. build()

Example:

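from pathlib import Path

from GenomeUtils.Genome import GenomeBuilder

# Chain the steps in the documented order: DNA FASTA, cDNA FASTA, GTF, then build().
builder = GenomeBuilder(id="hg38", species="homo_sapiens", name="Human Reference Genome")
genome = (
    builder.with_dna_fasta(Path("path/to/dna.fa"))
    .with_cdna_fasta(Path("path/to/cdna.fa"))
    .with_gtf_file(Path("path/to/annotations.gtf"))
    .build()
)

Parameters: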
  • id (str)

  • species (str)

  • name (str)

  • main_chromosomes (Optional[list[str]])

  • separate_scaffolds (bool)

build()[source]

Finalizes the Genome object by creating an index for fast lookups.

Return type:

Genome | tuple[Genome, Genome]
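Because build() may return either a single Genome or a pair, calling code has to handle both shapes. The helper below is a hypothetical sketch that normalizes the result to a list; the source does not state when the tuple form is returned or what each element holds, so the sketch makes no assumption about that.

```python
def as_genome_list(result):
    """Normalize a build()-style return value to a list (hypothetical helper)."""
    # A tuple is unpacked into its elements; anything else is wrapped as-is.
    if isinstance(result, tuple):
        return list(result)
    return [result]


# Placeholder strings stand in for Genome objects in this illustration.
print(as_genome_list("genome"))
print(as_genome_list(("genome_a", "genome_b")))
```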

set_chromosome_filter(chromosomes)[source]

Set a filter to only include specified chromosomes.

Parameters:

chromosomes (list[str])

Return type:

GenomeBuilder

with_cdna_fasta(cdna_fasta_path)[source]

Loads transcript sequences from a cDNA FASTA file.

Parameters:

cdna_fasta_path (Path)

Return type:

GenomeBuilder

with_dna_fasta(dna_fasta_path)[source]

Loads chromosome sequences from a genomic DNA FASTA file. This must be the first step in the build process.

Parameters:

dna_fasta_path (Path)

Return type:

GenomeBuilder

with_gtf_file(gtf_path)[source]

Parses a GTF file to build the gene-transcript-exon hierarchy. with_dna_fasta() and with_cdna_fasta() must be called before this method.

Parameters:

gtf_path (Path)

Return type:

GenomeBuilder

class GenomeUtils.Genome.GenomeElement(id, locus, parent=None, genome=None, **kwargs)[source]

Bases: ABC

Abstract base class for genomic elements (e.g. chromosomes, genes, transcripts, and exons).

Parameters:
  • id (str)

  • locus (Locus)

  • parent (GenomeElement)

  • genome (Genome)

property chr: str
property end: int
property parent: GenomeElement

Returns the parent of the genome element.

abstract property sequence: Bio.Seq.Seq
property start: int
property strand: str

class GenomeUtils.Genome.Locus(chr, start, end, strand='+')[source]

Bases: object

Represents a 1-based, inclusive genomic interval on a chromosome.

Parameters:
  • chr (str)

  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

chr: str
contains(other)[source]

Check if this locus completely contains another.

Parameters:

other (Locus)

Return type:

bool

end: int
overlaps(other)[source]

Check if this locus overlaps with another.

Parameters:

other (Locus)

Return type:

bool

start: int
strand: Literal['+', '-'] = '+'
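
The overlap and containment checks can be sketched for 1-based inclusive intervals as follows. LocusSketch is a hypothetical stand-in, not the library class, and it assumes that both checks require the same chromosome and ignore strand; the source does not say how strand is treated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes instances immutable
class LocusSketch:
    """Hypothetical 1-based inclusive interval on a chromosome."""

    chr: str
    start: int
    end: int
    strand: str = "+"

    def overlaps(self, other):
        # Inclusive ends: intervals that share a single base still overlap.
        return (
            self.chr == other.chr
            and self.start <= other.end
            and other.start <= self.end
        )

    def contains(self, other):
        # Every base of `other` must lie inside this interval.
        return (
            self.chr == other.chr
            and self.start <= other.start
            and other.end <= self.end
        )


a = LocusSketch("chr1", 100, 200)
b = LocusSketch("chr1", 200, 300)
print(a.overlaps(b), a.contains(b))
```

With inclusive coordinates, the two intervals above share base 200, so they overlap even though neither contains the other.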
class GenomeUtils.Genome.Transcript(id, chr, start, end, strand, sequence, gene=None, genome=None, **kwargs)[source]

Bases: GenomeElement

Represents a transcript.

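Parameters:
  • id (str)

  • chr (str)

  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • sequence (Seq)

  • gene (Gene)

  • genome (Genome)

add_exon(exon)[source]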

Add an Exon to the transcript, keeping the exons in sorted order.

Parameters:

exon (Exon)

exon_intervals()[source]

Get the exon intervals for this transcript.

Return type:

List[Tuple[int, int]]
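
One way to keep exons sorted on insertion is a binary-search insert. The sketch below is hypothetical: TranscriptSketch stores exons as (start, end, exon_id) tuples and assumes ordering by genomic start regardless of strand, which the source does not confirm.

```python
import bisect


class TranscriptSketch:
    """Hypothetical transcript that keeps its exons ordered by genomic start."""

    def __init__(self):
        self._exons = []  # (start, end, exon_id) tuples, always sorted

    def add_exon(self, start, end, exon_id):
        # insort finds the insertion point by binary search and inserts there.
        bisect.insort(self._exons, (start, end, exon_id))

    def exon_intervals(self):
        # 1-based inclusive (start, end) pairs in sorted order.
        return [(start, end) for start, end, _ in self._exons]


tx = TranscriptSketch()
tx.add_exon(201, 210, "exon2")  # added out of order on purpose
tx.add_exon(101, 110, "exon1")
print(tx.exon_intervals())
```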

property exons: List['Exon']

get_gene()[source]

Returns the Gene object that this transcript is associated with.

Return type:

Gene

property sequence: Bio.Seq.Seq

transcript_to_genomic_pos(start, end=None)[source]

Converts a 0-based, half-open transcript coordinate (or range) to a 1-based, inclusive genomic coordinate (or list of Locus objects).

Parameters:
  • start (int) – The 0-based start position on the transcript.

  • end (int | None) – The optional 0-based end position on the transcript. If None, a single point is converted. If provided, the range is [start, end).

Returns:

  • A Locus object for a single point or for a range within a single exon.

  • A list of Locus objects if the range spans multiple exons.

  • None if a single point maps to no location; an empty list if a range maps to no location.

Return type:

Locus | List[Locus] | None
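
The range conversion described above can be sketched with plain tuples. This is an illustration under stated assumptions, not the library's code: transcript_range_to_genomic is a hypothetical function, exons are given as 1-based inclusive (start, end) pairs sorted by start, and on the '-' strand transcript position 0 is assumed to sit at the highest genomic coordinate.

```python
from typing import List, Tuple


def transcript_range_to_genomic(
    exons: List[Tuple[int, int]],
    start: int,
    end: int,
    strand: str = "+",
) -> List[Tuple[int, int]]:
    """Map 0-based half-open [start, end) on the transcript to 1-based
    inclusive genomic intervals, one per exon touched (hypothetical helper)."""
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    ordered = exons if strand == "+" else list(reversed(exons))
    pieces = []
    offset = 0  # transcript position of the first base of the current exon
    for ex_start, ex_end in ordered:
        length = ex_end - ex_start + 1
        lo = max(start, offset)         # clip the query to this exon
        hi = min(end, offset + length)
        if lo < hi:
            if strand == "+":
                pieces.append((ex_start + lo - offset, ex_start + hi - offset - 1))
            else:
                # Transcript positions count down from the exon's genomic end.
                pieces.append((ex_end - (hi - offset) + 1, ex_end - (lo - offset)))
        offset += length
    return pieces


exons = [(101, 110), (201, 210)]  # two 10-base exons
print(transcript_range_to_genomic(exons, 8, 12))
print(transcript_range_to_genomic(exons, 8, 12, strand="-"))
```

A range that crosses the exon boundary yields one interval per exon, and a range past the end of the transcript yields an empty list, matching the documented return shapes for ranges.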