PyRanges objects

The main object in pyranges1 is the PyRanges object. It is a pandas DataFrame with additional methods for genomic operations.

class pyranges1.core.pyranges_main.PyRanges(*args, **kwargs)

Two-dimensional representation of genomic intervals and their annotations.

A PyRanges object must have the columns Chromosome, Start and End. A Strand column is optional and adds strand information to the intervals. Any other columns are allowed and are considered metadata.

You can initialize a PyRanges object like you would a pandas DataFrame, as long as the resulting DataFrame has the necessary columns (Chromosome, Start, End; Strand is optional). See examples below, and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html for more information.

Parameters:
  • data (dict, pd.DataFrame, or None, default None)

  • index (Index or array-like)

  • columns (Index or array-like)

  • dtype (type, default None)

  • copy (bool or None, default None)

See also

pyranges.read_bed

read bed-file into PyRanges

pyranges.read_bam

read bam-file into PyRanges

pyranges.read_gff

read gff-file into PyRanges

pyranges.read_gtf

read gtf-file into PyRanges

Examples

>>> pr.PyRanges()
index    |    Chromosome    Start      End
int64    |    float64       float64    float64
-------  ---  ------------  ---------  ---------
PyRanges with 0 rows, 3 columns, and 1 index columns.
Contains 0 chromosomes.

You can initialize PyRanges with a DataFrame:

>>> df = pd.DataFrame({"Chromosome": ["chr1", "chr2"], "Start": [100, 200],
...                    "End": [150, 201]})
>>> df
  Chromosome  Start  End
0       chr1    100  150
1       chr2    200  201
>>> pr.PyRanges(df)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1              100      150
      1  |    chr2              200      201
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.

Or you can use a dictionary of iterables:

>>> gr = pr.PyRanges({"Chromosome": [1, 1], "Strand": ["+", "-"], "Start": [1, 4], "End": [2, 27],
...                    "TP": [0, 1], "FP": [12, 11], "TN": [10, 9], "FN": [2, 3]})
>>> gr
  index  |      Chromosome  Strand      Start      End       TP       FP       TN       FN
  int64  |           int64  str         int64    int64    int64    int64    int64    int64
-------  ---  ------------  --------  -------  -------  -------  -------  -------  -------
      0  |               1  +               1        2        0       12       10        2
      1  |               1  -               4       27        1       11        9        3
PyRanges with 2 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Operations that remove a column required for a PyRanges return a DataFrame instead:

>>> gr.drop("Chromosome", axis=1)
  Strand  Start  End  TP  FP  TN  FN
0      +      1    2   0  12  10   2
1      -      4   27   1  11   9   3
>>> pr.PyRanges(dict(Chromosome=["chr1", "chr2"], Start=[1, 2], End=[2, 3]))
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1        2
      1  |    chr2                2        3
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
property chromosomes: list[str]

Return the list of unique chromosomes in this PyRanges, in natsorted order (e.g. chr2 < chr11).
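
The natural ordering mentioned above (chr2 before chr11) can be illustrated with a small standalone sketch. It is not the library's implementation; natural_key is a hypothetical helper that only shows why numeric chunks must be compared as integers rather than as text.

```python
import re


def natural_key(name):
    """Split a name into alternating text and digit chunks.

    re.split with a capturing group puts the digit runs at odd positions,
    so those chunks are converted to int and compared numerically.
    """
    parts = re.split(r"(\d+)", str(name))
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = ["chr11", "chrX", "chr2", "chr1", "chr2"]

# Unique names in natural order; plain string sorting would put chr11 before chr2.
unique_sorted = sorted(set(names), key=natural_key)
print(unique_sorted)
```

With plain lexicographic sorting, "chr11" would come before "chr2" because "1" < "2" character by character.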

property chromosomes_and_strands: list[tuple[str, str]]

Return the list of unique (chromosome, strand) pairs in this PyRanges in natsorted order (e.g. chr2 < chr11).

Examples

>>> gr = pr.PyRanges({"Chromosome": [1, 2, 2, 3], "Start": [1, 2, 3, 9], "End": [3, 3, 10, 12], "Strand": ["+", "-", "+", "-"]})
>>> gr.chromosomes_and_strands
[(1, '+'), (2, '+'), (2, '-'), (3, '-')]
>>> gr.remove_strand().chromosomes_and_strands
Traceback (most recent call last):
...
ValueError: PyRanges has no strand column.
clip_ranges(chromsizes: dict[str | int, int] | PyRanges | None = None, *, remove: bool = False, only_right: bool = False) PyRanges

Clip or remove intervals outside of sequence (e.g. Chromosome) bounds.

Parameters:
  • chromsizes (dict or PyRanges or pyfaidx.Fasta or None, default None) – Dict or PyRanges describing the lengths of the sequences (the “Chromosomes” in the self object). pyfaidx.Fasta object is also accepted since it conveniently loads chromosome length. If None, clipping is only on the left, i.e. for the portions of intervals that are negative (Start < 0).

  • remove (bool, default False) – Drops intervals entirely if they are even partially out of bounds, instead of clipping them

  • only_right (bool, default False) – If True, remove or clip only intervals that are out-of-bounds on the right, and do not alter those out-of-bounds on the left (whose Start is < 0)

Examples

>>> import pyranges1 as pr
>>> d = {"Chromosome": [1, 1, 3], "Start": [1, 249250600, 5], "End": [2, 249250640, 7]}
>>> gr = pr.PyRanges(d)
>>> gr
  index  |      Chromosome      Start        End
  int64  |           int64      int64      int64
-------  ---  ------------  ---------  ---------
      0  |               1          1          2
      1  |               1  249250600  249250640
      2  |               3          5          7
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
>>> chromsizes = {1: 249250621, 3: 500}
>>> chromsizes
{1: 249250621, 3: 500}
>>> gr.clip_ranges(chromsizes)
  index  |      Chromosome      Start        End
  int64  |           int64      int64      int64
-------  ---  ------------  ---------  ---------
      0  |               1          1          2
      1  |               1  249250600  249250621
      2  |               3          5          7
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.clip_ranges(chromsizes, remove=True)
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1        1        2
      2  |               3        5        7
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
>>> del chromsizes[3]
>>> chromsizes
{1: 249250621}
>>> gr.clip_ranges(chromsizes)
Traceback (most recent call last):
...
ValueError: Not all chromosomes were in the chromsize dict.
Missing keys: {3}.
>>> w = pr.PyRanges({"Chromosome": [1, 1, 1], "Start": [-10, 249250600, 100], "End": [2, 249250640, 150]})
>>> w
  index  |      Chromosome      Start        End
  int64  |           int64      int64      int64
-------  ---  ------------  ---------  ---------
      0  |               1        -10          2
      1  |               1  249250600  249250640
      2  |               1        100        150
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 0
>>> w.clip_ranges()
  index  |      Chromosome      Start        End
  int64  |           int64      int64      int64
-------  ---  ------------  ---------  ---------
      0  |               1          0          2
      1  |               1  249250600  249250640
      2  |               1        100        150
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> w.clip_ranges({1:249250620}, only_right=True)
  index  |      Chromosome      Start        End
  int64  |           int64      int64      int64
-------  ---  ------------  ---------  ---------
      0  |               1        -10          2
      1  |               1  249250600  249250620
      2  |               1        100        150
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 0
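
The clipping rules shown in the examples above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: clip_intervals is a hypothetical helper working on (chromosome, start, end) tuples, and dropping intervals that become empty after clipping is an assumption of this sketch.

```python
def clip_intervals(intervals, chromsizes=None, *, remove=False, only_right=False):
    """Clip (or drop) (chrom, start, end) tuples that extend past sequence bounds."""
    result = []
    for chrom, start, end in intervals:
        size = None
        if chromsizes is not None:
            if chrom not in chromsizes:
                raise ValueError(f"Missing chromosome size for {chrom!r}")
            size = chromsizes[chrom]

        # Left side is only checked when only_right is False.
        out_left = start < 0 and not only_right
        # Right side can only be checked when a size is known.
        out_right = size is not None and end > size

        if remove:
            # Drop the interval entirely if any part is out of bounds.
            if not (out_left or out_right):
                result.append((chrom, start, end))
            continue

        new_start = 0 if out_left else start
        new_end = size if out_right else end
        # Assumption of this sketch: intervals left empty by clipping are dropped.
        if new_start < new_end:
            result.append((chrom, new_start, new_end))
    return result


intervals = [(1, 1, 2), (1, 249250600, 249250640), (3, 5, 7)]
sizes = {1: 249250621, 3: 500}
clipped = clip_intervals(intervals, sizes)
kept = clip_intervals(intervals, sizes, remove=True)
print(clipped)
print(kept)
```

The two calls reproduce the coordinates of the first two clip_ranges examples above.
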
cluster_overlaps(use_strand: Literal['auto'] | bool = 'auto', *, match_by: str | Iterable[str] | None = None, slack: int = 0, cluster_column: str = 'Cluster') PyRanges

Give overlapping intervals a common id.

Parameters:
  • use_strand ({"auto", True, False}, default: "auto") – Whether to cluster only intervals on the same strand. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • slack (int, default 0) – Length by which the criteria of overlap are loosened. A value of 1 also clusters bookended intervals. Higher slack values cluster more distant intervals (with a maximum distance of slack-1 between them).

  • cluster_column (str, default "Cluster") – Name of the cluster column added to the output.

Returns:

PyRanges with an ID column added (named by cluster_column, “Cluster” by default).

Return type:

PyRanges

See also

PyRanges.merge

combine overlapping intervals into one

Examples

>>> gr = pr.PyRanges(dict(Chromosome=1, Start=[5, 6, 12, 16, 20, 22, 24], End=[9, 8, 16, 18, 23, 25, 27]))
>>> gr
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1        5        9
      1  |               1        6        8
      2  |               1       12       16
      3  |               1       16       18
      4  |               1       20       23
      5  |               1       22       25
      6  |               1       24       27
PyRanges with 7 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.cluster_overlaps()
  index  |      Chromosome    Start      End    Cluster
  int64  |           int64    int64    int64     uint32
-------  ---  ------------  -------  -------  ---------
      0  |               1        5        9          0
      1  |               1        6        8          0
      2  |               1       12       16          1
      3  |               1       16       18          2
      4  |               1       20       23          3
      5  |               1       22       25          3
      6  |               1       24       27          3
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.

Setting slack=1 also clusters bookended intervals:

>>> gr.cluster_overlaps(slack=1)
  index  |      Chromosome    Start      End    Cluster
  int64  |           int64    int64    int64     uint32
-------  ---  ------------  -------  -------  ---------
      0  |               1        5        9          0
      1  |               1        6        8          0
      2  |               1       12       16          1
      3  |               1       16       18          1
      4  |               1       20       23          2
      5  |               1       22       25          2
      6  |               1       24       27          2
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.

Higher values of slack will cluster more distant intervals:

>>> gr.cluster_overlaps(slack=3)
  index  |      Chromosome    Start      End    Cluster
  int64  |           int64    int64    int64     uint32
-------  ---  ------------  -------  -------  ---------
      0  |               1        5        9          0
      1  |               1        6        8          0
      2  |               1       12       16          1
      3  |               1       16       18          1
      4  |               1       20       23          1
      5  |               1       22       25          1
      6  |               1       24       27          1
PyRanges with 7 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
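
The effect of slack can be reproduced with a short single-chromosome sketch. This is an assumption-laden illustration rather than the library's algorithm: cluster_ids is a hypothetical helper that sweeps the intervals in sorted order and starts a new cluster whenever the next Start is at least slack bases past the running End.

```python
def cluster_ids(intervals, slack=0):
    """Assign a cluster id to each (start, end) tuple, in input order."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    ids = [0] * len(intervals)
    cluster = -1
    current_end = None
    for i in order:
        start, end = intervals[i]
        if current_end is None or start >= current_end + slack:
            # Too far from the running cluster: open a new one.
            cluster += 1
            current_end = end
        else:
            # Joins the running cluster; extend its right boundary if needed.
            current_end = max(current_end, end)
        ids[i] = cluster
    return ids


ivs = [(5, 9), (6, 8), (12, 16), (16, 18), (20, 23), (22, 25), (24, 27)]
print(cluster_ids(ivs))           # strict overlap only
print(cluster_ids(ivs, slack=1))  # bookended intervals join
print(cluster_ids(ivs, slack=3))  # gaps of up to 2 bases join
```

The three calls use the same coordinates as the cluster_overlaps examples above and yield the same cluster assignments.
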
combine_interval_columns(function: Literal['intersect', 'union', 'swap'] | CombineIntervalColumnsOperation = 'intersect', *, start: str = 'Start', end: str = 'End', start2: str = 'Start_b', end2: str = 'End_b', drop_old_columns: bool = True) PyRanges

Use two pairs of columns representing intervals to create a new start and end column.

The function is designed as post-processing after join_overlaps to aggregate the coordinates of the two intervals. By default, the new start and end columns will be the intersection of the intervals.

Parameters:
  • function ({"intersect", "union", "swap"} or Callable, default "intersect") – How to combine the self and other intervals: “intersect”, “union”, or “swap”. If a callable is passed, it should take four Series arguments (start1, end1, start2, end2) and return a tuple of two integer Series: (new_starts, new_ends).

  • start (str, default "Start") – Column name for Start of first interval

  • end (str, default "End") – Column name for End of first interval

  • start2 (str, default "Start_b") – Column name for Start of second interval

  • end2 (str, default "End_b") – Column name for End of second interval

  • drop_old_columns (bool, default True) – Whether to drop the above mentioned columns.

Examples

>>> gr1 = pr.example_data.aorta.head(3).remove_nonloc_columns()
>>> gr1
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9916    10115  -
      1  |    chr1             9939    10138  +
      2  |    chr1             9951    10150  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr2 = pr.example_data.aorta2.head(3).remove_nonloc_columns()
>>> gr2
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9988    10187  -
      1  |    chr1            10073    10272  +
      2  |    chr1            10079    10278  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> j = gr1.join_overlaps(gr2)
>>> j
  index  |    Chromosome      Start      End  Strand        Start_b    End_b
  int64  |    category        int64    int64  category        int64    int64
-------  ---  ------------  -------  -------  ----------  ---------  -------
      0  |    chr1             9916    10115  -                9988    10187
      0  |    chr1             9916    10115  -               10079    10278
      1  |    chr1             9939    10138  +               10073    10272
      2  |    chr1             9951    10150  -                9988    10187
      2  |    chr1             9951    10150  -               10079    10278
PyRanges with 5 rows, 6 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.

Combine the interval coordinates in different ways:

>>> j.combine_interval_columns()        # default: "intersect"
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9988    10115  -
      0  |    chr1            10079    10115  -
      1  |    chr1            10073    10138  +
      2  |    chr1             9988    10150  -
      2  |    chr1            10079    10150  -
PyRanges with 5 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> j.combine_interval_columns("union")
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9916    10187  -
      0  |    chr1             9916    10278  -
      1  |    chr1             9939    10272  +
      2  |    chr1             9951    10187  -
      2  |    chr1             9951    10278  -
PyRanges with 5 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> j.combine_interval_columns("swap")
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9988    10187  -
      0  |    chr1            10079    10278  -
      1  |    chr1            10073    10272  +
      2  |    chr1             9988    10187  -
      2  |    chr1            10079    10278  -
PyRanges with 5 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> def custom_combine(s1, e1, s2, e2):   # keep Start from first, End from second
...     return (s1, e2)
>>> j.combine_interval_columns(custom_combine)
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1             9916    10187  -
      0  |    chr1             9916    10278  -
      1  |    chr1             9939    10272  +
      2  |    chr1             9951    10187  -
      2  |    chr1             9951    10278  -
PyRanges with 5 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
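
For a single pair of intervals, the three built-in modes reduce to simple min/max arithmetic. The sketch below is illustrative only; combine_pair is a hypothetical scalar helper, whereas the real method operates on whole columns.

```python
def combine_pair(start, end, start2, end2, how="intersect"):
    """Combine two intervals into one (start, end) pair."""
    if how == "intersect":
        # Shared portion: latest start, earliest end.
        return max(start, start2), min(end, end2)
    if how == "union":
        # Full span covered by either interval.
        return min(start, start2), max(end, end2)
    if how == "swap":
        # Keep the coordinates of the second interval.
        return start2, end2
    raise ValueError(f"Unknown mode: {how!r}")


# First row of the joined example above: (9916, 10115) vs (9988, 10187).
row = (9916, 10115, 9988, 10187)
print(combine_pair(*row))
print(combine_pair(*row, how="union"))
print(combine_pair(*row, how="swap"))
```

Note that "intersect" is only meaningful for pairs that actually overlap, which is the case after join_overlaps.
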
complement_ranges(group_by: str | Iterable[str] | None = None, *, use_strand: Literal['auto'] | bool = 'auto', include_first_interval: bool = False, group_sizes_col: str = 'Chromosome', chromsizes: dict[str | int, int] | None = None) PyRanges

Return the internal complement of the intervals, i.e. its introns.

The complement of an interval is the set of intervals that are not covered by the original interval. This function is useful for obtaining the introns of a set of exons, corresponding to the “internal” complement, i.e. excluding the first and last portion of each chromosome not covered by intervals.

Parameters:
  • group_by (str or list, optional) – Column(s) to group intervals (e.g. exons into transcripts). If provided, the complement will be calculated separately for each group.

  • use_strand ({"auto", True, False}, default "auto") – Whether to return complement intervals separately for those on the positive and negative strands. The default “auto” means that strand information is used if present and valid (see .strand_valid).

  • include_first_interval (bool, default False) – If True, include the external complement interval at the beginning of the chromosome (or group), i.e. the interval from the start of the chromosome up to the first interval.

  • group_sizes_col (str, default "Chromosome") – The column name used to match keys in the chromsizes mapping. This determines the total size of each chromosome (or group) when calculating external complement intervals.

  • chromsizes (dict[str | int, int] or None, optional) – If provided, external complement intervals will also be returned, i.e. the intervals corresponding to the beginning of the chromosome up to the first interval and from the last interval to the end of the chromosome. The dictionary should map chromosome (or group) identifiers to their total sizes. A PyRanges or pyfaidx.Fasta object is also accepted since it conveniently loads chromosome lengths.

Notes

  • To ensure non-overlap among the input intervals, merge_overlaps is run before the complement is calculated.

  • Bookended intervals will result in no complement intervals returned since they would be of length 0.

See also

PyRanges.subtract_overlaps

report non-overlapping subintervals

PyRanges.outer_ranges

report the boundaries of groups of intervals (e.g. transcripts/genes)

Examples

>>> a = pr.PyRanges(dict(Chromosome="chr1", Start=[2, 10, 20, 40], End=[5, 18, 30, 46], ID=['a', 'a', 'b', 'b']))
>>> a
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                2        5  a
      1  |    chr1               10       18  a
      2  |    chr1               20       30  b
      3  |    chr1               40       46  b
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> a.complement_ranges('ID', group_sizes_col="ID", chromsizes={"a": 22, "b": 100}, include_first_interval=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                0        2  a
      1  |    chr1                5       10  a
      2  |    chr1               18       22  a
      3  |    chr1                0       20  b
      4  |    chr1               30       40  b
      5  |    chr1               46      100  b
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.

Use the complement to get introns (grouping by ID), then get the complement of the whole set of intervals, without grouping:

>>> a.complement_ranges('ID')
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                5       10  a
      1  |    chr1               30       40  b
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> a.complement_ranges()
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                5       10
      1  |    chr1               18       20
      2  |    chr1               30       40
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Include external intervals:

>>> a.complement_ranges(chromsizes={'chr1': 10000}, include_first_interval=True)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                0        2
      1  |    chr1                5       10
      2  |    chr1               18       20
      3  |    chr1               30       40
      4  |    chr1               46    10000
PyRanges with 5 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> a.complement_ranges('ID', chromsizes={'chr1': 10000}, include_first_interval=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                0        2  a
      1  |    chr1                5       10  a
      2  |    chr1               18    10000  a
      3  |    chr1                0       20  b
      4  |    chr1               30       40  b
      5  |    chr1               46    10000  b
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.

For the complement of whole sets of intervals, you can set use_strand explicitly:

>>> b = pr.PyRanges(dict(Chromosome="chr1", Start=[1, 10, 20, 40], End=[5, 18, 30, 46],
...                      Strand=['+', '+', '-', '-']))
>>> b
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      1  |    chr1               10       18  +
      2  |    chr1               20       30  -
      3  |    chr1               40       46  -
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> b.complement_ranges(use_strand=True)  # same as b.complement_ranges() because b.strand_valid == True
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                5       10  +
      1  |    chr1               30       40  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> b.complement_ranges(use_strand=False)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                5       10
      1  |    chr1               18       20
      2  |    chr1               30       40
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> b.complement_ranges(use_strand=False, chromsizes={'chr1': 10000}, include_first_interval=True)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                0        1
      1  |    chr1                5       10
      2  |    chr1               18       20
      3  |    chr1               30       40
      4  |    chr1               46    10000
PyRanges with 5 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Bookended intervals (indices 0-1 below) and overlapping intervals (2-3) won’t return any in-between intervals:

>>> c = pr.PyRanges(dict(Chromosome="chr1", Start=[1, 5, 8, 10], End=[5, 7, 14, 16]))
>>> c.complement_ranges()
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                7        8
PyRanges with 1 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
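
The internal and external complement can be sketched for one chromosome (or group) in plain Python. This is not the library's implementation: complement is a hypothetical helper, and treating the leading gap (include_first) and the trailing gap (chrom_size) as independent switches is an assumption of this sketch.

```python
def complement(intervals, chrom_size=None, include_first=False):
    """Return the gaps between (start, end) tuples of one chromosome or group."""
    ivs = sorted(intervals)
    gaps = []
    if not ivs:
        return gaps

    # Optional leading external gap, from position 0 to the first interval.
    if include_first and ivs[0][0] > 0:
        gaps.append((0, ivs[0][0]))

    # Sweep with a running End, which merges overlapping intervals on the fly.
    current_end = ivs[0][1]
    for start, end in ivs[1:]:
        if start > current_end:  # bookended intervals leave no gap
            gaps.append((current_end, start))
        current_end = max(current_end, end)

    # Optional trailing external gap, up to the end of the sequence.
    if chrom_size is not None and current_end < chrom_size:
        gaps.append((current_end, chrom_size))
    return gaps


a = [(2, 5), (10, 18), (20, 30), (40, 46)]
c = [(1, 5), (5, 7), (8, 14), (10, 16)]
print(complement(a))
print(complement(a, chrom_size=10000, include_first=True))
print(complement(c))
```

The coordinates match the ungrouped complement_ranges examples above, including the bookended and overlapping case.
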
compute_interval_metrics(metrics: str | Iterable[str] | Mapping[str, str] = 'fraction', *, start: str = 'Start', end: str = 'End', start2: str = 'Start_b', end2: str = 'End_b', denom: str = 'first') PyRanges

Attach interval-relationship metrics as new columns.

Parameters:
  • metrics

    One of the following forms:
    • single string, e.g. “overlap_length”

    • iterable of strings, e.g. [“fraction”, “jaccard”]

    • mapping {metric_name -> new_column_name} to rename on the fly

    Accepted metric names are listed in VALID_METRICS.

  • denom ({"first", "second", "union"}, default "first") – Denominator for the fraction metric. Must be “first”, “second” or “union”.

  • start (str, default "Start") – Column name holding the start of the first interval.

  • end (str, default "End") – Column name holding the end of the first interval.

  • start2 (str, default "Start_b") – Column name holding the start of the second interval.

  • end2 (str, default "End_b") – Column name holding the end of the second interval.

Returns:

Copy of self with extra metric columns.

Return type:

RangeFrame

Metrics

  • overlap_length – Raw number of overlapping bases.

  • fraction – Overlap divided by a denominator chosen with denom (“first”, “second”, or “union”).

  • jaccard – Overlap divided by the union length of the two intervals.

  • distance – Positive gap in bases when intervals do not touch; 0 when they overlap or abut.

  • overlap – Boolean flag - True if at least one base overlaps.

  • signed_distance – Same as distance but signed: negative when the second interval is upstream of the first, positive when downstream, 0 when touching/overlapping.

  • midpoint_distance – Absolute distance between interval midpoints.

  • symmetric_coverage – 2 * overlap ÷ (length1 + length2). Ranges from 0 to 1.

  • relative_direction – For frames that contain “Strand” and “Strand_b”: “same” if strands match, “opposite” if they differ, “unknown” if either strand is “.” or missing.

Examples

>>> import pandas as pd
>>> import pyranges1 as pr
>>> df = pd.DataFrame(
...     {
...         "Chromosome": ["chr1"] * 5,
...         "Start":      [2, 10, 20, 40, 80],
...         "End":        [8, 12, 25, 45, 85],
...         "Strand":     ["+", "-", "+", "+", "-"],
...         "Start_b":    [5,  9, 23, 60, 70],
...         "End_b":      [7, 20, 30, 70, 75],
...         "Strand_b":   ["+", "+", "-", "-", "+"],
...     }
... )
>>> gr = pr.PyRanges(df)

Overlap length:

>>> gr.compute_interval_metrics("overlap_length")["overlap_length"].tolist()
[2, 2, 2, 0, 0]

Fraction (overlap / first interval length):

>>> gr.compute_interval_metrics("fraction")["fraction"].round(2).tolist()
[0.33, 1.0, 0.4, 0.0, 0.0]

Jaccard:

>>> gr.compute_interval_metrics("jaccard")["jaccard"].round(2).tolist()
[0.33, 0.18, 0.2, 0.0, 0.0]

Distance (unsigned gap; 0 when overlapping):

>>> gr.compute_interval_metrics("distance")["distance"].tolist()
[0, 0, 0, 15, 5]

Overlap flag:

>>> gr.compute_interval_metrics("overlap")["overlap"].tolist()
[True, True, True, False, False]

Signed distance:

>>> gr.compute_interval_metrics("signed_distance")["signed_distance"].tolist()
[0, 0, 0, 15, -5]

Midpoint distance:

>>> gr.compute_interval_metrics("midpoint_distance")["midpoint_distance"].tolist()
[1.0, 3.5, 4.0, 22.5, 10.0]

Symmetric coverage:

>>> gr.compute_interval_metrics("symmetric_coverage")["symmetric_coverage"].round(2).tolist()
[0.5, 0.31, 0.33, 0.0, 0.0]

Relative direction (requires strand columns):

>>> gr.compute_interval_metrics("relative_direction")["relative_direction"].tolist()
['same', 'opposite', 'opposite', 'opposite', 'opposite']
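
The numeric metrics described above follow from a few lines of arithmetic on one pair of half-open intervals. The sketch below is a scalar illustration, not the library's vectorized code; interval_metrics is a hypothetical helper and it assumes both intervals have positive length.

```python
def interval_metrics(start, end, start2, end2, denom="first"):
    """Compute pairwise metrics for two half-open intervals."""
    len1, len2 = end - start, end2 - start2
    overlap = max(0, min(end, end2) - max(start, start2))
    union = len1 + len2 - overlap
    denominators = {"first": len1, "second": len2, "union": union}

    # Gap between the intervals; 0 when they overlap or abut.
    gap = max(0, max(start, start2) - min(end, end2))
    # Negative when the second interval lies upstream of the first.
    signed = -gap if end2 <= start else gap

    return {
        "overlap_length": overlap,
        "fraction": overlap / denominators[denom],
        "jaccard": overlap / union,
        "distance": gap,
        "overlap": overlap > 0,
        "signed_distance": signed,
        "midpoint_distance": abs((start + end) / 2 - (start2 + end2) / 2),
        "symmetric_coverage": 2 * overlap / (len1 + len2),
    }


first_row = interval_metrics(2, 8, 5, 7)      # second interval nested in the first
last_row = interval_metrics(80, 85, 70, 75)   # second interval upstream, 5 bases away
print(first_row)
print(last_row)
```

The two pairs are the first and last rows of the example frame above.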

copy(*args, **kwargs) PyRanges

Return a copy of the PyRanges.

count_overlaps(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', *, match_by: str | list[str] | None = None, slack: int = 0, overlap_col: str = 'Count') PyRanges

Count number of overlaps per interval.

For each interval in self, report how many intervals in ‘other’ overlap with it.

Parameters:
  • other (PyRanges) – Count overlaps with this PyRanges.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • slack (int, default 0) – Temporarily lengthen intervals in self before searching for overlaps.

  • overlap_col (str, default "Count") – Name of column with overlap counts.

Returns:

PyRanges with a column of overlaps added.

Return type:

PyRanges

See also

pyranges.count_overlaps

count overlaps from multiple PyRanges

Examples

>>> f1 = pr.example_data.f1.remove_nonloc_columns()
>>> f1
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1                3        6  +
      1  |    chr1                5        7  -
      2  |    chr1                8        9  +
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f2 = pr.example_data.f2.remove_nonloc_columns()
>>> f2
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1                1        2  +
      1  |    chr1                6        7  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f1.count_overlaps(f2)
  index  |    Chromosome      Start      End  Strand         Count
  int64  |    category        int64    int64  category      uint32
-------  ---  ------------  -------  -------  ----------  --------
      0  |    chr1                3        6  +                  0
      1  |    chr1                5        7  -                  1
      2  |    chr1                8        9  +                  0
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f1.count_overlaps(f2, slack=1, strand_behavior="ignore")
  index  |    Chromosome      Start      End  Strand         Count
  int64  |    category        int64    int64  category      uint32
-------  ---  ------------  -------  -------  ----------  --------
      0  |    chr1                3        6  +                  1
      1  |    chr1                5        7  -                  1
      2  |    chr1                8        9  +                  0
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> annotation = pr.example_data.ensembl_gtf.get_with_loc_columns(['transcript_id', 'Feature'])
>>> reads = pr.random(1000, chromsizes={'1':150000}, strand=False, seed=123)
>>> annotation.count_overlaps(reads, overlap_col="NumberOverlaps")
index    |    Chromosome    Start    End      Strand      transcript_id    Feature     NumberOverlaps
int64    |    category      int64    int64    category    str              category    uint32
-------  ---  ------------  -------  -------  ----------  ---------------  ----------  ----------------
0        |    1             11868    14409    +           nan              gene        17
1        |    1             11868    14409    +           ENST00000456328  transcript  17
2        |    1             11868    12227    +           ENST00000456328  exon        3
3        |    1             12612    12721    +           ENST00000456328  exon        1
...      |    ...           ...      ...      ...         ...              ...         ...
7        |    1             120724   133723   -           ENST00000610542  transcript  76
8        |    1             133373   133723   -           ENST00000610542  exon        1
9        |    1             129054   129223   -           ENST00000610542  exon        3
10       |    1             120873   120932   -           ENST00000610542  exon        1
PyRanges with 11 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
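
The counting in the f1/f2 examples above can be approximated in plain Python. This is only a sketch: the helper count_overlaps_sketch is hypothetical (it is not the pyranges1 implementation), it ignores strand and chromosome, it uses 0-based half-open coordinates, and it assumes that slack widens each interval in self on both sides before overlaps are tested.

```python
def count_overlaps_sketch(queries, targets, slack=0):
    """Count, for each (start, end) in queries, the overlapping targets.

    Hypothetical helper for illustration; strand and chromosome are ignored.
    """
    counts = []
    for q_start, q_end in queries:
        # Assumption: slack temporarily widens the query on both ends.
        lo, hi = q_start - slack, q_end + slack
        # Half-open intervals overlap when each one starts before the other ends.
        counts.append(sum(1 for t_start, t_end in targets if lo < t_end and t_start < hi))
    return counts


f1 = [(3, 6), (5, 7), (8, 9)]
f2 = [(1, 2), (6, 7)]
print(count_overlaps_sketch(f1, f2))           # [0, 1, 0]
print(count_overlaps_sketch(f1, f2, slack=1))  # [1, 1, 0]
```
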
downstream(length: int, gap: int = 0, *, group_by: str | Iterable[str] | None = None, use_strand: Literal['auto'] | bool = 'auto') PyRanges

Return regions downstream (at the 3’ side) of input intervals.

Parameters:
  • length (int) – Size of the region (bp), > 0.

  • gap (int, default 0) – Distance between input intervals and region; use negative to include some overlap.

  • group_by (str or list of str or None) – Name(s) of column(s) to group intervals. If provided, one region per group (e.g. transcript) is returned.

  • use_strand ({"auto", True, False}, default: "auto") – Whether to consider strand; if so, the downstream window of negative intervals is on their left. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

See also

PyRanges.slice_ranges

obtain subsequences of intervals, providing transcript-level coordinates

PyRanges.upstream

return regions upstream of input intervals or transcripts

PyRanges.three_end

return the 3’ end of intervals or transcripts

PyRanges.extend_ranges

return intervals or transcripts extended at one or both ends

Examples

>>> a = pr.PyRanges({'Chromosome':['chr1','chr1'],
...                  'Start':[100,200],'End':[120,220],
...                  'Strand':['+','-']})
>>> a
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              100      120  +
      1  |    chr1              200      220  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

A 10-bp window directly adjacent to the feature (default gap of 0):

>>> a.downstream(10)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              120      130  +
      1  |    chr1              190      200  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

With a 5-bp gap:

>>> a.downstream(10, gap=5)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              125      135  +
      1  |    chr1              185      195  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

With a 5-bp overlap:

>>> a.downstream(10, gap=-5)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              115      125  +
      1  |    chr1              195      205  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Transcript-aware (two 2-exon transcripts):

>>> ex = pr.PyRanges({'Chromosome':['chr1']*4,
...                   'Start':[0,10,30,50],'End':[5,15,40,60],
...                   'Strand':['+','+','-','-'],
...                   'Tx':['tx1','tx1','tx2','tx2']})
>>> ex
  index  |    Chromosome      Start      End  Strand    Tx
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    chr1                0        5  +         tx1
      1  |    chr1               10       15  +         tx1
      2  |    chr1               30       40  -         tx2
      3  |    chr1               50       60  -         tx2
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> ex.downstream(5, group_by='Tx')
  index  |    Chromosome      Start      End  Strand    Tx
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      1  |    chr1               15       20  +         tx1
      2  |    chr1               25       30  -         tx2
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note that downstream regions may extend beyond the start of the chromosome, resulting in invalid ranges. See clip_ranges() to fix this.

>>> ex.downstream(50, group_by='Tx')
  index  |    Chromosome      Start      End  Strand    Tx
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      1  |    chr1               15       65  +         tx1
      2  |    chr1              -20       30  -         tx2
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 2
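
The coordinate arithmetic behind the single-interval examples above can be sketched in plain Python. The helper downstream_window is hypothetical and not part of pyranges1; it assumes 0-based half-open coordinates and handles one interval at a time (no group_by).

```python
def downstream_window(start, end, strand, length, gap=0):
    """Return (start, end) of the window downstream of one interval.

    Hypothetical helper for illustration only; not part of pyranges1.
    """
    if strand == "-":
        # On the reverse strand, downstream lies to the left of Start.
        window_end = start - gap
        return window_end - length, window_end
    # On the forward strand, downstream lies to the right of End.
    window_start = end + gap
    return window_start, window_start + length


print(downstream_window(100, 120, "+", 10))          # (120, 130)
print(downstream_window(200, 220, "-", 10))          # (190, 200)
print(downstream_window(200, 220, "-", 10, gap=5))   # (185, 195)
print(downstream_window(100, 120, "+", 10, gap=-5))  # (115, 125)
```

A negative gap simply shifts the window back into the interval, which is why the last call overlaps the feature by 5 bp.
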
extend_ranges(ext: int | None = None, ext_5: int | None = None, ext_3: int | None = None, group_by: str | Iterable[str] | None = None, use_strand: Literal['auto'] | bool = 'auto') PyRanges

Extend the intervals from the 5’ and/or 3’ ends.

The Strand (if valid) is considered when extending the intervals: a 5’ extension applies to the Start of a “+” strand interval and to the End of a “-” strand interval.

Parameters:
  • ext (int or None) – Extend intervals by this amount from both ends.

  • ext_5 (int or None) – Extend intervals by this amount from the 5’ end.

  • ext_3 (int or None) – Extend intervals by this amount from the 3’ end.

  • group_by (str or list of str, default: None) – group intervals by these column name(s) (e.g. into multi-exon transcripts), so that the extension is applied only to the left-most and/or right-most interval.

  • use_strand ({"auto", True, False}, default: "auto") – If False, ignore strand information when extending intervals. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

See also

PyRanges.slice_ranges

obtain subsequences of intervals, providing transcript-level coordinates

PyRanges.upstream

return regions upstream of input intervals or transcripts

PyRanges.downstream

return regions downstream of input intervals or transcripts

PyRanges.five_end

return the 5’ end of intervals or transcripts

PyRanges.three_end

return the 3’ end of intervals or transcripts

PyRanges.extend_ranges

return intervals or transcripts extended at one or both ends

Examples

>>> d = {'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5], 'End': [6, 9, 7],
...      'Strand': ['+', '+', '-']}
>>> gr = pr.PyRanges(d)
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      1  |    chr1                8        9  +
      2  |    chr1                5        7  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.extend_ranges(3)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                0        9  +
      1  |    chr1                5       12  +
      2  |    chr1                2       10  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.extend_ranges(ext_3=1, ext_5=2)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        7  +
      1  |    chr1                6       10  +
      2  |    chr1                4        9  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.extend_ranges(ext_3=1, ext_5=2, use_strand=False)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        7  +
      1  |    chr1                6       10  +
      2  |    chr1                3        8  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Extending by negative values will contract the intervals. This may yield invalid intervals:

>>> gr.extend_ranges(-1)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                4        5  +
      1  |    chr1                9        8  +
      2  |    chr1                6        6  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
Invalid ranges:
  * 2 intervals are empty or negative length (end <= start). See indexes: 1, 2

Extending beyond the boundaries of the chromosome is allowed though it yields invalid ranges (below). See clip_ranges() to fix this.

>>> gr.extend_ranges(4)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1               -1       10  +
      1  |    chr1                4       13  +
      2  |    chr1                1       11  -
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 0
>>> gr['transcript_id']=['a', 'a', 'b']
>>> gr.extend_ranges(group_by='transcript_id', ext_3=3)
  index  |    Chromosome      Start      End  Strand    transcript_id
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ---------------
      0  |    chr1                3        6  +         a
      1  |    chr1                8       12  +         a
      2  |    chr1                2        7  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
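
The strand rule described above (a 5’ extension moves Start on “+” and End on “-”) can be sketched for a single interval as follows. The helper extend_interval is hypothetical, not the pyranges1 implementation, and does not cover group_by.

```python
def extend_interval(start, end, strand, ext_5=0, ext_3=0, use_strand=True):
    """Extend one interval at its 5' and/or 3' end.

    Hypothetical helper for illustration only; not part of pyranges1.
    """
    if use_strand and strand == "-":
        # Reverse strand: the 5' end is End and the 3' end is Start.
        return start - ext_3, end + ext_5
    # Forward strand, or strand ignored: the 5' end is Start.
    return start - ext_5, end + ext_3


intervals = [(3, 6, "+"), (8, 9, "+"), (5, 7, "-")]
print([extend_interval(s, e, strand, ext_5=2, ext_3=1) for s, e, strand in intervals])
# [(1, 7), (6, 10), (4, 9)]
print([extend_interval(s, e, strand, ext_5=2, ext_3=1, use_strand=False)
       for s, e, strand in intervals])
# [(1, 7), (6, 10), (3, 8)]
```
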
five_end(group_by: str | Iterable[str] | None = None, ext: int = 0) PyRanges

Return the five prime end of intervals.

The five prime end is the start of a forward strand or the end of a reverse strand. All returned intervals have a length of 1, unless ext is provided.

Parameters:
  • group_by (str or list of str, default: None) – Optional column name(s). If provided, the five prime end is calculated for each group of intervals.

  • ext (int, default 0) – Lengthen the resulting intervals on both ends by this amount.

See also

PyRanges.upstream

return regions upstream of input intervals or transcripts

PyRanges.three_end

return the 3’ end of intervals or transcripts

PyRanges.extend_ranges

return intervals or transcripts extended at one or both ends

Returns:

PyRanges with the five prime ends

Return type:

PyRanges

Note

Requires the PyRanges to be stranded.

See also

PyRanges.three_end

return the 3’ end

PyRanges.slice_ranges

return subintervals specified in relative mRNA-based coordinates

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 10, 5], 'End': [9, 14, 7],
...                    'Strand': ["+", "+", "-"], 'Name': ['a', 'a', 'b']})
>>> gr
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                3        9  +         a
      1  |    chr1               10       14  +         a
      2  |    chr1                5        7  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.five_end()
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                3        4  +         a
      1  |    chr1               10       11  +         a
      2  |    chr1                6        7  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.five_end(group_by='Name')
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                3        4  +         a
      2  |    chr1                6        7  -         b
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.five_end(group_by='Name', ext=1)
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                2        5  +         a
      2  |    chr1                5        8  -         b
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
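
For a single interval, the definition above amounts to the following arithmetic. The helper five_prime_end is a hypothetical sketch, not part of pyranges1; it assumes 0-based half-open coordinates and does not handle group_by.

```python
def five_prime_end(start, end, strand, ext=0):
    """Return the 1-bp interval at the 5' end, widened by ext on both sides.

    Hypothetical helper for illustration only; not part of pyranges1.
    """
    # The 5' base is the first base on "+" and the last base on "-".
    pos = end - 1 if strand == "-" else start
    return pos - ext, pos + 1 + ext


print(five_prime_end(3, 9, "+"))         # (3, 4)
print(five_prime_end(5, 7, "-"))         # (6, 7)
print(five_prime_end(3, 9, "+", ext=1))  # (2, 5)
print(five_prime_end(5, 7, "-", ext=1))  # (5, 8)
```
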
flip_strand() PyRanges

Flip the strand of every interval (+ → - and - → +).

All other columns remain unchanged. If the object does not contain a valid Strand column (see .strand_valid), a ValueError is raised.

Returns:

A new PyRanges whose Strand column is flipped.

Return type:

PyRanges

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1'],
...                   'Start': [0, 10], 'End': [5, 15],
...                   'Strand': ['+', '-']})
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                0        5  +
      1  |    chr1               10       15  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.flip_strand()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                0        5  -
      1  |    chr1               10       15  +
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Attempting to flip when strands are missing or invalid:

>>> pr.PyRanges({'Chromosome': ['chr1'], 'Start': [0], 'End': [5]}).flip_strand()
Traceback (most recent call last):
    ...
ValueError: strand column is missing or invalid
get_sequence(path: Path | None = None, *, pyfaidx_fasta: pyfaidx.Fasta | None = None, use_strand: Literal['auto'] | bool = 'auto', group_by: str | Iterable[str] | None = None, sequence_column: str = 'Sequence') Series

Get the sequence of the intervals from a fasta file.

Parameters:
  • path (Path) – Path to fasta file. It will be indexed using pyfaidx if an index is not found.

  • pyfaidx_fasta (pyfaidx.Fasta) – Alternative method to provide fasta target, as a pyfaidx.Fasta object

  • use_strand ({"auto", True, False}, default: "auto") – If True, intervals on the reverse strand will be reverse complemented. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • group_by (str or list of str, optional) – If provided, intervals are grouped by this/these ID column(s) and the corresponding sequences are concatenated 5’->3’. This is useful for obtaining the full sequences of multi-exon transcripts.

  • sequence_column (str, default "Sequence") – Name given to the returned Series.

Returns:

Sequences, one per interval, with the same index as self. If group_by is provided, instead returns one sequence per group, with the index being the group ID(s).

Return type:

Series

Note

This function requires the library pyfaidx, which can be installed with conda install -c bioconda pyfaidx or pip install pyfaidx.

Sorting the PyRanges is likely to improve the speed. Intervals on the negative strand will be reverse complemented.

Warning

Note that the names in the fasta header and self.Chromosome must be the same.

See also

pyranges.seqs

submodule with sequence-related functions

Examples

>>> import pyranges1 as pr
>>> r = pr.PyRanges({"Chromosome": ["chr1", "chr1"],
...                   "Start": [5, 0], "End": [8, 5],
...                   "Strand": ["+", "-"]})
>>> r
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                5        8  +
      1  |    chr1                0        5  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">chr1\n")
>>> _ = tmp_handle.write("GTAATCAT\n")
>>> tmp_handle.close()
>>> seq = r.get_sequence("temp.fasta", sequence_column="Sequence")
>>> seq
0      CAT
1    ATTAC
Name: Sequence, dtype: object
>>> r["seq"] = seq
>>> r
  index  |    Chromosome      Start      End  Strand    seq
  int64  |    str             int64    int64  str       object
-------  ---  ------------  -------  -------  --------  --------
      0  |    chr1                5        8  +         CAT
      1  |    chr1                0        5  -         ATTAC
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> r.get_sequence("temp.fasta", use_strand=False)
0      CAT
1    GTAAT
Name: Sequence, dtype: object

Fetching full sequences of transcripts:

>>> gr = pr.PyRanges({"Chromosome": ['chr1'] * 5,
...                   "Start": [0, 9, 18, 9, 18], "End": [4, 13, 21, 13, 21],
...                   "Strand":['+', '-', '-', '-', '-'],
...                   "transcript": ['t1', 't2', 't2', 't4', 't5']})
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">chr1\n")
>>> _ = tmp_handle.write("AAACCCTTTGGGAAACCCTTTGGG\n")
>>> tmp_handle.close()
>>> seq = gr.get_sequence(path="temp.fasta", group_by='transcript')
>>> seq  
transcript
t1       AAAC
t2    AAATCCC
t4       TCCC
t5        AAA
Name: Sequence, dtype: object

With use_strand=False, all intervals are treated as if on the forward strand:

>>> seq2 = gr.get_sequence(path="temp.fasta", group_by='transcript', use_strand=False, sequence_column="Seq2")
>>> seq2 
transcript
t1       AAAC
t2    GGGATTT
t4       GGGA
t5        TTT
Name: Seq2, dtype: object

To write to a file in fasta format:

>>> with open('outfile.fasta', 'w') as fw:
...     nchars = 60
...     for oneid, oneseq in seq.items():
...         s = '\n'.join([oneseq[i:i+nchars] for i in range(0, len(oneseq), nchars)])
...         _bytes_written = fw.write(f'>{oneid}\n{s}\n')
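
How grouped sequences are assembled 5’->3’ can be sketched without pyfaidx. The helpers reverse_complement and spliced_sequence are hypothetical and not part of pyranges1; the sketch assumes non-overlapping exons on a single chromosome held in memory as a string.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exons, strand):
    """Concatenate exon sequences 5'->3' for one transcript.

    Hypothetical helper for illustration only; exons is a list of
    0-based half-open (start, end) tuples assumed not to overlap.
    """
    # Join exons in genomic order, then flip the whole product for "-".
    joined = "".join(chrom_seq[start:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


chrom = "AAACCCTTTGGGAAACCCTTTGGG"
print(spliced_sequence(chrom, [(0, 4)], "+"))             # AAAC
print(spliced_sequence(chrom, [(9, 13), (18, 21)], "-"))  # AAATCCC
```

Reverse-complementing the joined forward sequence is equivalent to reverse-complementing each exon and concatenating them in descending genomic order.
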

get_with_loc_columns(key: str | Iterable[str], *, preserve_loc_order: bool = False) PyRanges

Return a PyRanges with the requested columns, as well as the genome location columns.

Parameters:
  • key (str or iterable of str) – Column(s) to return.

  • preserve_loc_order (bool, default False) – Whether to preserve the order of the genome location columns. If False, the genome location columns will be moved to the left.

Returns:

PyRanges with the requested columns.

Return type:

PyRanges

See also

PyRanges.remove_nonloc_columns

remove all columns that are not genome location columns.

Examples

>>> gr = pr.PyRanges({"Chromosome": [1], "Start": [895], "Strand": ["+"],
...                   "Score": [1], "Score2": [2], "End": [1259]})
>>> gr
  index  |      Chromosome    Start  Strand      Score    Score2      End
  int64  |           int64    int64  str         int64     int64    int64
-------  ---  ------------  -------  --------  -------  --------  -------
      0  |               1      895  +               1         2     1259
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Genomic location columns are moved to the left by default:

>>> gr.get_with_loc_columns(["Score2", "Score", "Score2"])
  index  |      Chromosome    Start      End  Strand      Score2    Score    Score2
  int64  |           int64    int64    int64  str          int64    int64     int64
-------  ---  ------------  -------  -------  --------  --------  -------  --------
      0  |               1      895     1259  +                2        1         2
PyRanges with 1 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.get_with_loc_columns(["Score2", "Score"], preserve_loc_order=True)
  index  |      Chromosome    Start  Strand      Score2    Score      End
  int64  |           int64    int64  str          int64    int64    int64
-------  ---  ------------  -------  --------  --------  -------  -------
      0  |               1      895  +                2        1     1259
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.get_with_loc_columns(["Score2", "Score", "Score2"], preserve_loc_order=True)
Traceback (most recent call last):
...
ValueError: Duplicate keys not allowed when preserve_loc_order is True.
group_cumsum(group_by: str | Iterable[str] | None = None, *, use_strand: Literal['auto'] | bool = 'auto', cumsum_start_column: str | None = None, cumsum_end_column: str | None = None, keep_order: bool = True) PyRanges

Strand-aware cumulative length of every interval within each chromosome-level group.

For every chromosome (and, if supplied, every unique combination in group_by) the intervals are walked 5’→3’ on their own strand. Two new columns are added:

  • cumsum_start_column - running total before the interval

  • cumsum_end_column - running total after the interval

Parameters:
  • group_by (str or list, default None) – Additional column(s) that must match for two intervals to share a cumulative coordinate space. When None all intervals on the same chromosome are cumulated together.

  • cumsum_start_column (str | None, default None) – Name of the column that receives the running total before each interval. If None, the Start column is used.

  • cumsum_end_column (str | None, default None) – Name of the column that receives the running total after each interval. If None, the End column is used.

  • use_strand ({"auto", True, False}, default: "auto") – Whether negative strand intervals should be cumulated in descending order, meaning 5’ to 3’. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • keep_order (bool, default True) – Whether to output results in the original row order.

Returns:

Copy of self with the two cumulative-length columns appended (or written to Start and End when no column names are given).

Return type:

PyRanges

Examples

>>> gr = pr.example_data.ensembl_gtf.get_with_loc_columns(["Feature", "gene_name"])
>>> gr = gr[gr.Feature == "exon"]
>>> gr
  index  |      Chromosome    Start      End  Strand      Feature     gene_name
  int64  |        category    int64    int64  category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
      2  |               1    11868    12227  +           exon        DDX11L1
      3  |               1    12612    12721  +           exon        DDX11L1
      4  |               1    13220    14409  +           exon        DDX11L1
      5  |               1   112699   112804  -           exon        AL627309.1
      6  |               1   110952   111357  -           exon        AL627309.1
      8  |               1   133373   133723  -           exon        AL627309.1
      9  |               1   129054   129223  -           exon        AL627309.1
     10  |               1   120873   120932  -           exon        AL627309.1
PyRanges with 8 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.group_cumsum(group_by="gene_name")
  index  |      Chromosome    Start      End  Strand      Feature     gene_name
  int64  |        category    int64    int64  category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
      2  |               1        0      359  +           exon        DDX11L1
      3  |               1      359      468  +           exon        DDX11L1
      4  |               1      468     1657  +           exon        DDX11L1
      5  |               1      578      683  -           exon        AL627309.1
      6  |               1      683     1088  -           exon        AL627309.1
      8  |               1        0      350  -           exon        AL627309.1
      9  |               1      350      519  -           exon        AL627309.1
     10  |               1      519      578  -           exon        AL627309.1
PyRanges with 8 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
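
The running totals in the example above can be reproduced for one group with a short plain-Python sketch. The helper cumulative_coordinates is hypothetical and not part of pyranges1; it handles a single group on a single strand.

```python
def cumulative_coordinates(intervals, strand="+"):
    """Map each (start, end) to its (cumsum_start, cumsum_end).

    Hypothetical helper for illustration only; not part of pyranges1.
    """
    # Walk 5'->3': ascending on "+", descending on "-".
    ordered = sorted(intervals, reverse=(strand == "-"))
    result = {}
    total = 0
    for start, end in ordered:
        length = end - start
        result[(start, end)] = (total, total + length)
        total += length
    return result


ddx11l1 = [(11868, 12227), (12612, 12721), (13220, 14409)]
print(cumulative_coordinates(ddx11l1, "+"))
# {(11868, 12227): (0, 359), (12612, 12721): (359, 468), (13220, 14409): (468, 1657)}
```
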
groupby(*args, **kwargs) PyRangesDataFrameGroupBy

Group the PyRanges, analogous to pandas DataFrame.groupby, returning a PyRangesDataFrameGroupBy.

property has_strand: bool

Return whether PyRanges has a strand column.

Does not check whether the strand column contains valid values.

intersect_overlaps(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', *, multiple: Literal['first', 'all', 'last', 'contained'] = 'all', match_by: str | Iterable[str] | None = None, preserve_input_order: bool = True) PyRanges

Return overlapping subintervals.

Returns the segments of the intervals in self which overlap with those in other. When multiple intervals in ‘other’ overlap with the same interval in self, several subintervals may be reported; see the ‘multiple’ argument for details.

Parameters:
  • other (PyRanges) – PyRanges to find overlaps with.

  • multiple ({"all", "first", "last"}, default "all") – What intervals to report when multiple intervals in ‘other’ overlap with the same interval in self. The default “all” reports all overlapping subintervals, which will have duplicate indices. “first” reports, for each interval in self, only the overlapping subinterval with the smallest Start in ‘other’. “last” reports only the overlapping subinterval with the biggest End in ‘other’.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A PyRanges with overlapping intervals. Input index is preserved, but may contain duplicates.

Return type:

PyRanges

See also

PyRanges.overlap

report overlapping (unmodified) intervals

PyRanges.subtract_overlaps

report non-overlapping subintervals

PyRanges.set_intersect_overlaps

set-intersect PyRanges

Examples

>>> r1 = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [5, 20, 40],"End": [10, 30, 50], "ID": ["a", "b", "c"]})
>>> r1
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                5       10  a
      1  |    chr1               20       30  b
      2  |    chr1               40       50  c
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r2 = pr.PyRanges({"Chromosome": ["chr1"] * 4, "Start": [7, 18, 25, 28], "End": [9, 22, 33, 32]})
>>> r2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                7        9
      1  |    chr1               18       22
      2  |    chr1               25       33
      3  |    chr1               28       32
PyRanges with 4 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r1.intersect_overlaps(r2)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                7        9  a
      1  |    chr1               20       22  b
      1  |    chr1               25       30  b
      1  |    chr1               28       30  b
PyRanges with 4 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes.
>>> r1.intersect_overlaps(r2, multiple="first")
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                7        9  a
      1  |    chr1               20       22  b
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r1.intersect_overlaps(r2, multiple="last")
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                7        9  a
      1  |    chr1               28       30  b
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
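
The subintervals reported with multiple="all" can be sketched in plain Python. The helpers intersect and intersect_all are hypothetical and not part of pyranges1; they ignore chromosome and strand and assume 0-based half-open coordinates.

```python
def intersect(a, b):
    """Return the shared part of two (start, end) intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def intersect_all(query, others):
    """Return every non-empty intersection of query with others.

    Hypothetical helper for illustration only; not part of pyranges1.
    """
    pieces = (intersect(query, other) for other in others)
    return [piece for piece in pieces if piece is not None]


r2 = [(7, 9), (18, 22), (25, 33), (28, 32)]
print(intersect_all((5, 10), r2))   # [(7, 9)]
print(intersect_all((20, 30), r2))  # [(20, 22), (25, 30), (28, 30)]
print(intersect_all((40, 50), r2))  # []
```
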
join_overlaps(other: PyRanges, *, multiple: Literal['first', 'all', 'last', 'contained'] = 'all', strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', join_type: Literal['inner', 'left', 'outer', 'right'] = 'inner', match_by: str | Iterable[str] | None = None, contained_intervals_only: bool = False, slack: int = 0, suffix: str = '_b', report_overlap_column: str | None = None, preserve_input_order: bool = True) PyRanges

Join PyRanges based on genomic overlap.

Find pairs of overlapping intervals between two PyRanges (self and other) and combine their columns. Each row in the returned PyRanges contains columns of both intervals, including their coordinates. By default, intervals without overlap are not reported.

Parameters:
  • other (PyRanges) – PyRanges to join.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • join_type ({"inner", "left", "right", "outer"}, default "inner") – How to handle intervals without overlap. “inner” means only keep overlapping intervals. “left” keeps all intervals in self, “right” keeps all intervals in other, “outer” keeps both. For types other than “inner”, intervals in self without overlaps will have NaN in columns from other, and/or vice versa.

  • multiple ({"all", "first", "last"}, default "all") – What intervals to report when multiple intervals in ‘other’ overlap with the same interval in self. The default “all” reports all overlapping subintervals, which will have duplicate indices. “first” reports, for each interval in self, only the overlapping subinterval with the smallest Start in ‘other’. “last” reports only the overlapping subinterval with the biggest End in ‘other’.

  • contained_intervals_only (bool, default False) – Whether to report only intervals that are entirely contained in an interval of ‘other’.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be joined.

  • report_overlap_column (str or None, default None) – If provided, report the amount of overlap in base pairs in a column with this name.

  • slack (int, default 0) – Before joining, temporarily extend intervals in self by this much on both ends.

  • suffix (str or tuple, default "_b") – Suffix appended to the names of columns from other that are also present in self.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A PyRanges with the columns of self followed by the columns of other.

Return type:

PyRanges

Note

The indices of the two input PyRanges are not preserved in output. The chromosome column from other will never be reported as it is always the same as in self. Whether the strand column from other is reported depends on the strand_behavior.

See also

PyRanges.combine_interval_columns

give joined PyRanges new coordinates

PyRanges.compute_interval_metrics

compute overlap metrics in joined PyRanges

Examples

>>> f1 = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1'],
...                   'Start': [3, 8, 5],
...                   'End': [6, 9, 7],
...                   'Name': ['interval1', 'interval3', 'interval2']})
>>> f1
  index  |    Chromosome      Start      End  Name
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  ---------
      0  |    chr1                3        6  interval1
      1  |    chr1                8        9  interval3
      2  |    chr1                5        7  interval2
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> f2 = pr.PyRanges({'Chromosome': ['chr1', 'chr1'],
...                   'Start': [1, 6],
...                   'End': [2, 7],
...                   'Name': ['a', 'b']})
>>> f2
  index  |    Chromosome      Start      End  Name
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  ------
      0  |    chr1                1        2  a
      1  |    chr1                6        7  b
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> f1.join_overlaps(f2)
  index  |    Chromosome      Start      End  Name         Start_b    End_b  Name_b
  int64  |    str             int64    int64  str            int64    int64  str
-------  ---  ------------  -------  -------  ---------  ---------  -------  --------
      2  |    chr1                5        7  interval2          6        7  b
PyRanges with 1 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes.

With join types other than “inner”, rows without a match have NaN in the columns from the missing side, so the affected coordinate columns are converted to float64:

>>> f1.join_overlaps(f2, join_type="left")
  index  |    Chromosome      Start      End  Name         Start_b      End_b  Name_b
  int64  |    str             int64    int64  str          float64    float64  str
-------  ---  ------------  -------  -------  ---------  ---------  ---------  --------
      2  |    chr1                5        7  interval2          6          7  b
      0  |    chr1                3        6  interval1        nan        nan  nan
      1  |    chr1                8        9  interval3        nan        nan  nan
PyRanges with 3 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes.
>>> f1.join_overlaps(f2, join_type="outer")
  index  |    Chromosome        Start        End  Name         Start_b      End_b  Name_b
  int64  |    str             float64    float64  str          float64    float64  str
-------  ---  ------------  ---------  ---------  ---------  ---------  ---------  --------
      1  |    chr1                  5          7  interval2          6          7  b
      0  |    chr1                  3          6  interval1        nan        nan  nan
      1  |    chr1                  8          9  interval3        nan        nan  nan
      0  |    nan                 nan        nan  nan                1          2  a
PyRanges with 4 rows, 7 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are nan. See indexes: 0
>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr2', 'chr1', 'chr3'],
...                   'Start': [1, 4, 10, 0],
...                   'End': [3, 9, 11, 1],
...                   'ID': ['a', 'b', 'c', 'd']})
>>> gr
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  a
      1  |    chr2                4        9  b
      2  |    chr1               10       11  c
      3  |    chr3                0        1  d
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 3 chromosomes.
>>> gr2 = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1'],
...                    'Start': [2, 2, 1],
...                    'End': [3, 9, 10],
...                    'ID': ['a', 'b', 'c']})
>>> gr2
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                2        3  a
      1  |    chr1                2        9  b
      2  |    chr1                1       10  c
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.join_overlaps(gr2)
  index  |    Chromosome      Start      End  ID       Start_b    End_b  ID_b
  int64  |    str             int64    int64  str        int64    int64  str
-------  ---  ------------  -------  -------  -----  ---------  -------  ------
      0  |    chr1                1        3  a              2        3  a
      0  |    chr1                1        3  a              2        9  b
      0  |    chr1                1        3  a              1       10  c
PyRanges with 3 rows, 7 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes.
>>> gr.join_overlaps(gr2, match_by="ID")
  index  |    Chromosome      Start      End  ID       Start_b    End_b
  int64  |    str             int64    int64  str        int64    int64
-------  ---  ------------  -------  -------  -----  ---------  -------
      0  |    chr1                1        3  a              2        3
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes.
>>> bad = f1.join_overlaps(f2, join_type="right")
>>> bad
  index  |    Chromosome        Start        End  Name         Start_b    End_b  Name_b
  int64  |    str             float64    float64  str            int64    int64  str
-------  ---  ------------  ---------  ---------  ---------  ---------  -------  --------
      1  |    chr1                  5          7  interval2          6        7  b
      0  |    nan                 nan        nan  nan                1        2  a
PyRanges with 2 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are nan. See indexes: 0

With slack 1, bookended features are joined (see the row with index 0):

>>> f1.join_overlaps(f2, slack=1)
  index  |    Chromosome      Start      End  Name         Start_b    End_b  Name_b
  int64  |    str             int64    int64  str            int64    int64  str
-------  ---  ------------  -------  -------  ---------  ---------  -------  --------
      0  |    chr1                3        6  interval1          6        7  b
      2  |    chr1                5        7  interval2          6        7  b
PyRanges with 2 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes.
>>> f1.join_overlaps(f2, report_overlap_column="Overlap")
  index  |    Chromosome      Start      End  Name         Start_b    End_b  Name_b      Overlap
  int64  |    str             int64    int64  str            int64    int64  str           int64
-------  ---  ------------  -------  -------  ---------  ---------  -------  --------  ---------
      2  |    chr1                5        7  interval2          6        7  b                 1
PyRanges with 1 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes.

Allowing slack in overlaps may result in 0 or negative Overlap values:

>>> f1.join_overlaps(f2, report_overlap_column="Overlap", slack=2)
  index  |    Chromosome      Start      End  Name         Start_b    End_b  Name_b      Overlap
  int64  |    str             int64    int64  str            int64    int64  str           int64
-------  ---  ------------  -------  -------  ---------  ---------  -------  --------  ---------
      0  |    chr1                3        6  interval1          1        2  a                -1
      0  |    chr1                3        6  interval1          6        7  b                 0
      1  |    chr1                8        9  interval3          6        7  b                -1
      2  |    chr1                5        7  interval2          6        7  b                 1
PyRanges with 4 rows, 8 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes.
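
The Overlap values above are consistent with measuring overlap on the original half-open coordinates, while slack only affects which pairs are joined. The following is a minimal pure-Python sketch of that arithmetic, not the library's implementation; the helper names overlap_bp and is_joined are hypothetical.

```python
def overlap_bp(start_a, end_a, start_b, end_b):
    """Signed overlap, in base pairs, of two half-open intervals [start, end)."""
    return min(end_a, end_b) - max(start_a, start_b)


def is_joined(start_a, end_a, start_b, end_b, slack=0):
    """True if the intervals overlap once the first one is extended by slack on both ends."""
    return overlap_bp(start_a - slack, end_a + slack, start_b, end_b) > 0


# interval1 (3, 6) and b (6, 7) are bookended: zero overlap, joined only with slack.
print(overlap_bp(3, 6, 6, 7))          # 0
print(is_joined(3, 6, 6, 7))           # False
print(is_joined(3, 6, 6, 7, slack=1))  # True

# interval3 (8, 9) and b (6, 7) are one base apart: joined with slack=2, overlap -1.
print(overlap_bp(8, 9, 6, 7))          # -1
print(is_joined(8, 9, 6, 7, slack=1))  # False
print(is_joined(8, 9, 6, 7, slack=2))  # True
```

A value of 0 therefore marks bookended intervals, and a negative value gives the size of the gap between the two intervals.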
property length: int

Return the total length of the intervals.

See also

PyRanges.lengths

return the length of each interval

Examples

>>> gr = pr.example_data.f1
>>> gr
  index  |    Chromosome      Start      End  Name         Score  Strand
  int64  |    category        int64    int64  str          int64  category
-------  ---  ------------  -------  -------  ---------  -------  ----------
      0  |    chr1                3        6  interval1        0  +
      1  |    chr1                5        7  interval2        0  -
      2  |    chr1                8        9  interval3        0  +
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.length
6

To find the length of the genome covered by the intervals, merge overlapping intervals first:

>>> gr.merge_overlaps(use_strand=False).length
5
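
The difference between the two numbers comes from the base shared by interval1 and interval2, which the plain sum counts twice. A minimal pure-Python sketch of the arithmetic follows; total_length and merge_intervals are hypothetical helpers, not part of pyranges1, and the merge here only joins strictly overlapping intervals.

```python
def total_length(intervals):
    """Sum of End - Start over all half-open intervals."""
    return sum(end - start for start, end in intervals)


def merge_intervals(intervals):
    """Merge strictly overlapping half-open intervals, ignoring strand."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


intervals = [(3, 6), (5, 7), (8, 9)]
print(total_length(intervals))                   # 6
print(merge_intervals(intervals))                # [(3, 7), (8, 9)]
print(total_length(merge_intervals(intervals)))  # 5
```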
lengths() Series

Return the length of each interval.

Return type:

pd.Series or dict of pd.Series with the lengths of each interval.

See also

PyRanges.length

return the total length of all intervals combined

Examples

>>> gr = pr.example_data.f1
>>> gr
  index  |    Chromosome      Start      End  Name         Score  Strand
  int64  |    category        int64    int64  str          int64  category
-------  ---  ------------  -------  -------  ---------  -------  ----------
      0  |    chr1                3        6  interval1        0  +
      1  |    chr1                5        7  interval2        0  -
      2  |    chr1                8        9  interval3        0  +
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.lengths()
0    3
1    2
2    1
dtype: int64
>>> gr["Length"] = gr.lengths()
>>> gr
  index  |    Chromosome      Start      End  Name         Score  Strand        Length
  int64  |    category        int64    int64  str          int64  category       int64
-------  ---  ------------  -------  -------  ---------  -------  ----------  --------
      0  |    chr1                3        6  interval1        0  +                  3
      1  |    chr1                5        7  interval2        0  -                  2
      2  |    chr1                8        9  interval3        0  +                  1
PyRanges with 3 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
property loc_columns: list[str]

Return the names of genomic location columns of this PyRanges (Chromosome, and Strand if present).

property loci: LociGetter

Get or set rows based on genomic location.

Parameters:

key – Genomic location: one or more of Chromosome, Strand, and Range (i.e. Start:End). When a Range is specified, only rows that overlap with it are returned.

Returns:

PyRanges view with rows matching the location.

Return type:

PyRanges

Warning

When strand is provided but chromosome is not, only valid strand values (‘+’, ‘-’) are searched for. Use the complete .loci[Chromosome, Strand, Range] syntax to search for non-genomic strands. Each item can be None to match all values.

Examples

>>> import pyranges1 as pr
>>> gr = pr.example_data.ensembl_gtf.get_with_loc_columns(["gene_id", "gene_name"])
>>> gr
index    |    Chromosome    Start    End      Strand      gene_id          gene_name
int64    |    category      int64    int64    category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
0        |    1             11868    14409    +           ENSG00000223972  DDX11L1
1        |    1             11868    14409    +           ENSG00000223972  DDX11L1
2        |    1             11868    12227    +           ENSG00000223972  DDX11L1
3        |    1             12612    12721    +           ENSG00000223972  DDX11L1
...      |    ...           ...      ...      ...         ...              ...
7        |    1             120724   133723   -           ENSG00000238009  AL627309.1
8        |    1             133373   133723   -           ENSG00000238009  AL627309.1
9        |    1             129054   129223   -           ENSG00000238009  AL627309.1
10       |    1             120873   120932   -           ENSG00000238009  AL627309.1
PyRanges with 11 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.loci[1, "+", 12227:13000]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      0  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      1  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      3  |               1    12612    12721  +           ENSG00000223972  DDX11L1
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.loci[1, 14408:120000]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      0  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      1  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      4  |               1    13220    14409  +           ENSG00000223972  DDX11L1
      5  |               1   112699   112804  -           ENSG00000238009  AL627309.1
      6  |               1   110952   111357  -           ENSG00000238009  AL627309.1
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.loci[1, "-"]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      5  |               1   112699   112804  -           ENSG00000238009  AL627309.1
      6  |               1   110952   111357  -           ENSG00000238009  AL627309.1
      7  |               1   120724   133723  -           ENSG00000238009  AL627309.1
      8  |               1   133373   133723  -           ENSG00000238009  AL627309.1
      9  |               1   129054   129223  -           ENSG00000238009  AL627309.1
     10  |               1   120873   120932  -           ENSG00000238009  AL627309.1
PyRanges with 6 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.loci["+"]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      0  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      1  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      2  |               1    11868    12227  +           ENSG00000223972  DDX11L1
      3  |               1    12612    12721  +           ENSG00000223972  DDX11L1
      4  |               1    13220    14409  +           ENSG00000223972  DDX11L1
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.loci[11000:12000]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      0  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      1  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      2  |               1    11868    12227  +           ENSG00000223972  DDX11L1
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Before matching, loci attempts to convert the Chromosome column to the type of the provided key:

>>> gr.loci["1", 11000:12000]
  index  |      Chromosome    Start      End  Strand      gene_id          gene_name
  int64  |        category    int64    int64  category    str              str
-------  ---  ------------  -------  -------  ----------  ---------------  -----------
      0  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      1  |               1    11868    14409  +           ENSG00000223972  DDX11L1
      2  |               1    11868    12227  +           ENSG00000223972  DDX11L1
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

When the requested chromosome, strand, or range does not exist, an empty PyRanges is returned:

>>> gr.loci["3"]
index    |    Chromosome    Start    End      Strand      gene_id    gene_name
int64    |    category      int64    int64    category    str        str
-------  ---  ------------  -------  -------  ----------  ---------  -----------
PyRanges with 0 rows, 6 columns, and 1 index columns.
Contains 0 chromosomes and 0 strands.
>>> gr2 = pr.PyRanges({"Chromosome": ["chr1", "chr2"], "Start": [1, 2], "End": [4, 5],
...                    "Strand": [".", "+"], "Score":[10, 12], "Id":["a", "b"]})
>>> gr2.loci["chr2"]
  index  |    Chromosome      Start      End  Strand      Score  Id
  int64  |    str             int64    int64  str         int64  str
-------  ---  ------------  -------  -------  --------  -------  -----
      1  |    chr2                2        5  +              12  b
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

The loci operator can also be used for assignment, using a same-sized PyRanges:

>>> gr2.loci["chr2"] = gr2.loci["chr2"].copy().assign(Chromosome="xxx")
>>> gr2
  index  |    Chromosome      Start      End  Strand      Score  Id
  int64  |    str             int64    int64  str         int64  str
-------  ---  ------------  -------  -------  --------  -------  -----
      0  |    chr1                1        4  .              10  a
      1  |    xxx                 2        5  +              12  b
PyRanges with 2 rows, 6 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands (including non-genomic strands: .).

For more flexible assignment, you can employ Pandas loc using the index of the loci output:

>>> c = gr2.loci["chr1"]
>>> gr2.loc[c.index, "Score"] = 100
>>> gr2
  index  |    Chromosome      Start      End  Strand      Score  Id
  int64  |    str             int64    int64  str         int64  str
-------  ---  ------------  -------  -------  --------  -------  -----
      0  |    chr1                1        4  .             100  a
      1  |    xxx                 2        5  +              12  b
PyRanges with 2 rows, 6 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands (including non-genomic strands: .).

When providing only strand, or strand and a slice, only valid genomic strands (i.e. ‘+’, ‘-’) are searched for:

>>> gr2.loci['+']
  index  |    Chromosome      Start      End  Strand      Score  Id
  int64  |    str             int64    int64  str         int64  str
-------  ---  ------------  -------  -------  --------  -------  -----
      1  |    xxx                 2        5  +              12  b
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr2.loci['.']
index    |    Chromosome    Start    End      Strand    Score    Id
int64    |    str           int64    int64    str       int64    str
-------  ---  ------------  -------  -------  --------  -------  -----
PyRanges with 0 rows, 6 columns, and 1 index columns.
Contains 0 chromosomes and 0 strands.

Use None to match all values. This is useful to force the unambiguous syntax, which can match non-genomic strands:

>>> gr2.loci[None, '.']
  index  |    Chromosome      Start      End  Strand      Score  Id
  int64  |    str             int64    int64  str         int64  str
-------  ---  ------------  -------  -------  --------  -------  -----
      0  |    chr1                1        4  .             100  a
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands (including non-genomic strands: .).

Do not try to use loci to access columns: the key is interpreted as a chromosome, resulting in empty output:

>>> gr2.loci["Score"]
index    |    Chromosome    Start    End      Strand    Score    Id
int64    |    str           int64    int64    str       int64    str
-------  ---  ------------  -------  -------  --------  -------  -----
PyRanges with 0 rows, 6 columns, and 1 index columns.
Contains 0 chromosomes and 0 strands.
>>> gr2.loci[["Score", "Id"]]
Traceback (most recent call last):
...
TypeError: The loci accessor does not accept a list. If you meant to retrieve columns, use get_with_loc_columns instead.
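
The range part of a loci key selects rows by half-open overlap, which is why the interval ending at 12227 is absent from the result of gr.loci[1, "+", 12227:13000] above. A rough pure-Python sketch of that selection rule follows; select_loci is a hypothetical helper operating on plain dictionaries, and it does not reproduce the type conversion or strand disambiguation performed by the real accessor.

```python
def select_loci(rows, chromosome=None, strand=None, start=None, end=None):
    """Return the rows matching every criterion that is not None."""
    selected = []
    for row in rows:
        if chromosome is not None and row["Chromosome"] != chromosome:
            continue
        if strand is not None and row["Strand"] != strand:
            continue
        if start is not None and end is not None:
            # Half-open overlap test between [Start, End) and [start, end).
            if not (row["Start"] < end and row["End"] > start):
                continue
        selected.append(row)
    return selected


rows = [
    {"Chromosome": "1", "Start": 11868, "End": 14409, "Strand": "+"},
    {"Chromosome": "1", "Start": 11868, "End": 12227, "Strand": "+"},
    {"Chromosome": "1", "Start": 12612, "End": 12721, "Strand": "+"},
    {"Chromosome": "1", "Start": 112699, "End": 112804, "Strand": "-"},
]
hits = select_loci(rows, chromosome="1", strand="+", start=12227, end=13000)
print([(row["Start"], row["End"]) for row in hits])
# [(11868, 14409), (12612, 12721)]
```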
make_strand_valid() PyRanges

Make the strand information in PyRanges valid.

Convert all invalid Strand values (any value other than “+” or “-”) to “+”. If the Strand column is not present, add it with all values set to “+”.

Returns:

PyRanges with valid strand information.

Return type:

PyRanges

See also

PyRanges.strand_valid

whether PyRanges has valid strand info

PyRanges.remove_strand

remove the Strand column from PyRanges

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
...                   'End': [5, 8], 'Strand': ['-', '.']})
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  -
      1  |    chr1                6        8  .
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands (including non-genomic strands: .).
>>> gr.strand_valid
False
>>> gr.make_strand_valid()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  -
      1  |    chr1                6        8  +
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr2 = pr.PyRanges({'Chromosome': ['chr1', 'chr1'], 'Start': [5, 22],
...                    'End': [15, 30]})
>>> gr2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                5       15
      1  |    chr1               22       30
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr2.make_strand_valid()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                5       15  +
      1  |    chr1               22       30  +
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
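
The rule applied in both examples can be sketched in a few lines of pure Python. This is an illustration of the documented behavior on plain lists, not the library's code; make_valid and is_strand_valid are hypothetical names, and None stands in for a missing Strand column.

```python
VALID_STRANDS = {"+", "-"}


def is_strand_valid(strands):
    """True if every strand value is '+' or '-'."""
    return all(strand in VALID_STRANDS for strand in strands)


def make_valid(strands, n_rows):
    """Replace invalid strand values with '+'; build an all-'+' column if strands is None."""
    if strands is None:
        return ["+"] * n_rows
    return [strand if strand in VALID_STRANDS else "+" for strand in strands]


print(is_strand_valid(["-", "."]))       # False
print(make_valid(["-", "."], n_rows=2))  # ['-', '+']
print(make_valid(None, n_rows=2))        # ['+', '+']
```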
map_to_global(gr: PyRanges, global_on: str, *, local_on: str = 'Chromosome', keep_id: bool = False, keep_loc: bool = False, pep_to_cds: bool = False) PyRanges

Map intervals from a local reference frame (e.g. transcript) to global coordinates (e.g. genomic).

The self PyRanges object is local in the sense that its Chromosome column stores an identifier (e.g. a transcript ID), so that its interval coordinates are expressed relative to that identifier. The global object gr supplies the absolute genomic coordinates of every interval group (e.g. annotation of transcripts, potentially with multiple exons). The function returns the self intervals in the reference system of the global object.

The strand of returned intervals is the product of the strands of the corresponding local and global intervals (e.g. +/- => -).

Unused rows in gr (identifiers never referenced by self) are ignored.

Parameters:
  • gr (PyRanges) – Intervals in global reference system (e.g. transcript annotation in genomic coordinates).

  • global_on (str) – Column in gr that holds the identifiers contained in self.Chromosome.

  • local_on (str, default "Chromosome") – Column in self that holds the identifier to be lifted. Change this if your identifiers live in a different column.

  • keep_id (bool, default False) – If True, keep the identifier column (Chromosome in self) in the output.

  • keep_loc (bool, default False) – If True, keep the local location columns (Start, End, Strand) in the output.

  • pep_to_cds (bool, default False) – If True, the function will assume that the intervals in self are peptide coordinates and those in gr are CDS coordinates. Thus, self coordinates are multiplied by 3 before mapping to global.

Returns:

Intervals in genomic coordinates, maintaining order, index, and metadata columns of self.

Return type:

PyRanges

Warning

A single local interval will give rise to multiple intervals in the output if it spans discontinuities (i.e. introns) in global coordinates. This generates duplicated indices. To remove them, call the pandas DataFrame method reset_index on the output.

Examples

>>> gr = pr.PyRanges(pd.DataFrame({
...     "Chromosome": ["chr1","chr1","chr1","chr1"],
...     "Start": [100, 300, 1000, 1100],
...     "End": [200, 400, 1050, 1200],
...     "Strand": ["+","+", "-", "-"],
...     "transcript_id": ["tx1","tx1","tx2","tx2"],
... }))
>>> tr = pr.PyRanges(pd.DataFrame({
...     "Chromosome": ["tx1","tx1","tx1","tx2","tx2"],
...     "Start": [0, 120, 160, 0, 100],
...     "End": [80, 140, 170, 20, 130],
...     "Strand": ["-","-", "+", "+", "+"],
...     "label": ["a","b","c","d","e"],
... }))
>>> gr
  index  |    Chromosome      Start      End  Strand    transcript_id
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ---------------
      0  |    chr1              100      200  +         tx1
      1  |    chr1              300      400  +         tx1
      2  |    chr1             1000     1050  -         tx2
      3  |    chr1             1100     1200  -         tx2
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> tr
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    tx1                 0       80  -         a
      1  |    tx1               120      140  -         b
      2  |    tx1               160      170  +         c
      3  |    tx2                 0       20  +         d
      4  |    tx2               100      130  +         e
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
>>> tr.map_to_global(gr, "transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1              100      180  -         a
      1  |    chr1              320      340  -         b
      2  |    chr1              360      370  +         c
      3  |    chr1             1180     1200  -         d
      4  |    chr1             1020     1050  -         e
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> tr.map_to_global(gr, "transcript_id", keep_id=True)
  index  |    Chromosome      Start      End  Strand    label    transcript_id
  int64  |    str             int64    int64  str       str      str
-------  ---  ------------  -------  -------  --------  -------  ---------------
      0  |    chr1              100      180  -         a        tx1
      1  |    chr1              320      340  -         b        tx1
      2  |    chr1              360      370  +         c        tx1
      3  |    chr1             1180     1200  -         d        tx2
      4  |    chr1             1020     1050  -         e        tx2
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> tr.map_to_global(gr, "transcript_id", keep_loc=True)
  index  |    Chromosome      Start      End  Strand    label      Start_local    End_local  Strand_local
  int64  |    str             int64    int64  str       str              int64        int64  str
-------  ---  ------------  -------  -------  --------  -------  -------------  -----------  --------------
      0  |    chr1              100      180  -         a                    0           80  -
      1  |    chr1              320      340  -         b                  120          140  -
      2  |    chr1              360      370  +         c                  160          170  +
      3  |    chr1             1180     1200  -         d                    0           20  +
      4  |    chr1             1020     1050  -         e                  100          130  +
PyRanges with 5 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Metadata columns are preserved:

>>> tr.assign(tag=7).map_to_global(gr, "transcript_id").tag.unique()
array([7])

A local interval that spans an exon junction is split; its index is duplicated in the output.

>>> tr2 = pr.PyRanges(pd.DataFrame({
...     "Chromosome":["tx1","tx2","tx2"],
...     "Start": [90, 80, 50],
...     "End": [110, 120, 120],
...     "Strand": ["+","+", "-"],
...     "label": ["q","w","e"],
... }))
>>> tr2.map_to_global(gr, "transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1              190      200  +         q
      0  |    chr1              300      310  +         q
      1  |    chr1             1030     1050  -         w
      1  |    chr1             1100     1120  -         w
      2  |    chr1             1030     1050  +         e
      2  |    chr1             1100     1150  +         e
PyRanges with 6 rows, 5 columns, and 1 index columns (with 3 index duplicates).
Contains 1 chromosomes and 2 strands.

A local interval longer than its transcript is truncated to the portion that fits.

>>> tr3 = pr.PyRanges(pd.DataFrame({
...     "Chromosome":["tx1"], "Start":[20], "End":[1000], "Strand":["+"]
... }))
>>> tr3.map_to_global(gr, "transcript_id")
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              120      200  +
      0  |    chr1              300      400  +
PyRanges with 2 rows, 4 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes and 1 strands.
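
The splitting and truncation shown above can be reproduced with a small pure-Python sketch. It is not the library's implementation: the function name local_to_global is hypothetical, and the exon layout used in the checks (tx1: 100-200 and 300-400 on +; tx2: 1000-1050 and 1100-1200 on -) is an assumption consistent with the outputs shown.

```python
def local_to_global(local_start, local_end, exons, strand):
    """Map a half-open transcript-local interval onto genomic coordinates.

    exons: list of (start, end) half-open genomic intervals of one transcript.
    Returns the genomic (start, end) pieces sorted by genomic start.
    """
    # Walk the exons in 5'->3' order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    pieces = []
    offset = 0  # local coordinate of the first base of the current exon
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start
        # Portion of the local interval that falls inside this exon.
        lo = max(local_start, offset)
        hi = min(local_end, offset + ex_len)
        if lo < hi:
            if strand == "+":
                pieces.append((ex_start + lo - offset, ex_start + hi - offset))
            else:
                # On "-" the local axis runs from the exon end downwards.
                pieces.append((ex_end - (hi - offset), ex_end - (lo - offset)))
        offset += ex_len
    # Anything beyond the last exon is never emitted, i.e. it is truncated.
    return sorted(pieces)


tx1 = [(100, 200), (300, 400)]
tx2 = [(1000, 1050), (1100, 1200)]
print(local_to_global(90, 110, tx1, "+"))   # junction-spanning interval "q"
print(local_to_global(80, 120, tx2, "-"))   # interval "w" on a "-" transcript
print(local_to_global(20, 1000, tx1, "+"))  # longer than the transcript
```

Each call returns the same Start/End pairs as the corresponding rows above.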

Protein positions can be mapped to genomic coordinates through the CDS. First, build some protein-based features:

>>> gr = pr.example_data.ncbi_gff
>>> grc = gr[gr.Feature == "CDS"].get_with_loc_columns('Parent')
>>> genome_file = pr.example_data.files['ncbi.fasta']
>>> cdss = grc.get_sequence(genome_file, group_by='Parent').str.upper()
>>> prots = pr.seqs.translate(cdss)
>>> z = [(seq_id, i, 'R') for seq_id, seq in prots.items() for i, char in enumerate(seq)  if char == 'R']
>>> z = pd.DataFrame(z, columns=['ID', 'Start', 'AminoAcid']).assign(End=lambda df: df.Start + 1 )
>>> arginines_pos =  pr.PyRanges(z.rename(columns={"ID": "Chromosome"}))
>>> arginines_pos
index    |    Chromosome             Start    AminoAcid    End
int64    |    str                    int64    str          int64
-------  ---  ---------------------  -------  -----------  -------
0        |    rna-DGYR_LOCUS12552    7        R            8
1        |    rna-DGYR_LOCUS12552    15       R            16
2        |    rna-DGYR_LOCUS12552    25       R            26
3        |    rna-DGYR_LOCUS12552    89       R            90
...      |    ...                    ...      ...          ...
158      |    rna-DGYR_LOCUS14095-2  271      R            272
159      |    rna-DGYR_LOCUS14095-2  309      R            310
160      |    rna-DGYR_LOCUS14095-2  320      R            321
161      |    rna-DGYR_LOCUS14095-2  327      R            328
PyRanges with 162 rows, 4 columns, and 1 index columns.
Contains 17 chromosomes.
>>> genome_arginine_pos = arginines_pos.map_to_global(grc, "Parent", pep_to_cds=True, keep_id=True)
>>> genome_arginine_pos['codon'] = genome_arginine_pos.get_sequence(genome_file).str.upper()
>>> genome_arginine_pos
index    |    Chromosome         Start    AminoAcid    End      Parent                 Strand      codon
int64    |    category           int64    str          int64    str                    category    object
-------  ---  -----------------  -------  -----------  -------  ---------------------  ----------  --------
0        |    CAJFCJ010000025.1  2671     R            2674     rna-DGYR_LOCUS12552    -           CGT
1        |    CAJFCJ010000025.1  2647     R            2650     rna-DGYR_LOCUS12552    -           AGA
2        |    CAJFCJ010000025.1  2617     R            2620     rna-DGYR_LOCUS12552    -           AGA
3        |    CAJFCJ010000025.1  2369     R            2372     rna-DGYR_LOCUS12552    -           AGA
...      |    ...                ...      ...          ...      ...                    ...         ...
159      |    CAJFCJ010000097.1  52993    R            52996    rna-DGYR_LOCUS14095-2  +           AGA
160      |    CAJFCJ010000097.1  53026    R            53027    rna-DGYR_LOCUS14095-2  +           A
160      |    CAJFCJ010000097.1  53339    R            53341    rna-DGYR_LOCUS14095-2  +           GA
161      |    CAJFCJ010000097.1  53359    R            53362    rna-DGYR_LOCUS14095-2  +           AGA
PyRanges with 164 rows, 7 columns, and 1 index columns (with 2 index duplicates).
Contains 3 chromosomes and 2 strands.
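
The rows above are three nucleotides wide, which is consistent with each amino-acid position being scaled to codon coordinates before mapping. A minimal sketch of that conversion, assuming 0-based half-open protein coordinates (the helper name peptide_to_cds_interval is hypothetical and not part of the library):

```python
def peptide_to_cds_interval(aa_start, aa_end):
    """Convert a half-open amino-acid interval to CDS nucleotide coordinates."""
    # Each residue corresponds to one codon, i.e. three nucleotides.
    return 3 * aa_start, 3 * aa_end


# The arginine at protein position 7 (interval 7-8) covers one codon in the CDS.
codon = peptide_to_cds_interval(7, 8)
print(codon, codon[1] - codon[0])
```

When a codon straddles an exon junction, the resulting CDS interval is split into two genomic pieces, as for index 160 above (1 nt plus 2 nt).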
map_to_local(ref, ref_on, *, match_by: str | Iterable[str] | None = None, keep_chrom: bool = False, keep_loc: bool = False) PyRanges

Map global genomic intervals (self) onto a local frame defined by reference ranges ref.

Both self and ref are given in genomic coordinates. ref holds the layout of every local entity (typically the exons that compose each transcript). Each interval in self that overlaps with ref is re-based so that the returned Start/End are measured from the 5’ end of the entire transcript of ref, with introns removed. For instance, if the first exon of a ref transcript is 100 nt long, the first base of the second exon has local coordinate 100.

Intervals in self are mapped to every transcript they overlap; non-overlapping intervals are not reported. ref must contain a column whose name is supplied in ref_on. Those values become the Chromosome column of the output.

The strand of each returned interval is the “product” of the strands of the overlapping pair (e.g. + × - → -).
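
The re-basing and the strand “product” described above can be sketched in pure Python. This is an illustration under stated assumptions, not the library's implementation: intervals are 0-based and half-open, and the names map_interval_to_local and strand_product are hypothetical.

```python
def strand_product(a, b):
    """Equal strands give "+", different strands give "-"."""
    return "+" if a == b else "-"


def map_interval_to_local(start, end, exons, strand):
    """Re-base a genomic interval onto the spliced transcript made of `exons`.

    exons: list of (start, end) half-open genomic intervals of one transcript.
    Returns the (local_start, local_end) pieces in 5'->3' order.
    """
    # Walk the exons in 5'->3' order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    pieces = []
    offset = 0  # total length of the exons already walked (introns removed)
    for ex_start, ex_end in ordered:
        ov_start, ov_end = max(start, ex_start), min(end, ex_end)
        if ov_start < ov_end:  # the interval overlaps this exon
            if strand == "+":
                pieces.append((offset + ov_start - ex_start,
                               offset + ov_end - ex_start))
            else:
                # On "-" the 5' end of the exon is its genomic End.
                pieces.append((offset + ex_end - ov_end,
                               offset + ex_end - ov_start))
        offset += ex_end - ex_start
    return pieces


# First base of the second exon of a transcript whose first exon is 100 nt long.
print(map_interval_to_local(300, 301, [(100, 200), (300, 400)], "+"))
print(strand_product("+", "-"))
```

Intervals that overlap no exon yield an empty list, matching the rule that non-overlapping intervals are not reported.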

Parameters:
  • ref (PyRanges) – Reference ranges in genomic coordinates that define the new coordinate system (e.g. multi-exon transcript annotation).

  • ref_on (str) – Column in ref that groups intervals into transcripts. Values are copied into Chromosome in the output.

  • match_by (str or list[str], optional) – If provided, only overlapping intervals with an equal value in column(s) match_by are reported.

  • keep_chrom (bool, default False) – If True, keep the global Chromosome column in the output.

  • keep_loc (bool, default False) – If True, keep the original global location columns (Start, End, Strand) in the output.

Returns:

Intervals remapped to local (transcript) coordinates, preserving the original row order, index and metadata columns of self.

Return type:

PyRanges

Warning

A single self interval may overlap several ref exons, or different transcripts. In that case its index repeats in the output. Call reset_index() afterwards if you need unique indices.

Examples

>>> import pandas as pd, pyranges1 as pr
>>> tr = pr.PyRanges(pd.DataFrame({
...     "Chromosome":   ["chr1","chr1","chr1","chr1"],
...     "Start":        [  100,   300,   1000, 1100],
...     "End":          [  200,   400,   1050, 1200],
...     "Strand":       ["+","+", "-","-"],
...     "transcript_id":["tx1","tx1","tx2","tx2"],
... }))
>>> g1 = pr.PyRanges(pd.DataFrame({
...     "Chromosome": ["chr1","chr1","chr1","chr1","chr1","chr1","chr1"],
...     "Start":      [ 110,   220,   320,   340,   500,  1030, 1180],
...     "End":        [ 180,   240,   340,   360,   550,  1050, 1200],
...     "Strand":     ["+","+", "+",  "-","+",   "-","+"],
...     "label":      ["a","no_overlap_intronic","b","c",
...                    "no_overlap_intergenic","d","e"],
... }))
>>> tr
  index  |    Chromosome      Start      End  Strand    transcript_id
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ---------------
      0  |    chr1              100      200  +         tx1
      1  |    chr1              300      400  +         tx1
      2  |    chr1             1000     1050  -         tx2
      3  |    chr1             1100     1200  -         tx2
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> g1
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ---------------------
      0  |    chr1              110      180  +         a
      1  |    chr1              220      240  +         no_overlap_intronic
      2  |    chr1              320      340  +         b
      3  |    chr1              340      360  -         c
      4  |    chr1              500      550  +         no_overlap_intergenic
      5  |    chr1             1030     1050  -         d
      6  |    chr1             1180     1200  +         e
PyRanges with 7 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> g1.map_to_local(tr, "transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    tx1                10       80  +         a
      2  |    tx1               120      140  +         b
      3  |    tx1               140      160  -         c
      5  |    tx2               100      120  +         d
      6  |    tx2                 0       20  -         e
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.

A genomic interval spanning two exons is split:

>>> g2 = pr.PyRanges(pd.DataFrame({
...     "Chromosome":["chr1"], "Start":[180], "End":[330],
...     "Strand":["+"], "label":["q"]}))
>>> g2.map_to_local(tr, "transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    tx1                80      100  +         q
      0  |    tx1               100      130  +         q
PyRanges with 2 rows, 5 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes and 1 strands.

An interval in self that overlaps multiple transcripts in ref is reported once per transcript:

>>> tr2 = pr.PyRanges(pd.DataFrame({
...      "Chromosome":   ["chr1","chr1","chr1","chr1"],
...      "Start":        [  100,   300,   110, 300],
...      "End":          [  200,   400,   200, 380],
...      "Strand":       ["+","+","-","-"],
...      "transcript_id":["tx1.1","tx1.1","tx1.2","tx1.2"],
... }))
>>> g3 = pr.PyRanges(pd.DataFrame({
...     "Chromosome": ["chr1"], "Start": [150], "End": [180], "Strand": ["+"], "label": ["x"]}))
>>> g3.map_to_local(tr2, "transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    tx1.1              50       80  +         x
      0  |    tx1.2             100      130  -         x
PyRanges with 2 rows, 5 columns, and 1 index columns (with 1 index duplicates).
Contains 2 chromosomes and 2 strands.
>>> g3.map_to_local(tr2, "transcript_id", keep_chrom=True)
  index  |    Chromosome      Start      End  Strand    label    Chromosome_global
  int64  |    str             int64    int64  str       str      str
-------  ---  ------------  -------  -------  --------  -------  -------------------
      0  |    tx1.1              50       80  +         x        chr1
      0  |    tx1.2             100      130  -         x        chr1
PyRanges with 2 rows, 6 columns, and 1 index columns (with 1 index duplicates).
Contains 2 chromosomes and 2 strands.
>>> g3.map_to_local(tr2, "transcript_id", keep_loc=True)
  index  |    Chromosome      Start      End  Strand    label      Start_global    End_global  Strand_global
  int64  |    str             int64    int64  str       str               int64         int64  str
-------  ---  ------------  -------  -------  --------  -------  --------------  ------------  ---------------
      0  |    tx1.1              50       80  +         x                   100           200  +
      0  |    tx1.2             100      130  -         x                   110           200  -
PyRanges with 2 rows, 8 columns, and 1 index columns (with 1 index duplicates).
Contains 2 chromosomes and 2 strands.

Explicitly restrict what to report using match_by:

>>> g4 = g3.assign(transcript_id="tx1.1")
>>> g4.map_to_local(tr2, "transcript_id", match_by="transcript_id")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    tx1.1              50       80  +         x
PyRanges with 1 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> tr3 = tr2.copy()
>>> tr3["label"] = ["x", "b", "c", "d"]
>>> g4.map_to_local(tr3, "transcript_id", match_by="label")
  index  |    Chromosome      Start      End  Strand    label    transcript_id
  int64  |    str             int64    int64  str       str      str
-------  ---  ------------  -------  -------  --------  -------  ---------------
      0  |    tx1.1              50       80  +         x        tx1.1
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
max_disjoint_overlaps(use_strand: Literal['auto'] | bool = 'auto', *, slack: int = 0, match_by: str | Iterable[str] | None = None, preserve_input_order: bool = True) PyRanges

Find the maximal disjoint set of intervals.

Returns a subset of the rows in self so that no two intervals overlap, choosing those that maximize the number of intervals in the result.

Parameters:
  • use_strand ({"auto", True, False}, default: "auto") – Find the max-disjoint set separately for each strand. The default "auto" means True if PyRanges has valid strands (see .strand_valid).

  • slack (int, default 0) – Length by which the criteria of overlap are loosened. A value of 1 implies that book-ended intervals are considered overlapping. Higher values allow more distant intervals (with a maximum distance of slack-1 between them).

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A PyRanges containing the maximal disjoint set of intervals.

Return type:

PyRanges

See also

PyRanges.merge_overlaps

merge intervals into non-overlapping super-intervals

PyRanges.split_overlaps

split intervals into non-overlapping sub-intervals

PyRanges.cluster

annotate overlapping intervals with a common ID

Examples

>>> gr = pr.example_data.f1
>>> gr
  index  |    Chromosome      Start      End  Name         Score  Strand
  int64  |    category        int64    int64  str          int64  category
-------  ---  ------------  -------  -------  ---------  -------  ----------
      0  |    chr1                3        6  interval1        0  +
      1  |    chr1                5        7  interval2        0  -
      2  |    chr1                8        9  interval3        0  +
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.max_disjoint_overlaps(use_strand=False)
  index  |    Chromosome      Start      End  Name         Score  Strand
  int64  |    category        int64    int64  str          int64  category
-------  ---  ------------  -------  -------  ---------  -------  ----------
      0  |    chr1                3        6  interval1        0  +
      2  |    chr1                8        9  interval3        0  +
PyRanges with 2 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Strand-aware selection:

>>> c = pr.PyRanges(dict(
...     Chromosome=["chr1"] * 8,
...     Start=[1, 4, 10, 12, 19, 20, 24, 28],
...     End=[5, 7, 14, 16, 27, 22, 25, 30],
...     Strand=["+", "+", "+", "-", "+", "+", "+", "+"]
... ))
>>> c
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      1  |    chr1                4        7  +
      2  |    chr1               10       14  +
      3  |    chr1               12       16  -
      4  |    chr1               19       27  +
      5  |    chr1               20       22  +
      6  |    chr1               24       25  +
      7  |    chr1               28       30  +
PyRanges with 8 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> c.max_disjoint_overlaps(use_strand=True)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      2  |    chr1               10       14  +
      3  |    chr1               12       16  -
      4  |    chr1               19       27  +
      7  |    chr1               28       30  +
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Use match_by to exempt rows from mutual overlap. Here every row has a unique label, so no two rows are compared and all are kept:

>>> c3 = c.copy()
>>> c3["label"] = [f"x{i}" for i in range(len(c3))]
>>> c3.max_disjoint_overlaps(match_by="label")
  index  |    Chromosome      Start      End  Strand    label
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                1        5  +         x0
      1  |    chr1                4        7  +         x1
      2  |    chr1               10       14  +         x2
      3  |    chr1               12       16  -         x3
      4  |    chr1               19       27  +         x4
      5  |    chr1               20       22  +         x5
      6  |    chr1               24       25  +         x6
      7  |    chr1               28       30  +         x7
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
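
One simple way to build a disjoint subset is a single pass over the intervals in start order, keeping each interval that does not overlap the last one kept. The sketch below is pure Python and is not the library's implementation; the function name greedy_disjoint is hypothetical, and the slack handling follows the parameter description above (slack=1 treats book-ended intervals as overlapping).

```python
def greedy_disjoint(intervals, slack=0):
    """Scan half-open (start, end) intervals in sorted order and keep those
    that do not overlap the previously kept interval."""
    kept = []
    last_end = None
    for start, end in sorted(intervals):
        # With slack=0 book-ended intervals (start == last_end) are kept.
        if last_end is None or start >= last_end + slack:
            kept.append((start, end))
            last_end = end
    return kept


# The "+" strand intervals of c from the strand-aware example.
plus = [(1, 5), (4, 7), (10, 14), (19, 27), (20, 22), (24, 25), (28, 30)]
print(greedy_disjoint(plus))
print(greedy_disjoint([(1, 5), (5, 8)], slack=1))
```

On the + strand intervals of c this keeps 1-5, 10-14, 19-27 and 28-30, the same rows as in the strand-aware example.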
merge_overlaps(use_strand: Literal['auto'] | bool = 'auto', *, count_col: str | None = None, match_by: str | Iterable[str] | None = None, slack: int = 0) PyRanges

Merge overlapping intervals into one.

Returns a PyRanges with superintervals that are the union of overlapping intervals.

Parameters:
  • use_strand ({"auto", True, False}, default: "auto") – Only merge intervals on same strand. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • count_col (str or None, default None) – Name of the column storing the number of intervals merged into each superinterval. If None, no count column is added.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • slack (int, default 0) – Allow this many nucleotides between each interval to merge.

Returns:

PyRanges with superintervals. Metadata columns, index, and order are not preserved.

Return type:

PyRanges

Note

To avoid losing metadata, use cluster instead. If you want to perform a reduction function on the metadata, use pandas groupby.

See also

PyRanges.cluster

annotate overlapping intervals with common ID

PyRanges.max_disjoint_overlaps

find the maximal disjoint set of intervals

PyRanges.split_overlaps

split intervals into non-overlapping subintervals

Examples

>>> gr = pr.example_data.ensembl_gtf.get_with_loc_columns(["Feature", "gene_name"])
>>> gr
index    |    Chromosome    Start    End      Strand      Feature     gene_name
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
0        |    1             11868    14409    +           gene        DDX11L1
1        |    1             11868    14409    +           transcript  DDX11L1
2        |    1             11868    12227    +           exon        DDX11L1
3        |    1             12612    12721    +           exon        DDX11L1
...      |    ...           ...      ...      ...         ...         ...
7        |    1             120724   133723   -           transcript  AL627309.1
8        |    1             133373   133723   -           exon        AL627309.1
9        |    1             129054   129223   -           exon        AL627309.1
10       |    1             120873   120932   -           exon        AL627309.1
PyRanges with 11 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.merge_overlaps(count_col="Count")
  index  |      Chromosome    Start      End  Strand         Count
  int64  |        category    int64    int64  category      uint32
-------  ---  ------------  -------  -------  ----------  --------
      0  |               1    11868    14409  +                  5
      1  |               1   110952   111357  -                  1
      2  |               1   112699   112804  -                  1
      3  |               1   120724   133723  -                  4
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.merge_overlaps(count_col="Count", match_by="gene_name")
  index  |      Chromosome    Start      End  Strand      gene_name       Count
  int64  |        category    int64    int64  category    str            uint32
-------  ---  ------------  -------  -------  ----------  -----------  --------
      0  |               1    11868    14409  +           DDX11L1             5
      1  |               1   110952   111357  -           AL627309.1          1
      2  |               1   112699   112804  -           AL627309.1          1
      3  |               1   120724   133723  -           AL627309.1          4
PyRanges with 4 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
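
The merge with a count column can be sketched in pure Python as a sweep over intervals sorted by start. This is an illustration, not the library's implementation: merge_with_counts is a hypothetical name, intervals are taken to be half-open, and the treatment of slack at the boundary (book-ended intervals merge only when slack is at least 1) is an assumption.

```python
def merge_with_counts(intervals, slack=0):
    """Merge overlapping half-open (start, end) intervals.

    Returns (start, end, count) tuples, where count is the number of input
    intervals folded into each superinterval.
    """
    merged = []  # each entry is [start, end, count]
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1] + slack:
            # Overlaps the current superinterval: extend it and count the row.
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])
    return [tuple(m) for m in merged]


# Intervals on the "-" strand of the example above.
minus = [(110952, 111357), (112699, 112804), (120724, 133723),
         (133373, 133723), (129054, 129223), (120873, 120932)]
print(merge_with_counts(minus))
```

The three superintervals and their counts (1, 1, 4) agree with the - strand rows of the merge_overlaps output above.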
nearest_ranges(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', direction: Literal['any', 'upstream', 'downstream'] = 'any', *, k: int = 1, match_by: str | Iterable[str] | None = None, suffix: str = '_b', exclude_overlaps: bool = False, dist_col: str | None = 'Distance', preserve_input_order: bool = True) PyRanges

Find closest interval.

For each interval in self PyRanges, the columns of the nearest interval in other PyRanges are appended.

Parameters:
  • other (PyRanges) – PyRanges to find nearest interval in.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • exclude_overlaps (bool, default False) – If True, intervals in other that overlap with self are not reported as the nearest ones.

  • direction ({"any", "upstream", "downstream"}, default "any", i.e. both directions) – Whether to only look for nearest in one direction.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be matched.

  • k (int, default 1) – Number of nearest intervals to fetch.

  • suffix (str, default "_b") – Suffix to give columns with shared name in other.

  • dist_col (str or None, default "Distance") – Name of the column in which to store the distance. If None, no distance column is added.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A PyRanges with the columns of the nearest interval appended horizontally.

Return type:

PyRanges

See also

PyRanges.join_overlaps

Has a slack argument to find intervals within a distance.

Examples

>>> f1 = pr.example_data.f1.remove_nonloc_columns()
>>> f1
  index  |    Chromosome      Start      End  Strand
  int64  |    category        int64    int64  category
-------  ---  ------------  -------  -------  ----------
      0  |    chr1                3        6  +
      1  |    chr1                5        7  -
      2  |    chr1                8        9  +
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f2 = pr.PyRanges(dict(Chromosome="chr1", Start=[1, 6, 20], End=[2, 7, 22], Strand=["+", "-", "+"]))
>>> f2
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        2  +
      1  |    chr1                6        7  -
      2  |    chr1               20       22  +
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f1.nearest_ranges(f2)
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                    1        2  +                    2
      1  |    chr1                5        7  -           chr1                    6        7  -                    0
      2  |    chr1                8        9  +           chr1                    1        2  +                    7
PyRanges with 3 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f1.nearest_ranges(f2, strand_behavior='ignore')
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                    6        7  -                    1
      1  |    chr1                5        7  -           chr1                    6        7  -                    0
      2  |    chr1                8        9  +           chr1                    6        7  -                    2
PyRanges with 3 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> f1.nearest_ranges(f2, k=2, strand_behavior='ignore')
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                    6        7  -                    1
      0  |    chr1                3        6  +           chr1                    1        2  +                    2
      1  |    chr1                5        7  -           chr1                    6        7  -                    0
      1  |    chr1                5        7  -           chr1                    1        2  +                    4
      2  |    chr1                8        9  +           chr1                    6        7  -                    2
      2  |    chr1                8        9  +           chr1                    1        2  +                    7
PyRanges with 6 rows, 9 columns, and 1 index columns (with 3 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> left = pr.PyRanges({"Chromosome": ["chr1", "chr1"], "Start": [10, 1], "End": [14, 2]})
>>> right = pr.PyRanges(
...     {"Chromosome": ["chr1", "chr1"], "Start": [4, 1], "End": [6, 6], "Hit": ["first", "second"]}
... )
>>> left.nearest_ranges(right, strand_behavior="ignore", dist_col=None)
  index  |    Chromosome      Start      End  Chromosome_b      Start_b    End_b  Hit_b
  int64  |    str             int64    int64  str                 int64    int64  str
-------  ---  ------------  -------  -------  --------------  ---------  -------  -------
      0  |    chr1               10       14  chr1                    4        6  first
      0  |    chr1               10       14  chr1                    1        6  second
      1  |    chr1                1        2  chr1                    1        6  second
PyRanges with 3 rows, 7 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes.
>>> left.nearest_ranges(right, strand_behavior="ignore", dist_col=None, preserve_input_order=False)
  index  |    Chromosome      Start      End  Chromosome_b      Start_b    End_b  Hit_b
  int64  |    str             int64    int64  str                 int64    int64  str
-------  ---  ------------  -------  -------  --------------  ---------  -------  -------
      0  |    chr1               10       14  chr1                    1        6  second
      0  |    chr1               10       14  chr1                    4        6  first
      1  |    chr1                1        2  chr1                    1        6  second
PyRanges with 3 rows, 7 columns, and 1 index columns (with 1 index duplicates).
Contains 1 chromosomes.
>>> f1.nearest_ranges(f2, strand_behavior='ignore', exclude_overlaps=True)
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                    6        7  -                    1
      1  |    chr1                5        7  -           chr1                    1        2  +                    4
      2  |    chr1                8        9  +           chr1                    6        7  -                    2
PyRanges with 3 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Touching intervals are not treated as overlapping, so they are reported as nearest intervals with distance 1:

>>> r1 = pr.PyRanges({"Chromosome": "chr1", "Start": [1], "End": [5], "Strand": ["+"]})
>>> r2 = pr.PyRanges({"Chromosome": "chr1", "Start": [5], "End": [10], "Strand": ["+"]})
>>> pd.DataFrame(r2.nearest_ranges(r1))
  Chromosome  Start  End Strand Chromosome_b  Start_b  End_b Strand_b  Distance
0       chr1      5   10      +         chr1        1      5        +         1
>>> f1.nearest_ranges(f2, direction='downstream')
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                   20       22  +                   15
      2  |    chr1                8        9  +           chr1                   20       22  +                   12
      1  |    chr1                5        7  -           chr1                    6        7  -                    0
PyRanges with 3 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

If an input interval has no suitable nearest interval, its row is dropped:

>>> f1.nearest_ranges(f2, direction='upstream', exclude_overlaps=True)
  index  |    Chromosome      Start      End  Strand      Chromosome_b      Start_b    End_b  Strand_b      Distance
  int64  |    category        int64    int64  category    str                 int64    int64  str              int64
-------  ---  ------------  -------  -------  ----------  --------------  ---------  -------  ----------  ----------
      0  |    chr1                3        6  +           chr1                    1        2  +                    2
      2  |    chr1                8        9  +           chr1                    1        2  +                    7
PyRanges with 2 rows, 9 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
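
The Distance values in the examples above are consistent with a simple rule: overlapping intervals have distance 0, touching (book-ended) intervals have distance 1, and otherwise the distance is the gap between the intervals plus 1. A pure-Python sketch of that rule follows; it is inferred from the outputs shown, not taken from the library, and the names interval_distance and k_nearest are hypothetical. Unlike the library output above, the sketch does not report ties beyond k.

```python
def interval_distance(a, b):
    """Distance between two half-open (start, end) intervals."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start < b_end and b_start < a_end:
        return 0  # the intervals overlap
    if b_end <= a_start:
        return a_start - b_end + 1  # b lies to the left of a
    return b_start - a_end + 1      # b lies to the right of a


def k_nearest(query, candidates, k=1):
    """Return the k candidates closest to query, nearest first."""
    return sorted(candidates, key=lambda c: interval_distance(query, c))[:k]


f2 = [(1, 2), (6, 7), (20, 22)]
print(interval_distance((5, 10), (1, 5)))  # touching intervals
print(k_nearest((3, 6), f2, k=2))          # strand ignored
```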
outer_ranges(group_by: str | Iterable[str] | None = None, use_strand: Literal['auto'] | bool = 'auto') PyRanges

Return the boundaries (the minimum start and maximum end) of groups of intervals (e.g. transcripts).

Parameters:
  • group_by (str or list of str or None) – Name(s) of column(s) to group intervals (e.g. into multi-exon transcripts). If None, intervals are grouped by chromosome, and strand if present and valid (see .strand_valid).

  • use_strand ({"auto", True, False}, default: "auto") – Whether to group only intervals on the same strand. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

Returns:

One interval per group, with the min(Start) and max(End) of the group

Return type:

PyRanges

See also

PyRanges.complement_ranges

return the internal complement of intervals, i.e. its introns.

Examples

>>> gr = pr.PyRanges({"Chromosome": [1, 1, 1], "Start": [1, 60, 110], "End": [40, 68, 130],
...                   "transcript_id": ["tr1", "tr1", "tr2"], "meta": ["a", "b", "c"]})
>>>
>>> gr["Length"] = gr.lengths()
>>> gr
  index  |      Chromosome    Start      End  transcript_id    meta      Length
  int64  |           int64    int64    int64  str              str        int64
-------  ---  ------------  -------  -------  ---------------  ------  --------
      0  |               1        1       40  tr1              a             39
      1  |               1       60       68  tr1              b              8
      2  |               1      110      130  tr2              c             20
PyRanges with 3 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.outer_ranges("transcript_id")
  index  |      Chromosome    Start      End  transcript_id
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  ---------------
      0  |               1        1       68  tr1
      1  |               1      110      130  tr2
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.outer_ranges()
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1        1      130
PyRanges with 1 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
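
The grouping rule above (one interval per group, spanning min(Start) to max(End)) can be sketched in plain Python. This is only an illustration of the semantics, not the library's implementation: the helper name outer_bounds is hypothetical, and chromosome and strand handling are left out for brevity.

```python
def outer_bounds(rows, key):
    """Map each group to (min Start, max End).

    rows: iterable of dicts with "Start", "End" and the grouping column.
    Hypothetical helper; chromosome and strand are ignored in this sketch.
    """
    bounds = {}
    for row in rows:
        group = row[key]
        if group in bounds:
            start, end = bounds[group]
            bounds[group] = (min(start, row["Start"]), max(end, row["End"]))
        else:
            bounds[group] = (row["Start"], row["End"])
    return bounds


# Same coordinates as the example above.
exons = [
    {"Start": 1, "End": 40, "transcript_id": "tr1"},
    {"Start": 60, "End": 68, "transcript_id": "tr1"},
    {"Start": 110, "End": 130, "transcript_id": "tr2"},
]
print(outer_bounds(exons, "transcript_id"))
# {'tr1': (1, 68), 'tr2': (110, 130)}
```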
overlap(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', slack: int = 0, *, multiple: bool = False, contained_intervals_only: bool = False, match_by: str | Iterable[str] | None = None, invert: bool = False, preserve_input_order: bool = True) PyRanges

Return overlapping intervals.

Returns the intervals in self which overlap with those in other.

Parameters:
  • other (PyRanges) – PyRanges to find overlaps with.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • slack (int, default 0) – Intervals in self are temporarily extended by slack on both ends before overlaps are computed, so that non-overlapping intervals count as overlapping when they are less than slack apart. For example, slack=1 also reports bookended intervals.

  • multiple (bool, default False) – Which intervals to report when multiple intervals in ‘other’ overlap with the same interval in self. If True, each interval is reported once for every overlap, potentially resulting in duplicate indices.

  • contained_intervals_only (bool, default False) – Whether to report only intervals that are entirely contained in an interval of ‘other’.

  • match_by (str or list, default None) – If provided, only overlapping intervals with an equal value in column(s) match_by are reported.

  • invert (bool, default False) – If True, return the intervals that do not overlap instead, according to all other criteria specified.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.
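
How slack, contained_intervals_only and invert interact can be sketched with plain tuples. This is a simplified model under stated assumptions, not the library's code: intervals are half-open (start, end) pairs on a single chromosome, strand is ignored, slack is non-negative, and the helper names (overlaps, contained, select) are hypothetical. For simplicity the sketch does not apply slack in containment mode.

```python
def overlaps(a, b, slack=0):
    """True if half-open interval a overlaps b after widening a by slack."""
    a_start, a_end = a
    b_start, b_end = b
    return a_start - slack < b_end and b_start < a_end + slack


def contained(a, b):
    """True if a lies entirely within b."""
    return b[0] <= a[0] and a[1] <= b[1]


def select(self_ivs, other_ivs, slack=0, contained_only=False, invert=False):
    """Keep intervals of self_ivs that match (or, with invert, do not match)."""
    kept = []
    for a in self_ivs:
        if contained_only:
            hit = any(contained(a, b) for b in other_ivs)
        else:
            hit = any(overlaps(a, b, slack) for b in other_ivs)
        if hit != invert:
            kept.append(a)
    return kept


self_chr1 = [(1, 3), (10, 11)]
other_chr1 = [(2, 3), (2, 9)]
print(select(self_chr1, other_chr1))               # [(1, 3)]
print(select(self_chr1, other_chr1, slack=2))      # [(1, 3), (10, 11)]
print(select(self_chr1, other_chr1, invert=True))  # [(10, 11)]
print(overlaps((5, 10), (10, 15)))                 # False: bookended
print(overlaps((5, 10), (10, 15), slack=1))        # True
```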

Returns:

A PyRanges with overlapping intervals.

Return type:

PyRanges

See also

PyRanges.intersect_overlaps

report overlapping subintervals

PyRanges.set_intersect_overlaps

set-intersect PyRanges (e.g. merge then intersect)

Examples

>>> gr = pr.PyRanges({"Chromosome": ["chr1", "chr1", "chr2", "chr1", "chr3"], "Start": [1, 1, 4, 10, 0],
...                    "End": [3, 3, 9, 11, 1], "ID": ["A", "a", "b", "c", "d"]})
>>> gr2 = pr.PyRanges({"Chromosome": ["chr1", "chr1", "chr2"], "Start": [2, 2, 1], "End": [3, 9, 10]})
>>> gr
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      2  |    chr2                4        9  b
      3  |    chr1               10       11  c
      4  |    chr3                0        1  d
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 3 chromosomes.
>>> gr2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                2        3
      1  |    chr1                2        9
      2  |    chr2                1       10
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.overlap(gr2)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      2  |    chr2                4        9  b
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.overlap(gr2, multiple=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      1  |    chr1                1        3  a
      2  |    chr2                4        9  b
PyRanges with 5 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 2 chromosomes.
>>> a = pr.PyRanges({"Chromosome": ["chr1", "chr1"], "Start": [5, 1], "End": [7, 3], "ID": ["A", "B"]})
>>> b = pr.PyRanges({"Chromosome": ["chr1", "chr1"], "Start": [2, 6], "End": [4, 8]})
>>> a.overlap(b, multiple=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                5        7  A
      1  |    chr1                1        3  B
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> a.overlap(b, multiple=True, preserve_input_order=False)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      1  |    chr1                1        3  B
      0  |    chr1                5        7  A
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.overlap(gr2, invert=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      3  |    chr1               10       11  c
      4  |    chr3                0        1  d
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.overlap(gr2, slack=2)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      2  |    chr2                4        9  b
      3  |    chr1               10       11  c
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.overlap(gr2, slack=2, invert=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      4  |    chr3                0        1  d
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.overlap(gr2, contained_intervals_only=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      2  |    chr2                4        9  b
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.overlap(gr2, contained_intervals_only=True, invert=True)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      3  |    chr1               10       11  c
      4  |    chr3                0        1  d
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr.overlap(gr2, contained_intervals_only=True, slack=-2)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  A
      1  |    chr1                1        3  a
      2  |    chr2                4        9  b
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 2 chromosomes.
>>> gr3 = pr.PyRanges({"Chromosome": 1, "Start": [2, 4], "End": [3, 5], "Strand": ["+", "-"]})
>>> gr3
  index  |      Chromosome    Start      End  Strand
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |               1        2        3  +
      1  |               1        4        5  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr4 = pr.PyRanges({"Chromosome": 1, "Start": [0], "End": [10], "Strand": ["-"]})
>>> gr3.overlap(gr4, strand_behavior="opposite")
  index  |      Chromosome    Start      End  Strand
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |               1        2        3  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
remove_nonloc_columns() PyRanges

Remove all columns that are not genome location columns (Chromosome, Start, End, Strand).

Examples

>>> gr = pr.PyRanges({"Chromosome": [1], "Start": [895], "Strand": ["+"], "Score": [1], "Score2": [2], "End": [1259]})
>>> gr
  index  |      Chromosome    Start  Strand      Score    Score2      End
  int64  |           int64    int64  str         int64     int64    int64
-------  ---  ------------  -------  --------  -------  --------  -------
      0  |               1      895  +               1         2     1259
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> gr.remove_nonloc_columns()
  index  |      Chromosome    Start      End  Strand
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |               1      895     1259  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
remove_strand() PyRanges

Return a copy with the Strand column removed.

Strand is removed regardless of whether it contains valid strand info.

See also

PyRanges.strand_valid

whether PyRanges has valid strand info

PyRanges.invert_strand

invert plus <-> minus Strand in all intervals

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
...                   'End': [5, 8], 'Strand': ['+', '-']})
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      1  |    chr1                6        8  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.remove_strand()
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1        5
      1  |    chr1                6        8
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
set_intersect_overlaps(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', *, multiple: Literal['first', 'all', 'last', 'contained'] = 'all', preserve_input_order: bool = True) PyRanges

Return set-theoretical intersection.

Like intersect_overlaps, but both PyRanges are merged first.

Parameters:
  • other (PyRanges) – PyRanges to set-intersect.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

  • multiple ({"all", "first", "last"}, default "all") – What to report when multiple merged intervals in ‘other’ overlap with the same merged interval in self. The default “all” reports all overlapping subintervals. “first” reports, for each merged self interval, only the overlapping ‘other’ subinterval with the smallest Start. “last” reports only the overlapping subinterval with the largest End in ‘other’.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A PyRanges with overlapping subintervals. Input index is not preserved. No columns other than Chromosome, Start, End, and optionally Strand are returned.

Return type:

PyRanges

See also

PyRanges.set_union_overlaps

set-theoretical union

PyRanges.intersect_overlaps

find overlapping subintervals

PyRanges.overlap

report overlapping intervals

PyRanges.merge_overlaps

merge overlapping intervals

Examples

>>> r1 = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [5, 20, 40],"End": [10, 30, 50], "ID": ["a", "b", "c"]})
>>> r1
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                5       10  a
      1  |    chr1               20       30  b
      2  |    chr1               40       50  c
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r2 = pr.PyRanges({"Chromosome": ["chr1"] * 4, "Start": [7, 18, 25, 28], "End": [9, 22, 33, 32]})
>>> r2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                7        9
      1  |    chr1               18       22
      2  |    chr1               25       33
      3  |    chr1               28       32
PyRanges with 4 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r1.set_intersect_overlaps(r2, multiple='first')
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                7        9
      1  |    chr1               20       22
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> r1.set_intersect_overlaps(r2)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                7        9
      1  |    chr1               20       22
      2  |    chr1               25       30
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
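
The "merge first, then intersect" behaviour can be reproduced on plain tuples. This is a minimal sketch of the set-theoretical idea, assuming half-open (start, end) intervals on one chromosome and no strand. The function names merge and set_intersect are hypothetical and do not reflect how the library implements the operation.

```python
def merge(intervals):
    """Merge overlapping half-open intervals (bookended ones stay separate)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def set_intersect(a, b):
    """Intersect two interval sets after merging each of them."""
    a, b = merge(a), merge(b)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Same coordinates as r1 and r2 above.
r1 = [(5, 10), (20, 30), (40, 50)]
r2 = [(7, 9), (18, 22), (25, 33), (28, 32)]
print(set_intersect(r1, r2))
# [(7, 9), (20, 22), (25, 30)]
```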
set_union_overlaps(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto') PyRanges

Return set-theoretical union.

Returns the regions present in either self or other. Both PyRanges are merged first.

Parameters:
  • other (PyRanges) – PyRanges to do union with.

  • strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.

Returns:

A PyRanges with the union of intervals. Input index is not preserved. No columns other than Chromosome, Start, End, and optionally Strand are returned.

Return type:

PyRanges

See also

PyRanges.set_intersect_overlaps

set-theoretical intersection

PyRanges.overlap

report overlapping intervals

PyRanges.merge_overlaps

merge overlapping intervals

Examples

>>> gr = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [1, 4, 10],
...                    "End": [3, 9, 11], "ID": ["a", "b", "c"]})
>>> gr
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  a
      1  |    chr1                4        9  b
      2  |    chr1               10       11  c
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr2 = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [2, 2, 9], "End": [3, 9, 10]})
>>> gr2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                2        3
      1  |    chr1                2        9
      2  |    chr1                9       10
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.set_union_overlaps(gr2)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1        9
      1  |    chr1                9       10
      2  |    chr1               10       11
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Merging bookended intervals:

>>> gr.set_union_overlaps(gr2).merge_overlaps(slack=1)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1       11
PyRanges with 1 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
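
Both results above can be checked by hand with a small plain-Python model. It assumes half-open (start, end) intervals on one chromosome, no strand, and that slack widens the merge condition so that intervals less than slack apart are joined. The names merge and set_union are hypothetical; this is a sketch of the idea, not the library's implementation.

```python
def merge(intervals, slack=0):
    """Merge intervals that overlap or are less than slack apart."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1] + slack:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def set_union(a, b):
    """Union of two interval sets: pool them, then merge overlaps."""
    return merge(list(a) + list(b))


# Same coordinates as gr and gr2 above.
gr = [(1, 3), (4, 9), (10, 11)]
gr2 = [(2, 3), (2, 9), (9, 10)]
union = set_union(gr, gr2)
print(union)                  # [(1, 9), (9, 10), (10, 11)]
print(merge(union, slack=1))  # [(1, 11)]
```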
slice_ranges(start: int | Sequence[int] | ndarray = 0, end: int | Sequence[int] | ndarray | None = None, group_by: str | Iterable[str] | None = None, use_strand: Literal['auto'] | bool = 'auto', *, count_introns: bool = False, preserve_input_order: bool = True) PyRanges

Return sub-intervals of self, cut according to start and end.

The slice window can be a single pair (scalar start, end) that is applied to every row, or one value per row supplied as a 1-D sequence or NumPy array. When vectors are used their length must equal the number of rows in self.

A positive coordinate is counted from the 5’ end (left end on plus strand, right end on minus strand). A negative coordinate is counted from the 3’ end (for example -1 is the last nucleotide). end=None means “up to the 3’ end”.

Parameters:
  • start (int or 1-D array-like of int, default 0) – Inclusive start offset.

  • end (int or 1-D array-like of int or None, default None) – Exclusive end offset. None means the existing 3’ end.

  • group_by (str or list of str, optional) – Column name(s) that define groups (for example exons in one transcript). If given, slicing is performed on the spliced transcript; introns are ignored unless count_introns is True.

  • use_strand ({"auto", True, False}, default "auto") – If True the 5’/3’ logic honours the strand column. If False every interval is treated as plus strand. “auto” selects True when the PyRanges has valid strand information.

  • count_introns (bool, default False) – If False (default) start and end refer to spliced coordinates (introns ignored). If True they refer to unspliced coordinates.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A new PyRanges object with the requested sub-intervals.

Return type:

PyRanges

Notes

  • Out-of-bounds requests are silently truncated to the existing span.

  • Negative offsets are resolved after the transcript length is known, so they always count from the 3’ end of the (spliced or unspliced) interval in question.
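
For a single interval, the offset rules described above (5’-based counting, negative offsets from the 3’ end, truncation of out-of-bounds requests) can be sketched as follows. The function slice_interval and its parameter names lo and hi are hypothetical, and returning None for an empty slice is a choice made for this sketch only.

```python
def slice_interval(start, end, strand="+", lo=0, hi=None):
    """Return the (Start, End) sub-interval, or None if the slice is empty.

    lo/hi are offsets from the 5' end; negative values count from the 3' end.
    """
    length = end - start
    if hi is None:
        hi = length
    if lo < 0:
        lo += length
    if hi < 0:
        hi += length
    # Out-of-bounds offsets are truncated to the existing span.
    lo = min(max(lo, 0), length)
    hi = min(max(hi, 0), length)
    if lo >= hi:
        return None
    if strand == "-":
        # On the minus strand the 5' end is the right end (End).
        return (end - hi, end - lo)
    return (start + lo, start + hi)


print(slice_interval(1, 11, "+", 0, 5))   # (1, 6)
print(slice_interval(10, 25, "-", 0, 5))  # (20, 25)
print(slice_interval(40, 60, "+", -10))   # (50, 60)
print(slice_interval(10, 25, "-", -10))   # (10, 20)
```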

See also

PyRanges.window_ranges

divide intervals into windows

Examples

>>> p  = pr.PyRanges({"Chromosome": [1, 1, 2, 2, 3],
...                   "Strand": ["+", "+", "-", "-", "+"],
...                   "Start": [1, 40, 10, 70, 140],
...                   "End": [11, 60, 25, 80, 152],
...                   "transcript_id":["t1", "t1", "t2", "t2", "t3"] })
>>> p
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              40       60  t1
      2  |               2  -              10       25  t2
      3  |               2  -              70       80  t2
      4  |               3  +             140      152  t3
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Get the first 5 nucleotides of each interval, counting from the 5’ end:

>>> p.slice_ranges(0, 5)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1        6  t1
      1  |               1  +              40       45  t1
      2  |               2  -              20       25  t2
      3  |               2  -              75       80  t2
      4  |               3  +             140      145  t3
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Get the last 10 nucleotides of each interval. End is omitted to get the existing 3’ end:

>>> p.slice_ranges(-10)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              50       60  t1
      2  |               2  -              10       20  t2
      3  |               2  -              70       80  t2
      4  |               3  +             142      152  t3
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Get the first 15 nucleotides of each spliced transcript, grouping exons by transcript_id:

>>> p.slice_ranges(0, 15, group_by='transcript_id')
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              40       45  t1
      2  |               2  -              20       25  t2
      3  |               2  -              70       80  t2
      4  |               3  +             140      152  t3
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Get the last 20 nucleotides of each spliced transcript:

>>> p.slice_ranges(-20, group_by='transcript_id')
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      1  |               1  +              40       60  t1
      2  |               2  -              10       25  t2
      3  |               2  -              70       75  t2
      4  |               3  +             140      152  t3
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Use use_strand=False to treat all intervals as if they were on the + strand:

>>> p.slice_ranges(0, 15, group_by='transcript_id', use_strand=False)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              40       45  t1
      2  |               2  -              10       25  t2
      4  |               3  +             140      152  t3
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Get the region from 25 to 60 of each spliced transcript, truncated to the portion that exists:

>>> p.slice_ranges(25, 60, group_by='transcript_id')
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      1  |               1  +              55       60  t1
PyRanges with 1 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Get each spliced transcript without its first and last 3 nucleotides:

>>> p.slice_ranges(3, -3, group_by='transcript_id')
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               4       11  t1
      1  |               1  +              40       57  t1
      2  |               2  -              13       25  t2
      3  |               2  -              70       77  t2
      4  |               3  +             143      149  t3
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

With count_introns=True, start and end refer to the unspliced transcript, i.e. introns are counted. This fetches all interval portions that overlap with the first 50 nt of each transcript:

>>> p.slice_ranges(0, 50, group_by='transcript_id', count_introns=True)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              40       51  t1
      3  |               2  -              70       80  t2
      4  |               3  +             140      152  t3
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
>>> p.slice_ranges(0, 50, group_by='transcript_id', count_introns=True, use_strand=False)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       11  t1
      1  |               1  +              40       51  t1
      2  |               2  -              10       25  t2
      4  |               3  +             140      152  t3
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
>>> p.slice_ranges(-50, -5, group_by='transcript_id', count_introns=True)
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +              10       11  t1
      1  |               1  +              40       55  t1
      2  |               2  -              15       25  t2
      4  |               3  +             140      147  t3
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
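
The spliced (group_by, count_introns=False) case can be modelled by walking the exons of one transcript in 5’ to 3’ order and tracking the cumulative spliced offset. The sketch below is an illustration only: slice_spliced is a hypothetical helper, exons are assumed not to overlap, and results are returned in input order.

```python
def slice_spliced(exons, strand, start=0, end=None):
    """Slice one transcript in spliced coordinates.

    exons: list of non-overlapping (Start, End) tuples of a single transcript.
    Returns the surviving sub-exons in input order.
    """
    total = sum(e - s for s, e in exons)
    lo = start if start >= 0 else total + start
    hi = total if end is None else (end if end >= 0 else total + end)
    lo = min(max(lo, 0), total)
    hi = min(max(hi, 0), total)

    pieces = {}
    offset = 0  # spliced position of the current exon's 5' end
    for s, e in sorted(exons, reverse=(strand == "-")):
        length = e - s
        local_lo = max(lo - offset, 0)
        local_hi = min(hi - offset, length)
        if local_lo < local_hi:
            if strand == "-":
                pieces[(s, e)] = (e - local_hi, e - local_lo)
            else:
                pieces[(s, e)] = (s + local_lo, s + local_hi)
        offset += length
    return [pieces[x] for x in exons if x in pieces]


t1 = [(1, 11), (40, 60)]   # plus strand
t2 = [(10, 25), (70, 80)]  # minus strand
print(slice_spliced(t2, "-", 0, 15))   # [(20, 25), (70, 80)]
print(slice_spliced(t1, "+", -20))     # [(40, 60)]
print(slice_spliced(t2, "-", 3, -3))   # [(13, 25), (70, 77)]
print(slice_spliced(t1, "+", 25, 60))  # [(55, 60)]
```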
sort_ranges(by: str | Iterable[str] | None = None, *, natsort: bool = True, use_strand: Literal['auto'] | bool = 'auto') PyRanges

Sort PyRanges according to Chromosome, Strand (if present), Start, and End; or by the specified columns.

If PyRanges is stranded and use_strand is True, intervals on the negative strand are sorted in descending order, and End is considered before Start. This is to have a 5’ to 3’ order. For uses not covered by this function, use DataFrame.sort_values().

Parameters:
  • by (str or list of str, default None) – If provided, sorting occurs by Chromosome, Strand (if present), *by, Start, and End.

  • use_strand ({"auto", True, False}, default: "auto") – Whether negative strand intervals should be sorted in descending order, meaning 5’ to 3’. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • natsort (bool, default True) – Whether to use natural sorting for Chromosome column, so that e.g. chr2 < chr11.

Returns:

Sorted PyRanges. The index is preserved. Use .reset_index(drop=True) to reset the index.

Return type:

PyRanges
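
The ordering described above can be expressed as a sort key. The sketch below is one possible model of it, not the library's implementation: natural_key and sort_key are hypothetical names, rows are plain (Chromosome, Strand, Start, End) tuples, and negating End and Start is just one way to obtain a 5’ to 3’ order on the minus strand.

```python
import re


def natural_key(name):
    """Split digit runs so that e.g. chr2 sorts before chr11."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", str(name))]


def sort_key(row, natsort=True, use_strand=True):
    chrom, strand, start, end = row
    chrom_key = natural_key(chrom) if natsort else chrom
    if use_strand and strand == "-":
        # Descending order, End considered before Start.
        return (chrom_key, strand, -end, -start)
    return (chrom_key, strand, start, end)


rows = [
    ("chr1", "+", 40, 60), ("chr1", "+", 1, 11), ("chr1", "-", 10, 25),
    ("chr1", "-", 70, 80), ("chr2", "+", 300, 400), ("chr11", "+", 140, 152),
    ("chr11", "+", 160, 190), ("chr1", "+", 90, 100),
]
order = sorted(range(len(rows)), key=lambda i: sort_key(rows[i]))
print(order)
# [1, 0, 7, 3, 2, 4, 5, 6]
```

The printed positions correspond to the index column of the sorted output, since sorting preserves the original index.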

Examples

>>> p = pr.PyRanges({"Chromosome": ["chr1", "chr1", "chr1", "chr1", "chr2", "chr11", "chr11", "chr1"],
...                  "Strand": ["+", "+", "-", "-", "+", "+", "+",  "+"],
...                  "Start": [40, 1, 10, 70, 300, 140, 160, 90],
...                  "End": [60, 11, 25, 80, 400, 152, 190, 100],
...                  "transcript_id":["t3", "t3", "t2", "t2", "t4", "t5", "t5", "t1"]})
>>> p
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |    chr1          +              40       60  t3
      1  |    chr1          +               1       11  t3
      2  |    chr1          -              10       25  t2
      3  |    chr1          -              70       80  t2
      4  |    chr2          +             300      400  t4
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
      7  |    chr1          +              90      100  t1
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
>>> p.sort_ranges(natsort=False)
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      7  |    chr1          +              90      100  t1
      3  |    chr1          -              70       80  t2
      2  |    chr1          -              10       25  t2
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
      4  |    chr2          +             300      400  t4
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Do not sort negative strand intervals in descending order:

>>> p.sort_ranges(use_strand=False, natsort=False)
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      7  |    chr1          +              90      100  t1
      2  |    chr1          -              10       25  t2
      3  |    chr1          -              70       80  t2
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
      4  |    chr2          +             300      400  t4
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Sort chromosomes in natural order:

>>> p.sort_ranges()
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      7  |    chr1          +              90      100  t1
      3  |    chr1          -              70       80  t2
      2  |    chr1          -              10       25  t2
      4  |    chr2          +             300      400  t4
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Sort by ‘transcript_id’ before sorting by Start and End (but after Chromosome and Strand):

>>> p.sort_ranges(by='transcript_id', natsort=False)
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      7  |    chr1          +              90      100  t1
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      3  |    chr1          -              70       80  t2
      2  |    chr1          -              10       25  t2
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
      4  |    chr2          +             300      400  t4
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Sort by ‘transcript_id’ before sorting by Strand, Start and End:

>>> res = p.sort_ranges(natsort=False)
>>> res.sort_values("transcript_id", kind="stable")
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      7  |    chr1          +              90      100  t1
      3  |    chr1          -              70       80  t2
      2  |    chr1          -              10       25  t2
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      4  |    chr2          +             300      400  t4
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.

Same as before, but ‘transcript_id’ is sorted in descending order:

>>> res = p.sort_ranges(natsort=False)
>>> res.sort_values("transcript_id", kind="stable", ascending=False)
  index  |    Chromosome    Strand      Start      End  transcript_id
  int64  |    str           str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      5  |    chr11         +             140      152  t5
      6  |    chr11         +             160      190  t5
      4  |    chr2          +             300      400  t4
      1  |    chr1          +               1       11  t3
      0  |    chr1          +              40       60  t3
      3  |    chr1          -              70       80  t2
      2  |    chr1          -              10       25  t2
      7  |    chr1          +              90      100  t1
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
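
The ordering rules illustrated in the examples above can be approximated in plain Python. This is only a sketch under stated assumptions, not the pyranges1 implementation: the helpers natural_key and sort_key are hypothetical names, and intervals are represented as (chromosome, strand, start, end) tuples rather than as a PyRanges.

```python
import re

def natural_key(chromosome):
    """Split 'chr11' into text and integer parts so that chr2 sorts before chr11."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", chromosome)]

def sort_key(interval, use_strand=True, natsort=True):
    """Hypothetical sort key: chromosome, then strand, then coordinates."""
    chromosome, strand, start, end = interval
    chrom_key = natural_key(chromosome) if natsort else chromosome
    # Assumption drawn from the examples: with use_strand=True, intervals on
    # the negative strand are listed from the largest coordinate downwards.
    if use_strand and strand == "-":
        return (chrom_key, strand, -start, -end)
    return (chrom_key, strand, start, end)

intervals = [
    ("chr1", "+", 40, 60), ("chr1", "+", 1, 11), ("chr1", "-", 10, 25),
    ("chr1", "-", 70, 80), ("chr2", "+", 300, 400), ("chr11", "+", 140, 152),
    ("chr11", "+", 160, 190), ("chr1", "+", 90, 100),
]
ordered = sorted(intervals, key=sort_key)
```

Here ordered lists chr1, then chr2, then chr11, with the two negative-strand intervals in descending order, mirroring the default output shown above.
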
split_overlaps(use_strand: Literal['auto'] | bool = 'auto', *, match_by: str | Iterable[str] | None = None, between: bool = False) PyRanges

Split into non-overlapping intervals.

The output does not contain overlapping intervals, but intervals that are adjacent are not merged. No columns other than Chromosome, Start, End, and Strand (if present) are output.

Parameters:
  • use_strand ({"auto", True, False}, default: "auto") – Whether to split only intervals on the same strand. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • between (bool, default False) – Also output intervals corresponding to the gaps between the intervals in self.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be split.

Returns:

PyRanges with intervals split at overlap points.

Return type:

PyRanges

See also

PyRanges.merge_overlaps

merge overlapping intervals

PyRanges.max_disjoint_overlaps

find the maximal disjoint set of intervals

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1', 'chr1'], 'Start': [3, 5, 5, 11],
...                   'End': [6, 9, 7, 12], 'Strand': ['+', '+', '-', '-']})
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        6  +
      1  |    chr1                5        9  +
      2  |    chr1                5        7  -
      3  |    chr1               11       12  -
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.split_overlaps()
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        5  +
      1  |    chr1                5        6  +
      2  |    chr1                6        9  +
      3  |    chr1                5        7  -
      4  |    chr1               11       12  -
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.split_overlaps(between=True)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                3        5  +
      1  |    chr1                5        6  +
      2  |    chr1                6        9  +
      3  |    chr1                5        7  -
      4  |    chr1                7       11  -
      5  |    chr1               11       12  -
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.split_overlaps(use_strand=False)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                3        5
      1  |    chr1                5        6
      2  |    chr1                6        7
      3  |    chr1                7        9
      4  |    chr1               11       12
PyRanges with 5 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.split_overlaps(use_strand=False, between=True)
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                3        5
      1  |    chr1                5        6
      2  |    chr1                6        7
      3  |    chr1                7        9
      4  |    chr1                9       11
      5  |    chr1               11       12
PyRanges with 6 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr['ID'] = ['a', 'b', 'a', 'c']
>>> gr
  index  |    Chromosome      Start      End  Strand    ID
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    chr1                3        6  +         a
      1  |    chr1                5        9  +         b
      2  |    chr1                5        7  -         a
      3  |    chr1               11       12  -         c
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.split_overlaps(use_strand=False, match_by='ID')
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                3        5  a
      1  |    chr1                5        6  a
      2  |    chr1                6        7  a
      3  |    chr1                5        9  b
      4  |    chr1               11       12  c
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
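
The splitting behaviour in the examples above can be reproduced for a single chromosome/strand group with a breakpoint sweep. This is a sketch for illustration only, not the library's algorithm; split_at_breakpoints is a hypothetical helper that works on plain (start, end) tuples.

```python
def split_at_breakpoints(intervals, between=False):
    """Split (start, end) intervals of one group at every interval boundary."""
    # Every start and end position is a potential split point.
    breakpoints = sorted({pos for start, end in intervals for pos in (start, end)})
    pieces = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # A piece is kept if some input interval spans it entirely;
        # uncovered pieces are gaps, kept only when between=True.
        covered = any(start <= left and right <= end for start, end in intervals)
        if covered or between:
            pieces.append((left, right))
    return pieces

plus = [(3, 6), (5, 9)]
minus = [(5, 7), (11, 12)]
stranded = split_at_breakpoints(plus) + split_at_breakpoints(minus)
unstranded = split_at_breakpoints(plus + minus, between=True)
```

Processing each strand separately corresponds to the stranded examples, while pooling all intervals corresponds to use_strand=False.
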
property strand_valid: bool

Whether PyRanges has valid strand info.

Values other than ‘+’ and ‘-’ in the Strand column are not considered valid. A PyRanges without a Strand column is also not considered to have valid strand info.

See also

PyRanges.has_strand

whether a Strand column is present

PyRanges.make_strand_valid

make the strand information in PyRanges valid

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
...                   'End': [5, 8], 'Strand': ['+', '.']})
>>> gr
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        5  +
      1  |    chr1                6        8  .
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands (including non-genomic strands: .).
>>> gr.strand_valid  # invalid strand value: '.'
False
>>> "Strand" in gr.columns
True
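
The rule described above can be written as a small check. This is a sketch of the stated rule only; check_strand_valid is a hypothetical helper, and passing None stands in for a PyRanges without a Strand column.

```python
def check_strand_valid(strands):
    """Return True only if strand values exist and are all '+' or '-'."""
    if strands is None:  # stands in for a missing Strand column
        return False
    return all(strand in ("+", "-") for strand in strands)

mixed = check_strand_valid(["+", "."])   # '.' is not a valid genomic strand
valid = check_strand_valid(["+", "-"])
```
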
subtract_overlaps(other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto', *, match_by: str | Iterable[str] | None = None, preserve_input_order: bool = True) PyRanges

Subtract intervals, i.e. return non-overlapping subintervals.

Identify intervals in other that overlap with intervals in self; return self with the overlapping parts removed.

Parameters:
  • other – PyRanges to subtract.

  • strand_behavior ("auto", "same", "opposite", "ignore") – How to handle strand information. “auto” means use “same” if both PyRanges are stranded, otherwise ignore the strand information. “same” means only subtract intervals on the same strand. “opposite” means only subtract intervals on the opposite strand. “ignore” means ignore strand information.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

PyRanges with subintervals from self that do not overlap with any interval in other. Columns and index are preserved.

Return type:

PyRanges

Warning

The returned PyRanges may have index duplicates. Call .reset_index(drop=True) to fix it.

See also

PyRanges.overlap

use with invert=True to return all intervals without overlap

PyRanges.complement_ranges

return the internal complement of intervals, i.e. its introns.

Examples

>>> gr = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [1, 4, 10],
...                    "End": [3, 9, 11], "ID": ["a", "b", "c"]})
>>> gr2 = pr.PyRanges({"Chromosome": ["chr1"] * 3, "Start": [2, 2, 9], "End": [3, 9, 10]})
>>> gr
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        3  a
      1  |    chr1                4        9  b
      2  |    chr1               10       11  c
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr2
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                2        3
      1  |    chr1                2        9
      2  |    chr1                9       10
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.subtract_overlaps(gr2)
  index  |    Chromosome      Start      End  ID
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                1        2  a
      2  |    chr1               10       11  c
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr['tag'] = ['x', 'y', 'z']
>>> gr2['tag'] = ['x', 'w', 'z']
>>> gr
  index  |    Chromosome      Start      End  ID     tag
  int64  |    str             int64    int64  str    str
-------  ---  ------------  -------  -------  -----  -----
      0  |    chr1                1        3  a      x
      1  |    chr1                4        9  b      y
      2  |    chr1               10       11  c      z
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr2
  index  |    Chromosome      Start      End  tag
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  -----
      0  |    chr1                2        3  x
      1  |    chr1                2        9  w
      2  |    chr1                9       10  z
PyRanges with 3 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.subtract_overlaps(gr2, match_by="tag")
  index  |    Chromosome      Start      End  ID     tag
  int64  |    str             int64    int64  str    str
-------  ---  ------------  -------  -------  -----  -----
      0  |    chr1                1        2  a      x
      1  |    chr1                4        9  b      y
      2  |    chr1               10       11  c      z
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes.
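
The subtraction shown above can be sketched for one interval at a time. The code below is an illustrative approximation rather than the library's implementation: subtract is a hypothetical helper, and the match_by behaviour is imitated by filtering the other intervals on a shared tag before subtracting.

```python
def subtract(interval, others):
    """Return the parts of (start, end) not covered by any interval in others."""
    start, end = interval
    remaining = []
    cursor = start  # leftmost position not yet known to be covered
    for o_start, o_end in sorted(others):
        if o_end <= cursor or o_start >= end:
            continue  # no overlap with the part still to be processed
        if o_start > cursor:
            remaining.append((cursor, o_start))
        cursor = max(cursor, o_end)
        if cursor >= end:
            break
    if cursor < end:
        remaining.append((cursor, end))
    return remaining

query = [(1, 3, "x"), (4, 9, "y"), (10, 11, "z")]
other = [(2, 3, "x"), (2, 9, "w"), (9, 10, "z")]

# Without match_by: every interval in other may be subtracted.
plain = [subtract((s, e), [(o_s, o_e) for o_s, o_e, _ in other])
         for s, e, _ in query]
# With match_by="tag": only intervals carrying the same tag are subtracted.
matched = [subtract((s, e), [(o_s, o_e) for o_s, o_e, o_tag in other if o_tag == tag])
           for s, e, tag in query]
```

An empty list for a query interval means it was removed entirely, as happens to interval b in the first example.
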
summary(*, return_df: bool = False) DataFrame | None

Return a summary of info regarding this PyRanges object.

In the output, the row “count” refers to the number of intervals and “sum” to their total length. The remaining rows describe the distribution of interval lengths.

The column “pyrange” describes the data as is. “coverage_forward” and “coverage_reverse” describe the data after strand-specific merging of overlapping intervals. “coverage_unstranded” describes the data after merging, without considering the strands.

Parameters:

return_df (bool, default False) – If True, return the summary as a DataFrame instead of printing it.

Return type:

None or pd.DataFrame with summary.

Examples

>>> gr = pr.example_data.ensembl_gtf.get_with_loc_columns(["Feature", "gene_id"])
>>> gr
index    |    Chromosome    Start    End      Strand      Feature     gene_id
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  ---------------
0        |    1             11868    14409    +           gene        ENSG00000223972
1        |    1             11868    14409    +           transcript  ENSG00000223972
2        |    1             11868    12227    +           exon        ENSG00000223972
3        |    1             12612    12721    +           exon        ENSG00000223972
...      |    ...           ...      ...      ...         ...         ...
7        |    1             120724   133723   -           transcript  ENSG00000238009
8        |    1             133373   133723   -           exon        ENSG00000238009
9        |    1             129054   129223   -           exon        ENSG00000238009
10       |    1             120873   120932   -           exon        ENSG00000238009
PyRanges with 11 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.summary()
         pyrange    coverage_forward    coverage_reverse    coverage_unstranded
-----  ---------  ------------------  ------------------  ---------------------
count      11                      1                3                      4
mean     1893.27                2541             4503                   4012.5
std      3799.24                 nan             7359.28                6088.38
min        59                   2541              105                    105
25%       139                   2541              255                    330
50%       359                   2541              405                   1473
75%      1865                   2541             6702                   5155.5
max     12999                   2541            12999                  12999
sum     20826                   2541            13509                  16050
>>> gr.summary(return_df=True)
            pyrange  coverage_forward  coverage_reverse  coverage_unstranded
count     11.000000               1.0          3.000000             4.000000
mean    1893.272727            2541.0       4503.000000          4012.500000
std     3799.238610               NaN       7359.280671          6088.379834
min       59.000000            2541.0        105.000000           105.000000
25%      139.000000            2541.0        255.000000           330.000000
50%      359.000000            2541.0        405.000000          1473.000000
75%     1865.000000            2541.0       6702.000000          5155.500000
max    12999.000000            2541.0      12999.000000         12999.000000
sum    20826.000000            2541.0      13509.000000         16050.000000
three_end(group_by: str | Iterable[str] | None = None, ext: int = 0) PyRanges

Return the three prime end of intervals.

The three prime end is the end of a forward-strand interval or the start of a reverse-strand interval. Unless ext is set, all returned intervals have a length of 1.

Parameters:
  • group_by (str or list of str, default: None) – Optional column name(s). If provided, the three prime end is calculated for each group of intervals (e.g. for each transcript).

  • ext (int, default 0) – Lengthen the resulting intervals on both ends by this amount.

Returns:

PyRanges with the three prime ends

Return type:

PyRanges

See also

PyRanges.upstream

return regions upstream of input intervals or transcripts

PyRanges.five_end

return the 5’ end of intervals or transcripts

PyRanges.extend_ranges

return intervals or transcripts extended at one or both ends

Examples

>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 10, 5], 'End': [9, 14, 7],
...                    'Strand': ["+", "+", "-"], 'Name': ['a', 'a', 'b']})
>>> gr
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                3        9  +         a
      1  |    chr1               10       14  +         a
      2  |    chr1                5        7  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.three_end()
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      0  |    chr1                8        9  +         a
      1  |    chr1               13       14  +         a
      2  |    chr1                5        6  -         b
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.three_end(group_by='Name')
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      1  |    chr1               13       14  +         a
      2  |    chr1                5        6  -         b
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.three_end(group_by='Name', ext=1)
  index  |    Chromosome      Start      End  Strand    Name
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  ------
      1  |    chr1               12       15  +         a
      2  |    chr1                4        7  -         b
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
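
The coordinate arithmetic behind the examples above can be sketched as follows. This is an illustration under assumptions, not the library code: three_end_of and group_three_end are hypothetical helpers, and intervals are plain (start, end, strand) tuples.

```python
def three_end_of(start, end, strand, ext=0):
    """Return the 1-bp interval at the 3' end, extended by ext on both sides."""
    if strand == "-":
        point = (start, start + 1)   # 3' end of a reverse-strand interval
    else:
        point = (end - 1, end)       # 3' end of a forward-strand interval
    return (point[0] - ext, point[1] + ext)

def group_three_end(intervals, ext=0):
    """3' end of a group of same-strand intervals, e.g. the exons of a transcript."""
    strand = intervals[0][2]
    if strand == "-":
        chosen = min(intervals, key=lambda iv: iv[0])   # leftmost start
    else:
        chosen = max(intervals, key=lambda iv: iv[1])   # rightmost end
    return three_end_of(*chosen, ext=ext)

group_a = [(3, 9, "+"), (10, 14, "+")]
group_b = [(5, 7, "-")]
ends = [group_three_end(group_a), group_three_end(group_b)]
```
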
tile_ranges(tile_size: int, *, use_strand: bool = False, match_by: str | Iterable[str] | None = None, overlap_column: str | None = None) PyRanges

Return overlapping genomic tiles.

The genome is divided into bookended tiles of length tile_size. One tile is returned for each interval that overlaps with it, including any metadata from the original intervals.

Parameters:
  • tile_size (int) – Length of the tiles.

  • overlap_column (str, default None) – Name of a column to add, containing the fraction of each tile that is overlapped by the original interval.

  • use_strand ({"auto", True, False}, default: "auto") – Whether negative strand intervals should be windowed in reverse order. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • match_by (str | Sequence[str] | None, default None) – Column name(s) used to form groups when iterating rows. For tile, grouping does not change which tiles are produced or their overlap fractions: tiles are always taken from a fixed genomic grid with step tile_size (boundaries at multiples of tile_size), and each interval is intersected with that grid independently. Group boundaries only affect iteration order; there is no cross-interval “carry” for tile.

Returns:

Tiled PyRanges.

Return type:

PyRanges

Warning

The returned PyRanges may have index duplicates. Call .reset_index(drop=True) to fix it.

See also

PyRanges.window_ranges

divide intervals into windows

pyranges.tile_genome

divide the genome into tiles

Examples

>>> gr = pr.example_data.ensembl_gtf.get_with_loc_columns(["Feature", "gene_name"])
>>> gr
index    |    Chromosome    Start    End      Strand      Feature     gene_name
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
0        |    1             11868    14409    +           gene        DDX11L1
1        |    1             11868    14409    +           transcript  DDX11L1
2        |    1             11868    12227    +           exon        DDX11L1
3        |    1             12612    12721    +           exon        DDX11L1
...      |    ...           ...      ...      ...         ...         ...
7        |    1             120724   133723   -           transcript  AL627309.1
8        |    1             133373   133723   -           exon        AL627309.1
9        |    1             129054   129223   -           exon        AL627309.1
10       |    1             120873   120932   -           exon        AL627309.1
PyRanges with 11 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.tile_ranges(200)
index    |    Chromosome    Start    End      Strand      Feature     gene_name
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
0        |    1             11800    12000    +           gene        DDX11L1
0        |    1             12000    12200    +           gene        DDX11L1
0        |    1             12200    12400    +           gene        DDX11L1
0        |    1             12400    12600    +           gene        DDX11L1
...      |    ...           ...      ...      ...         ...         ...
8        |    1             133600   133800   -           exon        AL627309.1
9        |    1             129000   129200   -           exon        AL627309.1
9        |    1             129200   129400   -           exon        AL627309.1
10       |    1             120800   121000   -           exon        AL627309.1
PyRanges with 116 rows, 6 columns, and 1 index columns (with 105 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> gr.tile_ranges(100, overlap_column="TileOverlap")
index    |    Chromosome    Start    End      Strand      Feature     gene_name    TileOverlap
int64    |    category      int64    int64    category    category    str          float64
-------  ---  ------------  -------  -------  ----------  ----------  -----------  -------------
0        |    1             11800    11900    +           gene        DDX11L1      0.32
0        |    1             11900    12000    +           gene        DDX11L1      1.0
0        |    1             12000    12100    +           gene        DDX11L1      1.0
0        |    1             12100    12200    +           gene        DDX11L1      1.0
...      |    ...           ...      ...      ...         ...         ...          ...
9        |    1             129100   129200   -           exon        AL627309.1   1.0
9        |    1             129200   129300   -           exon        AL627309.1   0.23
10       |    1             120800   120900   -           exon        AL627309.1   0.27
10       |    1             120900   121000   -           exon        AL627309.1   0.32
PyRanges with 223 rows, 7 columns, and 1 index columns (with 212 index duplicates).
Contains 1 chromosomes and 2 strands.
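
The fixed tile grid and the overlap fractions in the last example can be sketched for a single interval. This is a minimal illustration, not the library's implementation; tile_interval is a hypothetical helper operating on plain integers.

```python
def tile_interval(start, end, tile_size):
    """Yield (tile_start, tile_end, overlap_fraction) for each grid tile hit by [start, end)."""
    # Tiles lie on a fixed grid whose boundaries are multiples of tile_size.
    tile_start = (start // tile_size) * tile_size
    while tile_start < end:
        tile_end = tile_start + tile_size
        overlap = min(end, tile_end) - max(start, tile_start)
        yield tile_start, tile_end, overlap / tile_size
        tile_start = tile_end

# The last exon from the example above (120873-120932) with tile_size=100.
tiles = list(tile_interval(120873, 120932, 100))
```

The two resulting tiles carry fractions 0.27 and 0.32, matching the TileOverlap values shown for index 10.
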
to_bed(path: str | None = None, compression: Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd'] | dict[str, Any] | None = None, *, keep: bool = True) str | None

Write to bed.

Parameters:
  • path (str, default None) – Where to write. If None, returns string representation.

  • keep (bool, default True) – Whether to keep all columns, not just Chromosome, Start, End, Name, Score, Strand when writing.

  • compression (str) – Compression type to use. By default, it is inferred from the file extension. See pandas.DataFrame.to_csv for more info.

Examples

>>> d =  {'Chromosome': ['chr1', 'chr1'], 'Start': [1, 6],
...       'End': [5, 8], 'Strand': ['+', '-'], "Gene": [1, 2]}
>>> gr = pr.PyRanges(d)
>>> gr
  index  |    Chromosome      Start      End  Strand       Gene
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                1        5  +               1
      1  |    chr1                6        8  -               2
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.to_bed()
'chr1\t1\t5\t.\t.\t+\t1\nchr1\t6\t8\t.\t.\t-\t2\n'

File contents:

chr1        1       5       .       .       +       1
chr1        6       8       .       .       -       2

Does not include noncanonical bed-column Gene:

>>> gr.to_bed(keep=False)
'chr1\t1\t5\t.\t.\t+\nchr1\t6\t8\t.\t.\t-\n'

File contents:

chr1        1       5       .       .       +
chr1        6       8       .       .       -
>>> gr.to_bed("test.bed")
>>> open("test.bed").readlines()
['chr1\t1\t5\t.\t.\t+\t1\n', 'chr1\t6\t8\t.\t.\t-\t2\n']
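
The line layout seen in these examples can be sketched in plain Python. This is an approximation inferred from the outputs above, not the library's writer: bed_line and BED_COLUMNS are hypothetical names, and each interval is a plain dict.

```python
BED_COLUMNS = ["Chromosome", "Start", "End", "Name", "Score", "Strand"]

def bed_line(row, keep=True):
    """Format one interval (a dict) as a tab-separated BED line."""
    # Canonical columns come first; missing ones are written as '.'.
    fields = [str(row.get(column, ".")) for column in BED_COLUMNS]
    if keep:
        # Any other column is appended after the canonical ones.
        fields += [str(value) for column, value in row.items()
                   if column not in BED_COLUMNS]
    return "\t".join(fields) + "\n"

rows = [
    {"Chromosome": "chr1", "Start": 1, "End": 5, "Strand": "+", "Gene": 1},
    {"Chromosome": "chr1", "Start": 6, "End": 8, "Strand": "-", "Gene": 2},
]
text = "".join(bed_line(row) for row in rows)
text_canonical = "".join(bed_line(row, keep=False) for row in rows)
```
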
to_bigwig(path: None = None, chromosome_sizes: DataFrame | dict | None = None, value_col: str | None = None, *, divide: bool = False, rpm: bool = True, return_data=False) PyRanges | None

Compute coverage (interval-based, or using a numerical value column) and write to bigwig.

Computes a score per position; by default, it is the number of intervals spanning that position. If value_col is provided, the score is the sum of values of all intervals spanning that position. The score per position is then reduced to a minimal number of ranges with constant coverage (i.e. like a run-length encoding), and written in bigwig format to the provided path.

Note

To create one bigwig per strand, subset the PyRanges first into two separate objects (positive and negative strands).

Parameters:
  • path (str | None, default None) – Where to write bigwig. If None, return_data must be True.

  • chromosome_sizes (PyRanges or dict or None, default None) – Chromosome sizes to use. If provided, the output bigwig will span the entire chromosomes as given here. If dict, it must be a map of chromosome names to chromosome length. If a PyRanges, it must have ‘Chromosome’ and ‘End’ columns, where ‘End’ gives the chromosome length. If None, the maximum end position per chromosome in the input PyRanges is used.

  • value_col (str, default None) – Name of column to compute coverage of. If None, compute coverage (i.e. number of intervals spanning each position).

  • rpm (bool, default True) – Whether to normalize data by dividing by the total number of intervals and multiplying by 1e6.

  • divide (bool, default False) – (Only useful with value_col) Divide value coverage by regular coverage and take log2.

  • return_data (bool, default False) – Whether to return the data that would be written to bigwig as a PyRanges.

Note

Requires pybigwig and pyrle to be installed.

Examples

>>> d =  {'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [1, 4, 6],
...       'End': [7, 8, 10], 'Strand': ['+', '-', '-'],
...       'Value': [10, 20, 30]}
>>> gr = pr.PyRanges(d)
>>> gr
  index  |    Chromosome      Start      End  Strand      Value
  int64  |    str             int64    int64  str         int64
-------  ---  ------------  -------  -------  --------  -------
      0  |    chr1                1        7  +              10
      1  |    chr1                4        8  -              20
      2  |    chr1                6       10  -              30
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr.to_bigwig(return_data=True, rpm=False)
  index  |    Chromosome      Start      End      Score
  int64  |    category        int64    int64    float64
-------  ---  ------------  -------  -------  ---------
      1  |    chr1                1        4          1
      2  |    chr1                4        6          2
      3  |    chr1                6        7          3
      4  |    chr1                7        8          2
      5  |    chr1                8       10          1
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.to_bigwig(return_data=True, rpm=False, value_col="Value")
  index  |    Chromosome      Start      End      Score
  int64  |    category        int64    int64    float64
-------  ---  ------------  -------  -------  ---------
      1  |    chr1                1        4         10
      2  |    chr1                4        6         30
      3  |    chr1                6        7         60
      4  |    chr1                7        8         50
      5  |    chr1                8       10         30
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.to_bigwig(return_data=True, rpm=False, value_col="Value", divide=True)
  index  |    Chromosome      Start      End      Score
  int64  |    category        int64    int64    float64
-------  ---  ------------  -------  -------  ---------
      0  |    chr1                0        1  nan
      1  |    chr1                1        4    3.32193
      2  |    chr1                4        6    3.90689
      3  |    chr1                6        7    4.32193
      4  |    chr1                7        8    4.64386
      5  |    chr1                8       10    4.90689
PyRanges with 6 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes.
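
The scores in these examples can be recomputed with a simple sweep over interval boundaries. This is a sketch under assumptions, not the pyrle-based implementation: coverage is a hypothetical helper, intervals are (start, end, value) tuples on one chromosome, and the log2 ratio assumes both coverage lists share the same run boundaries, as they do here.

```python
from collections import defaultdict
from math import log2

def coverage(intervals):
    """Run-length encode per-position scores for (start, end, value) intervals."""
    changes = defaultdict(float)
    for start, end, value in intervals:
        changes[start] += value   # score rises where an interval begins
        changes[end] -= value     # and falls where it ends
    runs, score, previous = [], 0.0, None
    for position in sorted(changes):
        if changes[position] == 0:
            continue  # score does not change here, so the current run continues
        if previous is not None and score != 0:
            runs.append((previous, position, score))
        score += changes[position]
        previous = position
    return runs

intervals = [(1, 7, 10), (4, 8, 20), (6, 10, 30)]
counts = coverage([(start, end, 1) for start, end, _ in intervals])
values = coverage(intervals)
# divide=True: value coverage divided by interval coverage, then log2.
ratios = [(start, end, log2(value / count))
          for (start, end, count), (_, _, value) in zip(counts, values)]
```

Regions with zero coverage are simply omitted in this sketch, whereas the divide example above also reports a leading nan run.
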
to_gff3(path: None = None, compression: Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd'] | dict[str, Any] | None = None, map_cols: dict | None = None) str | None

Write to General Feature Format 3.

The GFF format consists of a tab-separated file without a header. GFF contains a fixed number of columns, indicated below (names before “:”). For each of these, PyRanges will use the corresponding column (names after “:”).

  • seqname: Chromosome

  • source: Source

  • feature: Feature

  • start: Start

  • end: End

  • score: Score

  • strand: Strand

  • phase: Frame

  • attribute: auto-filled

Columns which are not mapped to GFF columns are appended as a field in the attribute string (i.e. the last field).

Parameters:
  • path (str, default None, i.e. return string representation.) – Where to write file.

  • compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default "infer") – Which compression to use. Uses file extension to infer by default.

  • map_cols (dict, default None) – Override the mapping between GFF and PyRanges fields for any number of columns. Format: {gff_column : pyranges_column}. If a mapping is found for the “attribute” column, it is not auto-filled.

Note

Nonexisting columns will be added with a ‘.’ to represent the missing values.

See also

pyranges.read_gff3

read GFF3 files

pyranges.to_gtf

write to GTF format

Examples

>>> d = {"Chromosome": [1] * 3, "Start": [1, 3, 5], "End": [4, 6, 9], "Feature": ["gene", "exon", "exon"]}
>>> gr = pr.PyRanges(d)
>>> gr.to_gff3()
'1\t.\tgene\t2\t4\t.\t.\t.\t\n1\t.\texon\t4\t6\t.\t.\t.\t\n1\t.\texon\t6\t9\t.\t.\t.\t\n'

How the file would look:

1   .       gene    2       4       .       .       .
1   .       exon    4       6       .       .       .
1   .       exon    6       9       .       .       .
>>> gr["Gene"] = [1, 2, 3]
>>> gr["function"] = ["a b", "c", "def"]
>>> gr.to_gff3()
'1\t.\tgene\t2\t4\t.\t.\t.\tGene=1;function=a b\n1\t.\texon\t4\t6\t.\t.\t.\tGene=2;function=c\n1\t.\texon\t6\t9\t.\t.\t.\tGene=3;function=def\n'

How the file would look:

1   .       gene    2       4       .       .       .       Gene=1;function=a b
1   .       exon    4       6       .       .       .       Gene=2;function=c
1   .       exon    6       9       .       .       .       Gene=3;function=def
>>> gr["phase"] = [0, 2, 1]
>>> gr["Feature"] = ['mRNA', 'CDS', 'CDS']
>>> gr
  index  |      Chromosome    Start      End  Feature       Gene  function      phase
  int64  |           int64    int64    int64  str          int64  str           int64
-------  ---  ------------  -------  -------  ---------  -------  ----------  -------
      0  |               1        1        4  mRNA             1  a b               0
      1  |               1        3        6  CDS              2  c                 2
      2  |               1        5        9  CDS              3  def               1
PyRanges with 3 rows, 7 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.to_gff3()
'1\t.\tmRNA\t2\t4\t.\t.\t0\tGene=1;function=a b\n1\t.\tCDS\t4\t6\t.\t.\t2\tGene=2;function=c\n1\t.\tCDS\t6\t9\t.\t.\t1\tGene=3;function=def\n'

How the file would look:

1   .       mRNA    2       4       .       .       0       Gene=1;function=a b
1   .       CDS     4       6       .       .       2       Gene=2;function=c
1   .       CDS     6       9       .       .       1       Gene=3;function=def
>>> gr['custom'] = ['AA', 'BB', 'CC']
>>> gr
  index  |      Chromosome    Start      End  Feature       Gene  function      phase  custom
  int64  |           int64    int64    int64  str          int64  str           int64  str
-------  ---  ------------  -------  -------  ---------  -------  ----------  -------  --------
      0  |               1        1        4  mRNA             1  a b               0  AA
      1  |               1        3        6  CDS              2  c                 2  BB
      2  |               1        5        9  CDS              3  def               1  CC
PyRanges with 3 rows, 8 columns, and 1 index columns.
Contains 1 chromosomes.
>>> print(gr.to_gff3(map_cols={"feature": "custom"})) 
1       .       AA      2       4       .       .       0       Feature=mRNA;Gene=1;function=a b
1       .       BB      4       6       .       .       2       Feature=CDS;Gene=2;function=c
1       .       CC      6       9       .       .       1       Feature=CDS;Gene=3;function=def
>>> print(gr.to_gff3(map_cols={"attribute": "custom"})) 
1       .       mRNA    2       4       .       .       0       AA
1       .       CDS     4       6       .       .       2       BB
1       .       CDS     6       9       .       .       1       CC
to_gtf(path: None = None, compression: Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd'] | dict[str, Any] | None = None, map_cols: dict | None = None) str | None

Write to Gene Transfer Format.

The GTF format is a tab-separated file without a header. It has a fixed number of columns, listed below (names before “:”). For each of these, PyRanges uses the corresponding column (names after “:”).

  • seqname: Chromosome

  • source: Source

  • feature: Feature

  • start: Start

  • end: End

  • score: Score

  • strand: Strand

  • frame: Frame

  • attribute: auto-filled

Columns which are not mapped to GTF columns are appended as a field in the attribute string (i.e. the last field).
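
GTF and GFF3 differ mainly in how the attribute field is written. The sketch below contrasts the two conventions for the same extra columns. Both helpers are hypothetical and not part of pyranges1; the GTF form follows the common key "value"; convention, and the exact spacing that to_gtf writes is an assumption here.

```python
def gtf_attribute(extra):
    """Format extra columns in the usual GTF style: key "value"; pairs."""
    return " ".join(f'{key} "{value}";' for key, value in extra.items())


def gff3_attribute(extra):
    """Format extra columns in GFF3 style: key=value pairs joined by ';'."""
    return ";".join(f"{key}={value}" for key, value in extra.items())


extra = {"Gene": 1, "function": "a b"}
print(gtf_attribute(extra))
print(gff3_attribute(extra))
```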

Parameters:
  • path (str, default None) – Where to write the file. If None, the output is returned as a string instead.

  • compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default "infer") – Which compression to use. Uses file extension to infer by default.

  • map_cols (dict, default None) – Override the mapping between GTF and PyRanges fields for any number of columns. Format: {gtf_column : pyranges_column}. If a mapping is found for the “attribute” column, it is not auto-filled.

Note

Nonexisting columns will be added with a ‘.’ to represent the missing values.

See also

pyranges.read_gtf

read GTF files

pyranges.to_gff3

write to GFF3 format

Examples

>>> d = {"Chromosome": [1] * 3, "Start": [1, 3, 5], "End": [4, 6, 9], "Feature": ["gene", "exon", "exon"]}
>>> gr = pr.PyRanges(d)
>>> gr.to_gtf()  # the raw string output
'1\t.\tgene\t2\t4\t.\t.\t.\t\n1\t.\texon\t4\t6\t.\t.\t.\t\n1\t.\texon\t6\t9\t.\t.\t.\t\n'

What the file contents look like:

1   .       gene    2       4       .       .       .
1   .       exon    4       6       .       .       .
1   .       exon    6       9       .       .       .
>>> gr.Feature = ["GENE", "EXON", "EXON"]
>>> gr.to_gtf()  # the raw string output
'1\t.\tGENE\t2\t4\t.\t.\t.\t\n1\t.\tEXON\t4\t6\t.\t.\t.\t\n1\t.\tEXON\t6\t9\t.\t.\t.\t\n'

The file would look like:

1   .       GENE    2       4       .       .       .
1   .       EXON    4       6       .       .       .
1   .       EXON    6       9       .       .       .
>>> gr["tag"] = [11, 22, 33]
>>> gr
  index  |      Chromosome    Start      End  Feature        tag
  int64  |           int64    int64    int64  str          int64
-------  ---  ------------  -------  -------  ---------  -------
      0  |               1        1        4  GENE            11
      1  |               1        3        6  EXON            22
      2  |               1        5        9  EXON            33
PyRanges with 3 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes.
>>> print(gr.to_gff3()) 
1       .       GENE    2       4       .       .       .       tag=11
1       .       EXON    4       6       .       .       .       tag=22
1       .       EXON    6       9       .       .       .       tag=33
>>> print(gr.to_gff3(map_cols={'seqname':'tag'})) 
11      .       GENE    2       4       .       .       .       Chromosome=1
22      .       EXON    4       6       .       .       .       Chromosome=1
33      .       EXON    6       9       .       .       .       Chromosome=1
>>> print(gr.to_gff3(map_cols={'attribute':'tag'})) 
1       .       GENE    2       4       .       .       .       11
1       .       EXON    4       6       .       .       .       22
1       .       EXON    6       9       .       .       .       33
to_rle(value_col: str | None = None, strand: Literal['auto'] | bool = 'auto', *, rpm: bool = False) Rledict

Return as Rledict.

Create collection of Rles representing the coverage or other numerical value.

Parameters:
  • value_col (str, default None) – Numerical column to create Rledict from.

  • strand ({"auto", True, False}, default "auto") – Whether to treat strands separately.

  • rpm (bool, default False) – Normalize to reads per million by multiplying by 1e6 / (number of intervals).

Returns:

Rle with coverage or other info from the PyRanges.

Return type:

pyrle.Rledict
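
To make the idea of coverage stored as run lengths concrete, here is a minimal pure-Python sketch for a single chromosome. It is not how pyrle computes its result: coverage_rle is a hypothetical helper, intervals are assumed to be half-open (start, end) pairs, and the rpm scaling follows the formula given for the rpm parameter.

```python
from itertools import groupby


def coverage_rle(intervals, rpm=False):
    """Return (run_lengths, values) describing per-base coverage from position 0."""
    if not intervals:
        return [], []
    depth = [0.0] * max(end for _, end in intervals)
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1.0
    if rpm:
        factor = 1e6 / len(intervals)  # 1e6 / (number of intervals)
        depth = [d * factor for d in depth]
    # Collapse consecutive equal values into (value, run length) pairs.
    runs = [(value, len(list(group))) for value, group in groupby(depth)]
    return [length for _, length in runs], [value for value, _ in runs]


print(coverage_rle([(0, 4), (2, 6)]))
# ([2, 2, 2], [1.0, 2.0, 1.0])
```

The per-base list is only for readability; a real implementation would work on interval endpoints rather than materializing every position.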

upstream(length: int, gap: int = 0, *, group_by: str | Iterable[str] | None = None, use_strand: Literal['auto'] | bool = 'auto') PyRanges

Return regions upstream (at the 5’ side) of input intervals.

Parameters:
  • length (int) – Size of the region (bp), > 0.

  • gap (int, default 0) – Distance between region and input intervals; use negative to include some overlap.

  • group_by (str or list of str or None) – Name(s) of column(s) to group intervals. If provided, one region per group (e.g. transcript) is returned.

  • use_strand ({"auto", True, False}, default: "auto") – Whether to consider strand; if so, the upstream window of negative intervals is on their right. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

See also

PyRanges.downstream

return regions downstream of input intervals or transcripts

PyRanges.five_end

return the 5’ end of intervals or transcripts

PyRanges.extend_ranges

return intervals or transcripts extended at one or both ends

PyRanges.slice_ranges

obtain subsequences of intervals, providing transcript-level coordinates

Examples

>>> a = pr.PyRanges({'Chromosome':['chr1','chr1'],
...                  'Start':[100,200],'End':[120,220],
...                  'Strand':['+','-']})
>>> a
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1              100      120  +
      1  |    chr1              200      220  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

A 10 bp window immediately adjacent to each interval (default gap of 0):

>>> a.upstream(10)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1               90      100  +
      1  |    chr1              220      230  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

With a 5 bp gap:

>>> a.upstream(10, gap=5)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1               85       95  +
      1  |    chr1              225      235  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

With a 5 bp overlap (negative gap):

>>> a.upstream(10, gap=-5)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1               95      105  +
      1  |    chr1              215      225  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
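
The coordinate arithmetic behind the three calls above can be written out for a single interval. This is a sketch that reproduces the outputs shown, not the library code; upstream_window is a hypothetical helper.

```python
def upstream_window(start, end, strand, length, gap=0, use_strand=True):
    """Return (start, end) of the upstream window for one interval."""
    if use_strand and strand == "-":
        # On the minus strand the 5' side is to the right of the interval.
        window_start = end + gap
        return window_start, window_start + length
    window_end = start - gap
    return window_end - length, window_end


for gap in (0, 5, -5):
    print(gap,
          upstream_window(100, 120, "+", 10, gap),
          upstream_window(200, 220, "-", 10, gap))
# 0 (90, 100) (220, 230)
# 5 (85, 95) (225, 235)
# -5 (95, 105) (215, 225)
```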

Transcript-aware example (two 2-exon transcripts):

>>> ex = pr.PyRanges({'Chromosome':['chr1']*4,
...                   'Start':[0,10,30,50],'End':[5,15,40,60],
...                   'Strand':['+','+','-','-'],
...                   'Tx':['tx1','tx1','tx2','tx2']})
>>> ex
  index  |    Chromosome      Start      End  Strand    Tx
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    chr1                0        5  +         tx1
      1  |    chr1               10       15  +         tx1
      2  |    chr1               30       40  -         tx2
      3  |    chr1               50       60  -         tx2
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Note that upstream regions may extend beyond the start of the chromosome, resulting in invalid ranges. See clip_ranges() to fix this.

>>> ex.upstream(5, group_by='Tx')
  index  |    Chromosome      Start      End  Strand    Tx
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    chr1               -5        0  +         tx1
      3  |    chr1               60       65  -         tx2
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 0
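
With group_by, one window is returned per group, measured from the 5’ boundary of the whole group. The sketch below reproduces the coordinates of the transcript example in plain Python. upstream_per_group is a hypothetical helper, and it assumes every group has a single strand.

```python
def upstream_per_group(rows, length, gap=0, group_key="Tx"):
    """Return {group: (start, end)} with one upstream window per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    result = {}
    for name, members in groups.items():
        if members[0]["Strand"] == "-":
            # Minus strand: the group's 5' boundary is its largest End.
            anchor = max(r["End"] for r in members) + gap
            result[name] = (anchor, anchor + length)
        else:
            # Plus strand: the group's 5' boundary is its smallest Start.
            anchor = min(r["Start"] for r in members) - gap
            result[name] = (anchor - length, anchor)
    return result


exons = [
    {"Start": 0, "End": 5, "Strand": "+", "Tx": "tx1"},
    {"Start": 10, "End": 15, "Strand": "+", "Tx": "tx1"},
    {"Start": 30, "End": 40, "Strand": "-", "Tx": "tx2"},
    {"Start": 50, "End": 60, "Strand": "-", "Tx": "tx2"},
]
print(upstream_per_group(exons, 5))
# {'tx1': (-5, 0), 'tx2': (60, 65)}
```

The negative start for tx1 is the same invalid range reported in the output above.
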
window_ranges(window_size: int, use_strand: Literal['auto'] | bool = 'auto', group_by: str | Iterable[str] | None = None, *, add_window_id: bool = False) PyRanges

Return intervals sliced in non-overlapping windows.

Every interval is split into windows of length window_size starting from its 5’ end.

Parameters:
  • window_size (int) – Length of the windows.

  • use_strand ({"auto", True, False}, default: "auto") – Whether negative strand intervals should be sliced in descending order, meaning 5’ to 3’. The default “auto” means True if PyRanges has valid strands (see .strand_valid).

  • group_by (str | Sequence[str] | None, default None) – Column name(s) used to form groups. If provided, windowing proceeds continuously within each group: any leftover (partial) window at the end of one interval is continued at the start of the next interval with the same group_by value. The window “phase” resets at group boundaries.

  • add_window_id (bool, default False) – Use only in combination with group_by. If True, adds a column “window_id” with an index-like identifier that links the pieces of a window split across non-contiguous intervals, e.g. because the window spans an intron. The window_id starts at 0, and resets for each group.

Returns:

PyRanges with one row per window.

Return type:

PyRanges

Warning

The returned PyRanges may have duplicate index values. Call .reset_index(drop=True) to make the index unique. Moreover, the input row order may not be preserved.

See also

PyRanges.tile_ranges

divide intervals into adjacent tiles.

Examples

>>> import pyranges1 as pr
>>> gr = pr.PyRanges({"Chromosome": [1], "Start": [800], "End": [1012]})
>>> gr
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1      800     1012
PyRanges with 1 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
>>> gr.window_ranges(100)
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1      800      900
      0  |               1      900     1000
      0  |               1     1000     1012
PyRanges with 3 rows, 3 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes.
>>> gr.window_ranges(100).reset_index(drop=True)
  index  |      Chromosome    Start      End
  int64  |           int64    int64    int64
-------  ---  ------------  -------  -------
      0  |               1      800      900
      1  |               1      900     1000
      2  |               1     1000     1012
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Negative strand intervals are sliced in descending order by default:

>>> gs = pr.PyRanges({"Chromosome": [1, 1], "Start": [200, 600], "End": [332, 787], "Strand":['+', '-']})
>>> gs
  index  |      Chromosome    Start      End  Strand
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |               1      200      332  +
      1  |               1      600      787  -
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> w = gs.window_ranges(100)
>>> w['lengths'] = w.lengths() # add lengths column to see the length of the windows
>>> w
  index  |      Chromosome    Start      End  Strand      lengths
  int64  |           int64    int64    int64  str           int64
-------  ---  ------------  -------  -------  --------  ---------
      0  |               1      200      300  +               100
      0  |               1      300      332  +                32
      1  |               1      687      787  -               100
      1  |               1      600      687  -                87
PyRanges with 4 rows, 5 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> gs.window_ranges(100, use_strand=False)
  index  |      Chromosome    Start      End  Strand
  int64  |           int64    int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |               1      200      300  +
      0  |               1      300      332  +
      1  |               1      600      700  -
      1  |               1      700      787  -
PyRanges with 4 rows, 4 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes and 2 strands.
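
The effect of use_strand on a single interval can be reproduced with a few lines of plain Python. This is a sketch of the slicing rule described above, not the library implementation; window_interval is a hypothetical helper.

```python
def window_interval(start, end, window_size, strand="+", use_strand=True):
    """Split one interval into windows, starting from its 5' end."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    if use_strand and strand == "-":
        # Minus strand: walk from End down to Start.
        right = end
        while right > start:
            left = max(start, right - window_size)
            windows.append((left, right))
            right = left
    else:
        left = start
        while left < end:
            right = min(end, left + window_size)
            windows.append((left, right))
            left = right
    return windows


print(window_interval(600, 787, 100, strand="-"))
# [(687, 787), (600, 687)]
print(window_interval(600, 787, 100, strand="-", use_strand=False))
# [(600, 700), (700, 787)]
```

In both cases the shorter leftover window ends up at the 3’ end of the interval.
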
>>> gr2 = pr.example_data.ensembl_gtf.get_with_loc_columns(["Feature", "gene_name"])
>>> gr2
index    |    Chromosome    Start    End      Strand      Feature     gene_name
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
0        |    1             11868    14409    +           gene        DDX11L1
1        |    1             11868    14409    +           transcript  DDX11L1
2        |    1             11868    12227    +           exon        DDX11L1
3        |    1             12612    12721    +           exon        DDX11L1
...      |    ...           ...      ...      ...         ...         ...
7        |    1             120724   133723   -           transcript  AL627309.1
8        |    1             133373   133723   -           exon        AL627309.1
9        |    1             129054   129223   -           exon        AL627309.1
10       |    1             120873   120932   -           exon        AL627309.1
PyRanges with 11 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr2.window_ranges(1000)
index    |    Chromosome    Start    End      Strand      Feature     gene_name
int64    |    category      int64    int64    category    category    str
-------  ---  ------------  -------  -------  ----------  ----------  -----------
0        |    1             11868    12868    +           gene        DDX11L1
0        |    1             12868    13868    +           gene        DDX11L1
0        |    1             13868    14409    +           gene        DDX11L1
1        |    1             11868    12868    +           transcript  DDX11L1
...      |    ...           ...      ...      ...         ...         ...
7        |    1             120724   121723   -           transcript  AL627309.1
8        |    1             133373   133723   -           exon        AL627309.1
9        |    1             129054   129223   -           exon        AL627309.1
10       |    1             120873   120932   -           exon        AL627309.1
PyRanges with 28 rows, 6 columns, and 1 index columns (with 17 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> gr3 = pr.PyRanges({'Chromosome':1, 'Strand':list('+++--'), 'Start':[10, 30, 50, 70, 90], 'End':[20, 40, 60, 80, 100], 'ID':list('aaabb')})
>>> gr3
  index  |      Chromosome  Strand      Start      End  ID
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  -----
      0  |               1  +              10       20  a
      1  |               1  +              30       40  a
      2  |               1  +              50       60  a
      3  |               1  -              70       80  b
      4  |               1  -              90      100  b
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr3.window_ranges(8, group_by='ID')
index    |    Chromosome    Strand    Start    End      ID
int64    |    int64         str       int64    int64    str
-------  ---  ------------  --------  -------  -------  -----
0        |    1             +         10       18       a
0        |    1             +         18       20       a
1        |    1             +         30       36       a
1        |    1             +         36       40       a
...      |    ...           ...       ...      ...      ...
3        |    1             -         74       80       b
3        |    1             -         70       74       b
4        |    1             -         92       100      b
4        |    1             -         90       92       b
PyRanges with 10 rows, 5 columns, and 1 index columns (with 5 index duplicates).
Contains 1 chromosomes and 2 strands.
>>> gr4 = pr.PyRanges({'Chromosome':2, 'Strand':list('+++--'), 'Start':[30,10,50,90,70],
...                    'End':[40,20,60,100,80], 'ID':['id1','id1','id1','id2','id2']})
>>> gr4
  index  |      Chromosome  Strand      Start      End  ID
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  -----
      0  |               2  +              30       40  id1
      1  |               2  +              10       20  id1
      2  |               2  +              50       60  id1
      3  |               2  -              90      100  id2
      4  |               2  -              70       80  id2
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr4.window_ranges(8, group_by='ID').reset_index(drop=True).head(8)
  index  |      Chromosome  Strand      Start      End  ID
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  -----
      0  |               2  +              10       18  id1
      1  |               2  +              18       20  id1
      2  |               2  +              30       36  id1
      3  |               2  +              36       40  id1
      4  |               2  +              50       54  id1
      5  |               2  +              54       60  id1
      6  |               2  -              74       80  id2
      7  |               2  -              70       74  id2
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
>>> gr4.window_ranges(8, group_by='ID', add_window_id=True).reset_index(drop=True).head(8)
  index  |      Chromosome  Strand      Start      End  ID       window_id
  int64  |           int64  str         int64    int64  str          int64
-------  ---  ------------  --------  -------  -------  -----  -----------
      0  |               2  +              10       18  id1              1
      1  |               2  +              18       20  id1              2
      2  |               2  +              30       36  id1              2
      3  |               2  +              36       40  id1              3
      4  |               2  +              50       54  id1              3
      5  |               2  +              54       60  id1              4
      6  |               2  -              74       80  id2              1
      7  |               2  -              70       74  id2              2
PyRanges with 8 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
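
The continuous windowing performed with group_by can be sketched as follows: the intervals of a group are visited in 5’ to 3’ order, and a window left incomplete at the end of one interval is finished at the start of the next. window_group is a hypothetical helper, not the library code. It assumes the intervals of a group do not overlap and share one strand, and it returns the pieces in 5’ to 3’ order, which may differ from the row order pyranges1 returns.

```python
def window_group(intervals, window_size, strand="+"):
    """Split the (start, end) intervals of one group into continuous windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    reverse = strand == "-"
    pieces = []
    room = window_size  # bases still missing from the current window
    for start, end in sorted(intervals, reverse=reverse):
        if reverse:
            right = end
            while right > start:
                left = max(start, right - room)
                pieces.append((left, right))
                room -= right - left
                right = left
                if room == 0:
                    room = window_size  # current window is full, open a new one
        else:
            left = start
            while left < end:
                right = min(end, left + room)
                pieces.append((left, right))
                room -= right - left
                left = right
                if room == 0:
                    room = window_size
    return pieces


print(window_group([(30, 40), (10, 20), (50, 60)], 8, strand="+"))
# [(10, 18), (18, 20), (30, 36), (36, 40), (50, 54), (54, 60)]
print(window_group([(90, 100), (70, 80)], 8, strand="-"))
# [(92, 100), (90, 92), (74, 80), (70, 74)]
```

The pieces (18, 20) and (30, 36) together form one 8 bp window that spans the gap between the first two intervals of group id1.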