Method
To be included in the TFC2 collection, genes from the original sources had to meet the following criteria: (1) be classified as dbTF, coTF or GTF according to the curation principles of these sources, (2) be of human, mouse or rat origin and (3) be mapped to the UniProt Knowledgebase (UniProtKB), specifically to its manually annotated and reviewed Swiss-Prot section. Data were collected by first scanning the biomedical literature for updated versions of data sources already present in the original version of TFCheckpoint and for new sources dedicated to cataloging human, mouse and rat dbTFs, coTFs and GTFs. From the GO database, we retrieved dbTF, coTF and GTF proteins annotated, with any type of supporting evidence, to the following molecular function terms or their children: “DNA-binding transcription factor activity” (GO:0003700) for dbTFs, “transcription coregulator activity” (GO:0003712) for coTFs and “general transcription initiation factor activity” (GO:0140223) for GTFs. After selecting the sources, we downloaded the text files containing the TF lists from the source websites or used the supplementary files accompanying the original publications (Table 1). For some sources (ORFeome, Vaquerizas, Ravasi, TcoF-DB, Animal TFDB, Saeed, Lambert & Jolma, Lovering and TFClass), the files were readily available for download, while for others (JASPAR, GO database) we had to submit specific queries via REST APIs to retrieve the lists of interest. Except for TFClass, all sources provided files specific to a TF type (dbTF, coTF or GTF) and an organism (human, mouse or rat). For TFClass, we used an in-house script to parse the provided Turtle file.
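The GO-based selection described above can be illustrated with a minimal Python sketch. It is not the script used for TFC2: the ontology and annotations are toy data, the `TOY:` term names and `protein_N` labels are hypothetical placeholders, and "children thereof" is read here as all descendants of each root term, which is an assumption. Only the three root GO identifiers come from the text.

```python
from collections import defaultdict

# Root GO terms named in the text, one per TF class.
ROOT_TERMS = {
    "dbTF": "GO:0003700",  # DNA-binding transcription factor activity
    "coTF": "GO:0003712",  # transcription coregulator activity
    "GTF": "GO:0140223",   # general transcription initiation factor activity
}


def descendants(root, children):
    """Return the root term plus every term reachable through child links."""
    seen = {root}
    stack = [root]
    while stack:
        term = stack.pop()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify(annotations, children):
    """Map each TF class to the proteins annotated to its root term or a descendant."""
    result = defaultdict(set)
    for tf_class, root in ROOT_TERMS.items():
        terms = descendants(root, children)
        for protein, term in annotations:
            if term in terms:
                result[tf_class].add(protein)
    return dict(result)


# Toy parent -> children map (hypothetical terms, for illustration only).
TOY_CHILDREN = {
    "GO:0003700": ["TOY:child_a"],
    "TOY:child_a": ["TOY:grandchild_a"],
    "GO:0003712": ["TOY:child_b"],
}

# Toy (protein, GO term) annotations; evidence codes are ignored because
# the text accepts any type of evidence.
TOY_ANNOTATIONS = [
    ("protein_1", "TOY:grandchild_a"),
    ("protein_2", "GO:0003712"),
    ("protein_3", "GO:0140223"),
    ("protein_4", "TOY:unrelated"),
]

classes = classify(TOY_ANNOTATIONS, TOY_CHILDREN)
```

In this toy run, `protein_1` is assigned to the dbTF class through an indirect descendant of GO:0003700, while `protein_4` is assigned to no class. In practice the child links and annotations would come from the GO database rather than from hard-coded dictionaries.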
As the DBD and TFCat databases are currently inaccessible, we took these collections from TFCheckpoint 1.0. Using in-house Ruby scripts, we translated each list from its original identifiers into gene symbols, Entrez Gene IDs, Ensembl IDs and UniProt IDs, and merged the lists into one master table recording the presence or absence of each TF in each data source. Entries missing any type of identifier were removed. We then connected human, mouse and rat proteins through orthology information obtained from the OrthoDB database (v10.1). In brief, we used OrthoDB’s SPARQL endpoint to obtain all human, mouse and rat orthologous genes at the mammalian level present in OrthoDB. We translated all entries from their original identifiers (Entrez Gene IDs) into gene symbols, Ensembl IDs and UniProt IDs and merged them into one master table. Finally, we discarded entries for which none of the orthologs could be mapped to UniProtKB/Swiss-Prot.
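The merging and filtering steps described above can be sketched as follows. This is a Python illustration under stated assumptions, not the in-house Ruby scripts: all source names, identifiers and the helper names `build_master_table` and `keep_ortholog_groups` are hypothetical, the table is keyed by UniProt ID for convenience, and an entry is dropped when any one of the four identifier types is missing, which is one reading of the removal rule.

```python
ID_FIELDS = ("symbol", "entrez", "ensembl", "uniprot")


def build_master_table(source_lists, id_map):
    """Merge per-source TF lists into one presence/absence table.

    source_lists: {source name: iterable of original identifiers}
    id_map: {original identifier: dict holding the four ID_FIELDS}
    Returns {uniprot id: row}, where each row holds the identifiers
    plus a 0/1 flag per source.
    """
    sources = sorted(source_lists)
    table = {}
    for source in sources:
        for original_id in source_lists[source]:
            ids = id_map.get(original_id)
            # Assumption: drop entries lacking any one identifier type.
            if ids is None or any(not ids.get(field) for field in ID_FIELDS):
                continue
            default_row = {field: ids[field] for field in ID_FIELDS}
            default_row.update({name: 0 for name in sources})
            row = table.setdefault(ids["uniprot"], default_row)
            row[source] = 1
    return table


def keep_ortholog_groups(groups, swissprot_ids):
    """Keep ortholog groups in which at least one member maps to Swiss-Prot."""
    return [g for g in groups if any(member in swissprot_ids for member in g)]


# Toy input (hypothetical sources and identifiers).
source_lists = {"source_x": ["a1", "b1"], "source_y": ["a2", "c1"]}
gene_a = {"symbol": "GENE_A", "entrez": "1", "ensembl": "ENS_A", "uniprot": "UP_A"}
id_map = {
    "a1": gene_a,
    "a2": gene_a,  # same gene listed under a different original identifier
    "b1": {"symbol": "GENE_B", "entrez": "2", "ensembl": "", "uniprot": "UP_B"},
    "c1": {"symbol": "GENE_C", "entrez": "3", "ensembl": "ENS_C", "uniprot": "UP_C"},
}

master = build_master_table(source_lists, id_map)
kept = keep_ortholog_groups([["UP_A", "UP_A_MOUSE"], ["UP_X"]], {"UP_A"})
```

Here `GENE_B` is removed because its Ensembl ID is empty, `GENE_A` is flagged as present in both toy sources, and the ortholog group containing only `UP_X` is discarded because none of its members is in the toy Swiss-Prot set.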