Input Data¶
In this library, objects are referred to as variables, and claims are
statements asserting that a given variable takes a given value. Input data is
represented by a Dataset object, which is constructed by passing an
iterable of tuples of the form (source_label, var_label, value)
for each claim that is made.
For example, consider the following situation:
- Source 1 claims X = 4, Y = 7
- Source 2 claims Y = 7, Z = 8
- Source 3 claims X = 3, Z = 5
- Source 4 claims X = 3, Y = 6, Z = 8
This dataset can be constructed as follows.
from truthdiscovery import Dataset
tuples = [
("source 1", "x", 4),
("source 1", "y", 7),
("source 2", "y", 7),
("source 2", "z", 8),
("source 3", "x", 3),
("source 3", "z", 5),
("source 4", "x", 3),
("source 4", "y", 6),
("source 4", "z", 8)
]
mydata = Dataset(tuples)
Note that source labels, variable labels, and variable values can be any types, not just strings/numbers as in the above example (the only caveat is that they must be hashable types so they can be used as dictionary keys).
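To illustrate the hashability requirement, the sketch below groups claim tuples into nested dictionaries keyed by label and value. It uses plain Python only and is not the library's internal representation; the `group_claims` helper and the labels are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def group_claims(tuples):
    """Group (source_label, var_label, value) tuples by variable and value."""
    claims = defaultdict(lambda: defaultdict(set))
    for source, var, value in tuples:
        # Labels and values end up as dict keys or set members, so each one
        # must be hashable
        claims[var][value].add(source)
    return claims


# Mixed hashable types: strings, ints, tuples and Fractions all work
tuples = [
    ("source 1", ("sensor", 1), Fraction(1, 2)),
    (2, ("sensor", 1), Fraction(1, 2)),
    ("source 3", ("sensor", 1), "unknown"),
]
claims = group_claims(tuples)
print(sorted(claims[("sensor", 1)][Fraction(1, 2)], key=str))

# A list is not hashable, so it cannot be used as a value
try:
    group_claims([("source 1", "x", [1, 2])])
except TypeError as err:
    print("unhashable value:", err)
```

If a value is naturally a list, converting it to a tuple first is usually enough to make it usable.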
Data with numeric values only¶
In the above example all values are numeric, so the dataset can alternatively
be created as a MatrixDataset. This is done by giving a matrix where
rows correspond to sources, columns correspond to variables, and an entry at
position (i, j) is the value that source i claims for variable j
(the matrix may contain empty cells in cases where a source does not make a
claim about a variable).
In matrix form, the example is:
4 7 -
- 7 8
3 - 5
3 6 8
where "-" marks an empty cell and the columns correspond to X, Y and Z respectively. Note that the sources
and variables are not explicitly assigned labels (e.g. source 1, x) as
they are when using the Dataset constructor.
Matrices are represented using numpy's masked array type; the above example can be constructed as follows.
import numpy.ma as ma
from truthdiscovery import MatrixDataset
mydata = MatrixDataset(ma.masked_values([
[4, 7, 0],
[0, 7, 8],
[3, 0, 5],
[3, 6, 8]
], 0))
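One thing to keep in mind with `ma.masked_values(..., 0)` is that every 0 in the input is treated as an empty cell, so a genuine claim of 0 cannot be expressed that way. The sketch below uses plain numpy (no library classes) to build the same matrix with an explicit mask instead, then recovers the claims as tuples. Using row and column indices as labels is an assumption made for illustration.

```python
import numpy as np
import numpy.ma as ma

values = np.array([
    [4, 7, 0],
    [0, 7, 8],
    [3, 0, 5],
    [3, 6, 8],
])
# True marks an empty cell, i.e. the source makes no claim for that variable
mask = np.array([
    [False, False, True],
    [True, False, False],
    [False, True, False],
    [False, False, False],
])
matrix = ma.masked_array(values, mask=mask)

# Recover (source_index, variable_index, value) tuples from unmasked cells
claims = [
    (i, j, int(matrix[i, j]))
    for i, j in np.ndindex(*matrix.shape)
    if not matrix.mask[i, j]
]
print(matrix.count())  # number of non-empty cells
print(claims[:3])
```

With an explicit mask, the placeholder stored in a masked cell is irrelevant, so 0 remains available as a real value.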
CSV format¶
MatrixDataset objects can also be loaded from a file using the from_csv()
method. The above dataset in CSV format would be:
4,7,
,7,8
3,,5
3,6,8
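To make the layout concrete, the following sketch parses the CSV text above with the standard library, mapping empty cells to `None`. It is only an illustration of the format and not the implementation of from_csv(); the `parse_matrix_csv` helper is hypothetical.

```python
import csv
import io

csv_text = "4,7,\n,7,8\n3,,5\n3,6,8\n"


def parse_matrix_csv(fileobj):
    """Return one list per source, with None where a cell is empty."""
    rows = []
    for record in csv.reader(fileobj):
        rows.append([float(cell) if cell.strip() else None for cell in record])
    return rows


rows = parse_matrix_csv(io.StringIO(csv_text))
for row in rows:
    print(row)
```

Each row is one source and each column one variable, so a trailing comma (as in the first row) means the last variable has no claim from that source.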
Implications between claims¶
As well as considering sources and claimed variable values, some algorithms consider implications between claims. The idea is that if a given claim is considered believable, claims that it implies should be considered believable too. Currently TruthFinder [1] is the only algorithm implemented here that considers implications.
To be precise, the implication between claims var = x and var = y is a
value in [-1, 1] that describes how the confidence that var = x influences
the confidence of var = y. A positive value indicates that if var = x
is true, then var = y is likely to be true. A negative value means that if
var = x is true, then var = y is likely to be false [1].
The implication values are domain-specific and need to be given on a
per-dataset basis. They may be based on similarity, where the implication is
close to 1 when x and y are similar and close to -1 when they are
dissimilar. In general claim implications need not be symmetric (i.e. var=x
-> var=y can be different from var=y -> var=x).
In this library implication values can be optionally given by passing a
function for the implication_function argument to the constructor for
Dataset (or its sub-classes). This function should accept arguments
(var, val1, val2) and return a value in [-1, 1], or None to indicate no
implication.
import math
from truthdiscovery import Dataset
tuples = [
("source 1", "x", 4),
("source 1", "y", 7),
("source 2", "y", 7),
("source 2", "z", 8),
("source 3", "x", 3),
("source 3", "z", 5),
("source 4", "x", 3),
("source 4", "y", 6),
("source 4", "z", 8)
]
def imp(var, val1, val2):
    # Implication is close to 1 when val1, val2 are close, and goes to -1
    # when they are far apart.
    #
    # Note that this example does not consider the value of `var`. In
    # principle the calculation for implication can differ between
    # variables.
    return 2 * math.exp(-(val1 - val2)**2) - 1
mydata = Dataset(tuples, implication_function=imp)
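It can help to check how quickly such a function decays before using it. The sketch below evaluates the example function for a few differences and adds a hypothetical variant, `scaled_imp`, that uses a per-variable scale and returns None for variables without one. The `SCALES` table and its values are assumptions chosen purely for illustration.

```python
import math


def imp(var, val1, val2):
    # Same formula as the example above
    return 2 * math.exp(-(val1 - val2) ** 2) - 1


# Hypothetical per-variable scales: larger means differences matter less
SCALES = {"x": 1.0, "y": 5.0}


def scaled_imp(var, val1, val2):
    """Variable-aware variant; None signals no implication for this variable."""
    scale = SCALES.get(var)
    if scale is None:
        return None
    return 2 * math.exp(-(((val1 - val2) / scale) ** 2)) - 1


for diff in (0, 1, 2):
    print(diff, round(imp("x", 4, 4 + diff), 3))

print(scaled_imp("y", 7, 6))
print(scaled_imp("z", 5, 8))
```

With the unscaled formula, values that differ by just 1 already receive a negative implication, so for data on a larger numeric scale some form of rescaling is likely to be needed.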
Datasets with known true values¶
An easy way to evaluate the performance of a truth discovery algorithm is to run it on a dataset for which the true values of some of the variables are already known. A measure of the accuracy of the algorithm can then be computed by counting the variables for which the algorithm predicted the correct value (i.e. the most believed value for the variable was the correct one).
To this end, the SupervisedData class stores a Dataset along with
known true variable values as a dictionary in the form
{var_label: true_value, ...}. For example:
from truthdiscovery import SupervisedData
supervised = SupervisedData(mydata, {"x": 4, "y": 5})
# run an algorithm and compute accuracy; myalg stands for any algorithm
# instance, e.g. MajorityVoting()
results = myalg.run(supervised.data)
accuracy = supervised.get_accuracy(results)
See get_accuracy() for a description of how the accuracy calculation is performed.
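As a rough sketch of the idea described in this section, the function below computes the fraction of variables whose most believed value matches the known true value. It is written independently of the library: the handling of ties and of variables without claims is an assumption and may differ from get_accuracy(), and the belief scores are hand-written for illustration.

```python
def accuracy(belief, true_values):
    """
    Fraction of variables with a known true value whose single most believed
    value is the true one. Ties for the top belief count as incorrect here.
    """
    correct = 0
    total = 0
    for var, true_value in true_values.items():
        scores = belief.get(var)
        if not scores:
            continue  # no claims for this variable, so it is skipped
        total += 1
        top = max(scores.values())
        winners = [val for val, score in scores.items() if score == top]
        if winners == [true_value]:
            correct += 1
    return correct / total if total else None


# Hand-written belief scores in the form {var: {value: belief}}
belief = {
    "x": {4: 0.5, 3: 1.0},
    "y": {7: 1.0, 6: 0.5},
    "z": {5: 0.5, 8: 1.0},
}
print(accuracy(belief, {"x": 4, "y": 7}))
```

Here the most believed value for x is 3 rather than the true value 4, while y is predicted correctly, so one of the two known variables is counted as correct.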
Supervised data can also be loaded from a matrix in a CSV file. The format is the same as for unsupervised matrix data (see above), but the first row contains the true values.
Synthetic data¶
It is also possible to generate synthetic datasets, where sources, variables and claims are generated randomly according to some given parameters. This provides an easy way to test algorithms on datasets of different sizes, with different distributions for trust among sources, and to test accuracy without collecting real-world data. For example:
import numpy as np
from truthdiscovery import SyntheticData
synth = SyntheticData(
trust=np.random.uniform(size=(4,)),
num_variables=10,
claim_probability=0.5,
domain_size=4
)
See the SyntheticData constructor for an explanation of the available
parameters. The above example creates a dataset with 4 sources (each with trust
value drawn from a uniform distribution on [0, 1]) and 10 variables with values
in {0, 1, 2, 3}, where a source claims a value for roughly half of the
variables.
SyntheticData is a sub-class of SupervisedData (the ‘true’ value
of each variable is generated randomly before source claims are generated), so
accuracy calculations can be performed with synthetic data as shown in the
previous section.
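To give an intuition for what such a generator does, here is a minimal standard-library sketch. The generative model is an assumption: each source claims a variable with probability `claim_probability`, and a claim is correct with probability equal to the source's trust, otherwise a wrong value is picked uniformly. SyntheticData may use a different model, and `generate_claims` is a hypothetical helper.

```python
import random


def generate_claims(trust, num_variables, claim_probability, domain_size, seed=0):
    """Return (true_values, claims), with claims as (source, var, value)."""
    rng = random.Random(seed)
    # True values are drawn first, before any source makes a claim
    true_values = [rng.randrange(domain_size) for _ in range(num_variables)]
    claims = []
    for source, source_trust in enumerate(trust):
        for var, truth in enumerate(true_values):
            if rng.random() >= claim_probability:
                continue  # this source makes no claim about this variable
            if rng.random() < source_trust:
                value = truth
            else:
                # Pick uniformly among the incorrect values (needs domain_size >= 2)
                value = rng.choice([v for v in range(domain_size) if v != truth])
            claims.append((source, var, value))
    return true_values, claims


true_values, claims = generate_claims([0.9, 0.6, 0.3, 0.1], 10, 0.5, 4)
print(len(claims), "claims generated for", len(true_values), "variables")
```

Fixing the seed makes runs repeatable, which is convenient when comparing algorithms on the same generated data.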
Synthetic data can be exported to CSV (the same format that can be loaded by
from_csv() for supervised data) with the to_csv() method.
Custom dataset formats¶
In a real-world application of truth discovery, data will most likely be loaded from a file in a bespoke format. The most suitable format for storing datasets in files may be domain-specific, or the format may be already fixed if applying truth discovery to existing datasets.
For these reasons, this library does not attempt to provide a standard format for loading files from disk (except for the CSV format for matrix datasets described above, which is of limited use in real-world data scenarios where variable values are not always numeric).
Instead, there are two helper classes FileDataset and
FileSupervisedData that allow the user to specify only the
format-specific details, and abstract away other details.
For example, suppose mydata.txt contains:
source 1: x=4, y=7
source 2: y=7, x=8
source 3: x=3, z=5
source 4: x=3, y=6, z=8
To load this file we can create a sub-class of FileDataset and implement
the get_tuples() method:
# FileDataset is assumed to be importable from the top-level package, like Dataset
from truthdiscovery import FileDataset

class DemoFileDataset(FileDataset):
    def get_tuples(self, fileobj):
        """
        Read each line of the file, and extract source label and claims (note
        that no error checking is performed, since this is just a demo)
        """
        for line in fileobj:
            line = line.strip()
            source, claims = line.split(": ")
            for claim in claims.split(", "):
                var, value = claim.split("=")
                yield (source, var, value)
get_tuples() simply yields data tuples of the form required for the Dataset
constructor. To load the file we simply pass the file path to the constructor:
>>> mydata = DemoFileDataset("mydata.txt")
>>> mydata.num_sources
4
>>> mydata.num_variables
3
>>> from truthdiscovery import MajorityVoting
>>> results = MajorityVoting().run(mydata)
>>> results.trust
{'source 1': 1, 'source 2': 1, 'source 3': 1, 'source 4': 1}
>>> results.belief
{'x': {'4': 0.5, '8': 0.5, '3': 1.0}, 'y': {'7': 1.0, '6': 0.5}, 'z': {'5': 0.5, '8': 0.5}}
>>>
The results of majority voting show that the data was loaded as expected.
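Note that the demo parser yields values as strings, which is why the belief keys above are '4', '3' and so on. For real data some validation and type conversion is usually worthwhile. The following standalone sketch parses the same line format with basic error checking and converts values to integers; `parse_claims` is a hypothetical helper, and its body could be adapted into a get_tuples() implementation.

```python
import io


def parse_claims(fileobj):
    """Yield (source, var, value) tuples from lines like 'source 1: x=4, y=7'."""
    for lineno, line in enumerate(fileobj, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, sep, claims = line.partition(": ")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'source: claims'")
        for claim in claims.split(", "):
            var, sep, value = claim.partition("=")
            if not sep:
                raise ValueError(f"line {lineno}: malformed claim {claim!r}")
            # int() raises ValueError itself if the value is not an integer
            yield (source, var.strip(), int(value))


text = "source 1: x=4, y=7\nsource 3: x=3, z=5\n"
print(list(parse_claims(io.StringIO(text))))
```

Converting values to numbers also matters if an implication function that does arithmetic on values is supplied, since it would fail on strings.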
Loading supervised data from a file is similar: we may create a sub-class of
FileSupervisedData and implement get_pairs(), which yields pairs
(var, true_value). An object is then constructed with:
mysup = DemoSupervisedFileData(dataset, "true_values.txt")
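The layout of true_values.txt is not specified here. As a sketch only, assume one `var=value` entry per line, with `#` starting a comment line. A parser for that hypothetical layout, which could form the body of a get_pairs() implementation, might look like this:

```python
import io


def parse_true_values(fileobj):
    """Yield (var, true_value) pairs from lines of the form 'x=4'."""
    for lineno, line in enumerate(fileobj, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        var, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'var=value'")
        # Values stay as strings so they compare equal to string-valued claims
        yield (var.strip(), value.strip())


pairs = list(parse_true_values(io.StringIO("# known values\nx=3\ny=7\n")))
print(pairs)
```

Whatever the format, true values should have the same type as the values yielded for the claims; otherwise a correct prediction such as '3' would never compare equal to a true value of 3.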
For another example, see stock_dataset.py in the examples directory in
the repository.
References¶
[1] X. Yin, J. Han and P. S. Yu, Truth Discovery with Multiple Conflicting Information Providers on the Web.