.. _input-page:

Input Data
==========

In this library, objects are referred to as *variables*, and *claims* are
statements asserting that a given variable takes a given value. Input data is
represented by a :any:`Dataset` object, which is constructed by passing an
iterable of tuples of the form ``(source_label, var_label, value)``
for each claim that is made.

For example, consider the following situation:

- Source 1 claims X = 4, Y = 7
- Source 2 claims Y = 7, Z = 8
- Source 3 claims X = 3, Z = 5
- Source 4 claims X = 3, Y = 6, Z = 8

This dataset can be constructed as follows. ::

    from truthdiscovery import Dataset

    tuples = [
        ("source 1", "x", 4),
        ("source 1", "y", 7),
        ("source 2", "y", 7),
        ("source 2", "z", 8),
        ("source 3", "x", 3),
        ("source 3", "z", 5),
        ("source 4", "x", 3),
        ("source 4", "y", 6),
        ("source 4", "z", 8)
    ]
    mydata = Dataset(tuples)

Note that source labels, variable labels, and variable values can be of any
type, not just strings and numbers as in the above example. The only caveat is
that they must be
`hashable <https://docs.python.org/3/glossary.html#term-hashable>`_ so that
they can be used as dictionary keys.
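
The hashability requirement can be illustrated without the library. The sketch
below (plain Python; ``index_claims`` and the sensor labels are hypothetical
names invented for this illustration) groups claims in a dictionary keyed by
``(var_label, value)``, which works for tuples and frozensets but fails for a
list::

    from collections import defaultdict

    def index_claims(tuples):
        """Map each (variable, value) claim to the set of sources making it."""
        claims = defaultdict(set)
        for source, var, value in tuples:
            # (var, value) is used as a dictionary key, so both must be hashable
            claims[(var, value)].add(source)
        return dict(claims)

    # Tuples, frozensets and strings are hashable, so they are valid here
    ok_tuples = [
        (("sensor", 1), "position", (51.5, -0.1)),
        (("sensor", 2), "position", (51.5, -0.1)),
        (("sensor", 2), "tags", frozenset({"a", "b"})),
    ]
    index = index_claims(ok_tuples)

    # A list is mutable and therefore unhashable, so it is rejected
    try:
        index_claims([("source 1", "x", [1, 2])])
        list_accepted = True
    except TypeError:
        list_accepted = False

Converting mutable values to an immutable equivalent (for example a list to a
tuple) before building the tuples avoids the ``TypeError``.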

..

Data with numeric values only
-----------------------------

In the above example all values are numeric, so the dataset can alternatively
be created as a :any:`MatrixDataset`. This is done by giving a matrix where
rows correspond to sources, columns correspond to variables, and an entry at
position ``(i, j)`` is the value that source ``i`` claims for variable ``j``
(the matrix may contain empty cells in cases where a source does not make a
claim about a variable).

In matrix form, the example is:

.. math::
   \begin{bmatrix}
   4 & 7 & - \\
   - & 7 & 8 \\
   3 & - & 5 \\
   3 & 6 & 8 \\
   \end{bmatrix}

where the columns correspond to X, Y and Z respectively. Note that the sources
and variables are not explicitly assigned labels (e.g. ``source 1``, ``x``) as
they are when using the :any:`Dataset` constructor.

Matrices are represented using numpy's *masked array* type; the above example
can be constructed as follows (``0`` is used as a placeholder for empty cells
and is then masked out, which is safe here because no source claims the value
0). ::

   import numpy.ma as ma
   from truthdiscovery import MatrixDataset

   mydata = MatrixDataset(ma.masked_values([
       [4, 7, 0],
       [0, 7, 8],
       [3, 0, 5],
       [3, 6, 8]
   ], 0))
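
To make the row and column correspondence concrete, the following sketch
(plain Python, independent of the library) arranges the claims of the running
example into a matrix, using ``None`` for empty cells. The helper
``to_matrix`` is a hypothetical name, and ordering sources and variables
alphabetically is an assumption made only for this illustration::

    claims = [
        ("source 1", "x", 4), ("source 1", "y", 7),
        ("source 2", "y", 7), ("source 2", "z", 8),
        ("source 3", "x", 3), ("source 3", "z", 5),
        ("source 4", "x", 3), ("source 4", "y", 6), ("source 4", "z", 8),
    ]

    def to_matrix(tuples):
        """Return (sources, variables, rows), with None marking empty cells."""
        sources = sorted({s for s, _, _ in tuples})
        variables = sorted({v for _, v, _ in tuples})
        rows = [[None] * len(variables) for _ in sources]
        for source, var, value in tuples:
            # row index = position of the source, column = position of the variable
            rows[sources.index(source)][variables.index(var)] = value
        return sources, variables, rows

    sources, variables, rows = to_matrix(claims)
    for row in rows:
        print(row)
    # prints:
    # [4, 7, None]
    # [None, 7, 8]
    # [3, None, 5]
    # [3, 6, 8]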

.. _csv-format:

CSV format
~~~~~~~~~~

:any:`MatrixDataset` objects can also be loaded from a file using the
:meth:`~truthdiscovery.input.matrix_dataset.MatrixDataset.from_csv` method. The
above dataset in CSV format would be::

    4,7,
    ,7,8
    3,,5
    3,6,8
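
As a rough sketch of how such a file maps onto a matrix, the snippet below
parses the CSV text above with Python's standard ``csv`` module and turns
empty cells into ``None``. It is an illustration only and does not reflect how
``from_csv`` is implemented; ``parse_matrix_csv`` is a hypothetical helper::

    import csv
    import io

    csv_text = "4,7,\n,7,8\n3,,5\n3,6,8\n"

    def parse_matrix_csv(fileobj):
        """Parse rows of numbers; an empty cell becomes None (no claim)."""
        return [
            [float(cell) if cell.strip() else None for cell in row]
            for row in csv.reader(fileobj)
        ]

    # io.StringIO stands in for an open file object
    matrix = parse_matrix_csv(io.StringIO(csv_text))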

Implications between claims
---------------------------

As well as considering sources and claimed variable values, some algorithms
consider *implications between claims*. The idea is that if a given claim is
considered believable, claims that it implies should be considered believable
too. Currently *TruthFinder* [1]_ is the only algorithm implemented here that
considers implications.

To be precise, the implication between claims ``var = x`` and ``var = y`` is a
value in [-1, 1] that describes how the confidence that ``var = x`` influences
the confidence of ``var = y``.  A positive value indicates that if ``var = x``
is true, then ``var = y`` is likely to be true. A negative value means that if
``var = x`` is true, then ``var = y`` is likely to be false [1]_.

The implication values are domain-specific and need to be given on a
per-dataset basis. They may be based on *similarity*, where the implication is
close to 1 when ``x`` and ``y`` are similar and close to -1 when they are
dissimilar. In general claim implications need not be symmetric (i.e. ``var=x
-> var=y`` can be different from ``var=y -> var=x``).

In this library, implication values can optionally be supplied by passing a
function as the ``implication_function`` argument to the constructor for
:any:`Dataset` (or its sub-classes). This function should accept arguments
``(var, val1, val2)`` and return a value in [-1, 1], or ``None`` to indicate
no implication.  ::

    import math
    from truthdiscovery import Dataset

    tuples = [
        ("source 1", "x", 4),
        ("source 1", "y", 7),
        ("source 2", "y", 7),
        ("source 2", "z", 8),
        ("source 3", "x", 3),
        ("source 3", "z", 5),
        ("source 4", "x", 3),
        ("source 4", "y", 6),
        ("source 4", "z", 8)
    ]
    def imp(var, val1, val2):
        # Implication is close to 1 when val1, val2 are close, and goes to -1
        # when they are far apart.
        #
        # Note that this example does not consider the value of `var`. In
        # principle the calculation for implication can differ between
        # variables.
        return 2 * math.exp(-(val1 - val2)**2) - 1

    mydata = Dataset(tuples, implication_function=imp)
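
To get a feel for the scale of this particular function, it can be evaluated
on a few pairs of values from the example. The snippet repeats the definition
of ``imp`` so that it runs on its own::

    import math

    def imp(var, val1, val2):
        return 2 * math.exp(-(val1 - val2) ** 2) - 1

    same = imp("x", 3, 3)   # identical values
    near = imp("x", 4, 3)   # values differ by 1
    far = imp("z", 5, 8)    # values differ by 3

    print(round(same, 3), round(near, 3), round(far, 3))
    # prints: 1.0 -0.264 -1.0

Values that differ by only 1 already receive a negative implication here, so a
real application would probably rescale the difference to suit the range of
its data. This particular function is also symmetric in ``val1`` and ``val2``,
although implication functions need not be.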

Datasets with known true values
-------------------------------

An easy way to evaluate the performance of a truth discovery algorithm is to
run it on a dataset for which the true values of some of the variables are
already known. A measure of the *accuracy* of the algorithm can then be
computed by counting the variables for which the algorithm predicted the
correct value (i.e. the most believed value for the variable was the correct
one).

To this end, the :any:`SupervisedData` class stores a :any:`Dataset` along with
known true variable values as a dictionary in the form
``{var_label: true_value, ...}``. For example: ::

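    from truthdiscovery import MajorityVoting, SupervisedData

    supervised = SupervisedData(mydata, {"x": 4, "y": 5})

    # run an algorithm (majority voting here, but any algorithm works) and
    # compute its accuracy against the known true values
    myalg = MajorityVoting()
    results = myalg.run(supervised.data)
    accuracy = supervised.get_accuracy(results)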

See :meth:`~truthdiscovery.input.supervised_data.SupervisedData.get_accuracy`
for a description of how the accuracy calculation is performed.
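
The idea behind the accuracy measure can be sketched in a few lines of plain
Python. This is a simplified illustration, not the library's implementation:
``simple_accuracy`` is a hypothetical helper, the belief scores are made up,
and details such as the handling of ties or of variables without claims are
assumptions that may differ from ``get_accuracy``::

    def simple_accuracy(belief, true_values):
        """Fraction of known variables whose top-belief value is the true one."""
        correct = 0
        counted = 0
        for var, true_value in true_values.items():
            if var not in belief:
                continue  # assumption: variables without claims are skipped
            scores = belief[var]
            # the predicted value is the one with the highest belief score
            predicted = max(scores, key=scores.get)
            counted += 1
            if predicted == true_value:
                correct += 1
        return correct / counted if counted else None

    # made-up belief scores in the form {var: {value: belief}}
    belief = {
        "x": {4: 0.5, 3: 1.0},
        "y": {7: 1.0, 6: 0.5},
        "z": {5: 0.5, 8: 1.0},
    }
    accuracy = simple_accuracy(belief, {"x": 4, "y": 7})
    # "x" is predicted as 3 (wrong) and "y" as 7 (right), so accuracy is 0.5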

Supervised data can also be loaded from a matrix in a CSV file using
:meth:`~truthdiscovery.input.supervised_data.SupervisedData.from_csv`. The
format is the same as for unsupervised matrix data (see :ref:`csv-format`),
but the first row contains the true values.

Synthetic data
--------------

It is also possible to generate *synthetic datasets*, where sources, variables
and claims are generated randomly according to some given parameters. This
provides an easy way to test algorithms on datasets of different sizes, with
different distributions for trust among sources, and to test accuracy without
collecting real-world data. For example: ::

    import numpy as np
    from truthdiscovery import SyntheticData

    synth = SyntheticData(
        trust=np.random.uniform(size=(4,)),
        num_variables=10,
        claim_probability=0.5,
        domain_size=4
    )

See the :any:`SyntheticData` constructor for an explanation of the available
parameters. The above example creates a dataset with 4 sources (each with trust
value drawn from a uniform distribution on [0, 1]) and 10 variables with values
in ``{0, 1, 2, 3}``, where a source claims a value for roughly half of the
variables.

:any:`SyntheticData` is a sub-class of :any:`SupervisedData` (the 'true' value
of each variable is generated randomly before source claims are generated), so
accuracy calculations can be performed with synthetic data as shown in the
previous section.
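
The following sketch shows one way such a generator could work. It is not the
library's implementation: the function name ``generate_claims`` is
hypothetical, and interpreting a source's trust value as the probability that
each of its claims is correct is an assumption made for illustration (the
:any:`SyntheticData` documentation defines the actual model)::

    import random

    def generate_claims(trust, num_variables, claim_probability, domain_size,
                        seed=None):
        """Return (true_values, tuples); assumes domain_size >= 2."""
        rng = random.Random(seed)
        domain = range(domain_size)
        # true values are drawn first, before any source makes a claim
        true_values = {var: rng.choice(domain) for var in range(num_variables)}
        tuples = []
        for source, source_trust in enumerate(trust):
            for var, true_value in true_values.items():
                if rng.random() >= claim_probability:
                    continue  # this source makes no claim for this variable
                if rng.random() < source_trust:
                    value = true_value
                else:
                    # pick uniformly among the incorrect values
                    value = rng.choice([v for v in domain if v != true_value])
                tuples.append((source, var, value))
        return true_values, tuples

    true_values, tuples = generate_claims(
        trust=[0.9, 0.6, 0.3, 0.5], num_variables=10,
        claim_probability=0.5, domain_size=4, seed=0,
    )

The resulting tuples have the ``(source_label, var_label, value)`` form
described at the top of this page.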

Synthetic data can be exported to CSV (the same format that can be loaded by
:meth:`~truthdiscovery.input.supervised_data.SupervisedData.from_csv` for
supervised data) with the
:meth:`~truthdiscovery.input.synthetic_data.SyntheticData.to_csv` method.

Custom dataset formats
----------------------

In a real-world application of truth discovery, data will most likely be loaded
from a file in a bespoke format. The most suitable format for storing datasets
in files may be domain-specific, or the format may be already fixed if applying
truth discovery to existing datasets.

For these reasons, this library does not attempt to provide a standard format
for loading files from disk (except for the CSV format for matrix datasets
described above, which is of limited use in real-world data scenarios where
variable values are not always numeric).

Instead, there are two helper classes :any:`FileDataset` and
:any:`FileSupervisedData` that allow the user to specify only the
format-specific details, and abstract away other details.

For example, suppose ``mydata.txt`` contains::

    source 1: x=4, y=7
    source 2: y=7, x=8
    source 3: x=3, z=5
    source 4: x=3, y=6, z=8

To load this file we can create a sub-class of :any:`FileDataset` and implement
the :meth:`~truthdiscovery.input.file_helpers.FileDataset.get_tuples` method::

    from truthdiscovery.input.file_helpers import FileDataset

    class DemoFileDataset(FileDataset):
        def get_tuples(self, fileobj):
            """
            Read each line of the file and extract the source label and its
            claims (no error checking is performed, since this is just a demo)
            """
            for line in fileobj:
                line = line.strip()
                source, claims = line.split(": ")
                for claim in claims.split(", "):
                    var, value = claim.split("=")
                    # values are yielded as strings, exactly as read
                    yield (source, var, value)

:meth:`~truthdiscovery.input.file_helpers.FileDataset.get_tuples` yields data
tuples of the form required by the :any:`Dataset` constructor. To load the
file, pass the file path to the constructor::

    >>> mydata = DemoFileDataset("mydata.txt")
    >>> mydata.num_sources
    4
    >>> mydata.num_variables
    3
    >>> from truthdiscovery import MajorityVoting
    >>> results = MajorityVoting().run(mydata)
    >>> results.trust
    {'source 1': 1, 'source 2': 1, 'source 3': 1, 'source 4': 1}
    >>> results.belief
    {'x': {'4': 0.5, '8': 0.5, '3': 1.0}, 'y': {'7': 1.0, '6': 0.5}, 'z': {'5': 0.5, '8': 0.5}}
    >>>

The results of :ref:`majority-voting` show that the data was loaded as
expected.
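
The parsing logic can be tried out without the library. The sketch below
applies the same line-splitting steps to the sample text held in memory
(``parse_claims`` is a hypothetical stand-alone function) and counts how many
sources back each claim::

    import io
    from collections import Counter

    sample = (
        "source 1: x=4, y=7\n"
        "source 2: y=7, x=8\n"
        "source 3: x=3, z=5\n"
        "source 4: x=3, y=6, z=8\n"
    )

    def parse_claims(fileobj):
        """Yield (source, var, value) tuples, skipping blank lines."""
        for line in fileobj:
            line = line.strip()
            if not line:
                continue
            source, claims = line.split(": ")
            for claim in claims.split(", "):
                var, value = claim.split("=")
                yield (source, var, value)

    tuples = list(parse_claims(io.StringIO(sample)))
    # number of sources supporting each (variable, value) claim
    votes = Counter((var, value) for _, var, value in tuples)

Note that the values are strings, which is why the keys in ``results.belief``
above are ``'4'``, ``'8'`` and so on. Convert them inside ``get_tuples`` (e.g.
with ``int(value)``) if numeric values are needed.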

Loading supervised data from a file is similar: we may create a sub-class of
:any:`FileSupervisedData` and implement
:meth:`~truthdiscovery.input.file_helpers.FileSupervisedData.get_pairs`, which
yields pairs ``(var, true_value)``. An object is then constructed with::

    mysup = DemoSupervisedFileData(dataset, "true_values.txt")
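
The body of ``get_pairs`` might look like the generator below. The file layout
(one ``var=value`` pair per line, with ``#`` starting a comment) and the name
``parse_true_values`` are assumptions for this sketch, not a format defined by
the library::

    import io

    def parse_true_values(fileobj):
        """Yield (var, true_value) pairs from lines of the form var=value."""
        for line in fileobj:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            var, true_value = line.split("=", 1)
            yield (var.strip(), true_value.strip())

    # io.StringIO stands in for the opened true-values file
    pairs = list(parse_true_values(io.StringIO("# known values\nx=3\ny=7\n")))

As with ``get_tuples``, the values are yielded as strings, so they should be
converted in the same way as the claimed values to make comparisons meaningful.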

For another example, see ``stock_dataset.py`` in the ``examples`` directory in
the repository.

References
----------

.. [1] X. Yin, J. Han and P. S. Yu, `Truth Discovery with Multiple Conflicting
   Information Providers on the Web
   <http://ieeexplore.ieee.org/document/4415269/>`_.