Notes on keeping (statistical) indices
From Livetrix
(See also background querying and fetching)
Avoiding duplicates
The what-if idea here is that if we keep track of the documents we see, we can identify them when we see them again.
This would be useful to avoid recording things twice in training, and also to mark duplicates in merged results. Both problems are likely to happen if a document is recent and a source and/or subject is popular.
It could potentially even be used for features such as alerting users when new articles are available from a particular journal - although that can also be done with regular searching and separate tables, or external to this setup. A rating and review option would also be helped by this, since you cannot show these details in the results unless you're fairly sure when to add them.
Abstracting documents?
Trying to abstract out a document is not easy - that is, being able to match any record from any source to a particular record in a list of things not in any way tied to source-specific identification.
Ideally the records we get are either books or articles in serials. Books can be handled by ISBN alone, forgetting edition problems for a second. Serials would probably be handled by imitating what we humans might look for - a serial (ISSN) in a year, an article title, and if that's not clear enough, a main author. Most of the room for error is in what differs per source - title augmentation and, in electronic-library practice, also author formatting. Other problems include things with very similar titles, like the records of the 32nd and 33rd conference of something or other.
These are not the only problems. Not all records have this information; some provide mostly description and little to no identification. A few sources, like Elsevier's Science Direct (something similar goes for SCOPUS), do this consistently: they never provide more than title, author and the occasional journal title, and no ISSN or other citation. This seems to be an "our site or bust" move, even though OpenURL resolvers like SFX can work with just the journal title too, so full-text lookups via SFX still work. Even so, this sort of variation makes almost any smart setup less than robust.
Possible means of identification, record comparison
Things like ISBNs, LCCNs and OCLC numbers are large-scale universal identifiers of books, items+concepts, and items+concepts, respectively.
Generally, the client-side identification of a target plus the target-side local identification gives a good means of record identification within a target, but obviously not across targets.
Record comparison has to work differently, as it will likely have to be a little fuzzy.
In the absence of identifiers, record comparison should probably be executed based primarily on:
- Year
- ISBN
- ISSN
- Author(s) (main)
- Title (not always useful, e.g. with serial congress titles)
That is:
- For books, the same ISBN and very similar author gives good confidence that something is the same book.
- For articles, ISSN, volume, issue, and very similar title gives good confidence that something is the same article.
But note:
- Care should be taken when comparing an empty field with a nonempty one since sometimes this is significant, and sometimes it is not.
- ISBN may also appear in reviews of a book and so only suggests strong similarity, but not necessarily equivalence of records. In record comparison, identical ISBN and dissimilar authors should be taken as non-identical records.
- ISSNs are effectively identifiers on collections of articles, but one logical publication may easily have two ISSNs (one for electronic and one for print). This is not interesting for record comparison (or even identification, depending on your model), so ISSN and therefore article equivalence needs to compare ISSNs using a database instead of a direct string comparison.
- Year is fairly obvious: inequality lessens similarity. (A one-year difference may sometimes happen for the same item for publications/releases around New Year's, but you can generally assume this doesn't happen.)
- Author names need some help. That is, name normalization makes it easier to use just the last name, which should show less variation than the use of initials (quite a few citations choose to omit middle-name initials).
- Title is probably the hardest. You would have to consider things like longest substring match, and test whether what is left is likely to be a significant difference (a month, year, or issue number) which would indicate things like congresses/meetings in different years.
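The comparison rules above could be sketched roughly as follows. This is a minimal illustration, not an implementation from this project: the record layout (plain dicts), every function name, the 0.9 threshold, and the contents of the ISSN table are assumptions made up for the example. The title check only looks for a differing number (a year, issue or "32nd" versus "33rd"); months are not handled.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup table: every ISSN of a publication maps to one canonical
# ISSN (e.g. print and electronic editions). The values here are made up.
ISSN_CANONICAL = {
    "1111-1111": "1111-1111",  # print edition
    "2222-2222": "1111-1111",  # electronic edition of the same serial
}

def canonical_issn(issn):
    """Map an ISSN to its publication-level form; unknown ISSNs map to themselves."""
    issn = issn.strip().upper()
    return ISSN_CANONICAL.get(issn, issn)

def last_name(author):
    """Reduce 'Smith, J. A.' or 'J. A. Smith' to 'smith' (naive heuristic)."""
    author = author.strip().lower()
    if "," in author:
        return author.split(",")[0].strip()
    parts = author.split()
    return parts[-1] if parts else ""

def normalize_title(title):
    """Lowercase and drop punctuation so only word tokens are compared."""
    return " ".join(re.findall(r"\w+", title.lower()))

def titles_match(a, b, threshold=0.9):
    """Fuzzy title match that rejects titles whose difference contains a number."""
    a, b = normalize_title(a), normalize_title(b)
    if a == b:
        return True
    tokens_a, tokens_b = a.split(), b.split()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, tokens_a, tokens_b).get_opcodes():
        if tag == "equal":
            continue
        leftover = tokens_a[i1:i2] + tokens_b[j1:j2]
        # A differing token with a digit suggests another year, issue or meeting.
        if any(ch.isdigit() for token in leftover for ch in token):
            return False
    return SequenceMatcher(None, a, b).ratio() >= threshold

def same_book(a, b):
    """Same ISBN and same main-author last name; a missing ISBN never matches."""
    isbn_a = a.get("isbn", "").replace("-", "")
    isbn_b = b.get("isbn", "").replace("-", "")
    if not isbn_a or isbn_a != isbn_b:
        return False
    return last_name(a.get("author", "")) == last_name(b.get("author", ""))

def same_article(a, b):
    """Same serial (via the ISSN table), year, volume, issue, and a similar title."""
    if not a.get("issn") or not b.get("issn"):
        return False
    if canonical_issn(a["issn"]) != canonical_issn(b["issn"]):
        return False
    for field in ("year", "volume", "issue"):
        if a.get(field) != b.get(field):
            return False
    return titles_match(a.get("title", ""), b.get("title", ""))
```

Treating an empty ISBN or ISSN as "no match" is one possible answer to the empty-versus-nonempty question; a real setup may want to decide this per field.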
...preferably not
Given this variation, another, simpler but more solid option is to guarantee uniqueness only within the same source - much simpler, since records should stay unchanged, at least in their major fields. It would avoid double training only on data re-requested from the same source, which will have to be good enough.
This is a much more tractable solution, as trying to identify documents across sources also adds problem cases such as finding rather similar minimal records for which most humans couldn't say for sure whether they match, or finding a much more or much less detailed one that may or may not be the same thing. Would you ignore things without ISSNs? That may categorically ignore chunks of sources, many resource types, and some sources entirely. When you do it per source, the logic just becomes simpler.
The logic
That leaves logic to decide whether something needs to be added, ignored (or possibly updated).
The simplest alternative is setting a unique constraint on a column tuple, probably (db,IS[SB]N,year,title,author), and seeing how much that catches.
If you also want to allow for minor changes in some fields, you need a point at which to be forgiving, and a way to handle the difference between ISSN and ISBN. Actually, it may well be worth it to eliminate these duplicates after the fact with a script that runs only every now and then, so that this can be tweaked as necessary, where 'as necessary' is 'not much'; metadata should not change much unless the entire database underlying a source is reorganized or changed.
The basic pseudocode would be something like:
    if isbn in record:
        if isbn in db:
            ignore/update
        else:
            add record
    else if issn in record:
        if nothing in db with same yr and issn:
            add record
        else:
            get list of things that may match (based on same yr, issn)
            if something matches close enough:
                ignore/update
            else:
                add record
    else:
        get list of things that may match (author, title)
        if something matches close enough:
            ignore/update
        else:
            add record
Of course, how 'matches close enough' is decided is still the whole point. While year, ISSN and ISBN not being exactly equal are sure signs of their respective records being different, other fields like main author, title and other authors are not, for reasons already mentioned. They still are signs, but in lesser, different, context- and semantics-dependent ways; this is fuzzy information, which is why it is probably not ideal to build this into the basic background fetcher.
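For illustration, the add/ignore decision could look something like the following in Python. This is a sketch under assumptions, not the fetcher's code: records are plain dicts, the store is an in-memory list, and `close_enough` is a deliberately crude placeholder (a `difflib` ratio over author plus title with an arbitrary 0.9 threshold) standing in for whatever fuzzy test is eventually chosen.

```python
from difflib import SequenceMatcher

def close_enough(a, b, threshold=0.9):
    """Placeholder fuzzy test: compares lowercased 'author title' strings."""
    key_a = f"{a.get('author', '')} {a.get('title', '')}".lower()
    key_b = f"{b.get('author', '')} {b.get('title', '')}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold

def decide(record, db):
    """Return 'add' or 'ignore' for record; appends it to db (a list) when added."""
    if record.get("isbn"):
        # Books: an identical ISBN already in the store counts as a duplicate.
        duplicate = any(r.get("isbn") == record["isbn"] for r in db)
    elif record.get("issn"):
        # Articles: narrow down by serial and year first, then compare fuzzily.
        candidates = [
            r for r in db
            if r.get("issn") == record["issn"] and r.get("year") == record.get("year")
        ]
        duplicate = any(close_enough(record, r) for r in candidates)
    else:
        # No identifier at all: fall back to author/title similarity only.
        # A real store would pre-select candidates instead of scanning everything.
        duplicate = any(close_enough(record, r) for r in db)
    if duplicate:
        return "ignore"
    db.append(record)
    return "add"
```

Keeping `close_enough` as a separate function means the threshold and the fields it looks at can be tweaked without touching the branching.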
This is not implemented. Right now, there is an SQL-based uniqueness constraint on target, year, serial number and the first so-many characters of the title. The only real effect this has is detecting duplicates from the same source. That works fairly well, although the index this requires is fairly huge, on the same order of magnitude as the data.
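A constraint of this kind might look as follows. The sketch uses SQLite through Python's `sqlite3` module purely for illustration; the actual database, the table and column names, and the prefix length of 40 are all assumptions, since the text only says "the first so-many characters". It relies on SQLite's support for expressions in indexes (SQLite 3.9 or later).

```python
import sqlite3

TITLE_PREFIX_LEN = 40  # assumed value, not taken from the real setup

def open_store(path=":memory:"):
    """Create the (hypothetical) table of seen records plus its unique index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " target TEXT NOT NULL,"
        " year INTEGER NOT NULL,"
        " serial TEXT NOT NULL,"  # ISSN or ISBN; '' when the source gives none
        " title TEXT NOT NULL)"
    )
    # Uniqueness on target, year, serial number and a prefix of the title.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS seen_unique ON seen"
        f" (target, year, serial, substr(title, 1, {TITLE_PREFIX_LEN}))"
    )
    return conn

def record_if_new(conn, target, year, serial, title):
    """Insert the record unless the unique index rejects it; True if it was new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (target, year, serial, title) VALUES (?, ?, ?, ?)",
        (target, year, serial or "", title),
    )
    conn.commit()
    return cur.rowcount == 1
```

Missing serial numbers are stored as empty strings rather than NULL, because SQLite treats NULLs as distinct from each other in a unique index, which would let such duplicates through.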
Avoiding bias
One major goal of this system is training, which more or less has to be incremental based on the data it sees coming by. This leads to a few very basic bias problems, and depending on decisions, some specific ones.
The most important problem is having duplicate information, in this case of even just very basic article metadata. If sources or searches are popular, the same records will turn up again and again. The only real way to avoid counting these disproportionately is to keep track of what records we have seen.
Representative selection
It is hard to take a representative selection from each source. This is one of the reasons you would use a learning system - you really can't do as well manually.
You basically want a distribution of words within a source. In this way, the keyword representation of a source will grow to be approximately correct, also because a wide selection of its articles becomes known.
Source size
You also want word distributions to compare well between sources.
This is a larger practical problem, because it is easy for sources with a lot of records to overpower the smaller ones.
The question is whether this is a feature or a bug - it is of course not an unmitigatedly bad thing that things that provide a lot of hits for a phrase turn up.
However, if a smaller source has fewer terms overall, but for a specific term has as many counts as a larger source, it is probably a specialized source, and you would want to show it. This last consideration is one for source selection time, but it does mean the data you collect would have to contain this information.
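One way to make that comparison concrete is to look at a term's share of everything seen from a source rather than its raw count. The following is only a sketch: the function names, the per-source `Counter` layout, and the numbers in the usage note are made up for illustration.

```python
from collections import Counter

def relative_frequency(term_counts, term):
    """Share of all term occurrences in one source that belong to `term`."""
    total = sum(term_counts.values())
    return term_counts.get(term, 0) / total if total else 0.0

def rank_sources(sources, term):
    """Source names ordered by the term's share within each source, highest first.

    `sources` maps a source name to a Counter of term counts for that source.
    """
    return sorted(
        sources,
        key=lambda name: relative_frequency(sources[name], term),
        reverse=True,
    )
```

With a large source holding 50 occurrences of a term among 100,000 and a small source holding 50 among 1,000, the small source ranks first even though the raw counts are equal. Note this needs the per-source totals to be stored alongside the counts.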
Source selection
Related is users' source selection, both automatic and manual.
Automatic source selection, even if immediately ideal for users, will mean some sources will always be searched more than others. The same goes for manual selection, particularly if, as in combine, the predefined sets are much, much easier to select.
The point is that if you learn from every search, you will favour popular sources. Again, the question is whether this is a feature or a bug; more information about larger and usual sources is a good thing, and it is more the possibility that popularity is misplaced that is a bug.
Even so, you do not want to leave sources behind. One possible equalizer here is to do every query in all sources.
There is still some downside; the automatic source selection will generally augment with general sources, and both that and users doing the same searches again (queries don't see an unbiased distribution, of course) will lead to double fetching, which would bias towards popular user sources as well as sources that were learned of first, mainly as a result of users.
But this is for the next section:
Fetched-record preference
As of this writing, the background fetcher fetches a few hundred records per search per source to base statistics on.
This would bias towards a few things, such as things fetched more than once, but this is avoided.
More importantly, it will bias towards result sorting, which theoretically is arbitrary, but usually favours recent records.
This is more problematic in larger databases, where a few hundred records may be only a week's worth of records, whereas in others this may span years. This date bias could be relieved somewhat by having the fetcher fetch records from all over the results, i.e. adaptive to result amount, but without fetching more records. This should lead to a better selection in the long term.
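Spreading the same fetch budget over the whole result set could be done along these lines. This is a hypothetical sketch: it assumes a source lets you request results from an arbitrary start offset and reports a total hit count, which will not hold for every source, and the function name and page size of 20 are invented for the example.

```python
def spread_offsets(total_hits, budget, page_size=20):
    """Start offsets of pages spread evenly over the results.

    Fetches roughly `budget` records in total, regardless of `total_hits`.
    """
    if total_hits <= 0 or budget <= 0:
        return []
    pages_wanted = max(1, budget // page_size)
    pages_available = -(-total_hits // page_size)  # ceiling division
    if pages_available <= pages_wanted:
        # Small result set: just fetch everything, front to back.
        return [page * page_size for page in range(pages_available)]
    # Large result set: take every step-th page so the sample spans all results.
    step = pages_available / pages_wanted
    return [int(i * step) * page_size for i in range(pages_wanted)]
```

For 10,000 hits and a budget of 200 records this yields ten pages starting at offsets 0, 1000, 2000 and so on, instead of the first 200 (most recent) records.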
The effect can also be branded a feature rather than a bug by arguing people usually want recent articles, and will search more specifically when they do not. More importantly, however, the effect will lessen with more diverse searches.
Search term focus
When a search is made, the fetched records should all include that word.
This is logical, but may mean that relatively rare terms will seem like they occur as much as more common terms that are not specifically searched for and only turn up incidentally. Like other things, this effect should lessen over time, quite simply through collocative use -- there are a *lot* of strings in the database by now. It is possible it is not a very large problem.
Metadata detail
Another problem is that sources differ in the amount and/or diversity of the keywords they return, which is a major field for the learning system, and hence could make some sources overpower others through just how detailed their records are.
This does not necessarily reflect source size, since some of the largest have no keyword fields, but neither do many of the smaller ones. These sources will be represented only by what appears in the record titles. This should be fairly descriptive, and there is little you can do when there is essentially little to no data to work with.
